[PATCH v4 0/3] Optimizations for khugepaged

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 0/3] Optimizations for khugepaged
@ 2025-07-24  5:22 Dev Jain
  2025-07-24  5:22 ` [PATCH v4 1/3] mm: add get_and_clear_ptes() and clear_ptes() Dev Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Dev Jain @ 2025-07-24  5:22 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel, Dev Jain

If the underlying folio mapped by the ptes is large, we can process those
ptes in a batch using folio_pte_batch().

For arm64 specifically, this results in a 16x reduction in the number of
ptep_get() calls, since on a contig block, ptep_get() on arm64 will iterate
through all 16 entries to collect a/d bits. Next, ptep_clear() will cause
a TLBI for every contig block in the range via contpte_try_unfold().
Instead, use clear_ptes() to only do the TLBI at the first and last
contig block of the range.

For split folios, there will be no pte batching; the batch size returned
by folio_pte_batch() will be 1. For pagetable split folios, the ptes will
still point to the same large folio; for arm64, this results in the
optimization described above, and for other arches, a minor improvement
is expected due to a reduction in the number of function calls and
batching atomic operations.

---
Rebased on today's mm-new - the v3 of this patchset was already in, so
I reverted those commits and then rebased on top of that.
mm-selftests pass.

v3->v4:
 - Use unsigned int for nr_ptes and max_nr_ptes (David)
 - Define the functions in patch 1 as inline functions with kernel docs
   instead of macros (akpm)

v2->v3:
 - Drop patch 3 (was merged separately)
 - Add patch 1 (David)
 - Coding style change, drop mapped_folio (Lorenzo)

v1->v2:
 - Use for loop instead of do-while loop (Lorenzo)
 - Remove folio_test_large check since the subpage-check condition
   will imply that (Baolin)
 - Combine patch 1 and 2 into this series, add new patch 3

David Hildenbrand (1):
  mm: add get_and_clear_ptes() and clear_ptes()

Dev Jain (2):
  khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE
    batching
  khugepaged: Optimize collapse_pte_mapped_thp() by PTE batching

 arch/arm64/mm/mmu.c     |  2 +-
 include/linux/pgtable.h | 45 ++++++++++++++++++++++++++++++++
 mm/khugepaged.c         | 58 +++++++++++++++++++++++++++--------------
 mm/mremap.c             |  2 +-
 mm/rmap.c               |  2 +-
 5 files changed, 87 insertions(+), 22 deletions(-)

-- 
2.30.2

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v4 1/3] mm: add get_and_clear_ptes() and clear_ptes()
  2025-07-24  5:22 [PATCH v4 0/3] Optimizations for khugepaged Dev Jain
@ 2025-07-24  5:22 ` Dev Jain
  2025-07-24  9:31   ` Barry Song
                     ` (2 more replies)
  2025-07-24  5:23 ` [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching Dev Jain
  2025-07-24  5:23 ` [PATCH v4 3/3] khugepaged: Optimize collapse_pte_mapped_thp() " Dev Jain
  2 siblings, 3 replies; 16+ messages in thread
From: Dev Jain @ 2025-07-24  5:22 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel, Dev Jain

From: David Hildenbrand <david@redhat.com>

Let's add variants to be used where "full" does not apply -- which will
be the majority of cases in the future. "full" really only applies if
we are about to tear down a full MM.

Use get_and_clear_ptes() in existing code, clear_ptes() users will
be added next.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 arch/arm64/mm/mmu.c     |  2 +-
 include/linux/pgtable.h | 45 +++++++++++++++++++++++++++++++++++++++++
 mm/mremap.c             |  2 +-
 mm/rmap.c               |  2 +-
 4 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index abd9725796e9..20a89ab97dc5 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1528,7 +1528,7 @@ early_initcall(prevent_bootmem_remove_init);
 pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t *ptep, unsigned int nr)
 {
-	pte_t pte = get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, /* full = */ 0);
+	pte_t pte = get_and_clear_ptes(vma->vm_mm, addr, ptep, nr);
 
 	if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
 		/*
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e3b99920be05..4c035637eeb7 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -736,6 +736,29 @@ static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
 }
 #endif
 
+/**
+ * get_and_clear_ptes - Clear present PTEs that map consecutive pages of
+ *			the same folio, collecting dirty/accessed bits.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ *
+ * Use this instead of get_and_clear_full_ptes() if it is known that we don't
+ * need to clear the full mm, which is mostly the case.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline pte_t get_and_clear_ptes(struct mm_struct *mm, unsigned long addr,
+		pte_t *ptep, unsigned int nr)
+{
+	return get_and_clear_full_ptes(mm, addr, ptep, nr, 0);
+}
+
 #ifndef clear_full_ptes
 /**
  * clear_full_ptes - Clear present PTEs that map consecutive pages of the same
@@ -768,6 +791,28 @@ static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
 }
 #endif
 
+/**
+ * clear_ptes - Clear present PTEs that map consecutive pages of the same folio.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ *
+ * Use this instead of clear_full_ptes() if it is known that we don't need to
+ * clear the full mm, which is mostly the case.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline void clear_ptes(struct mm_struct *mm, unsigned long addr,
+		pte_t *ptep, unsigned int nr)
+{
+	clear_full_ptes(mm, addr, ptep, nr, 0);
+}
+
 /*
  * If two threads concurrently fault at the same page, the thread that
  * won the race updates the PTE and its local TLB/Cache. The other thread
diff --git a/mm/mremap.c b/mm/mremap.c
index ac39845e9718..677a4d744df9 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -280,7 +280,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 							 old_pte, max_nr_ptes);
 			force_flush = true;
 		}
-		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
+		pte = get_and_clear_ptes(mm, old_addr, old_ptep, nr_ptes);
 		pte = move_pte(pte, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index f93ce27132ab..568198e9efc2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2036,7 +2036,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			flush_cache_range(vma, address, end_addr);
 
 			/* Nuke the page table entry. */
-			pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr_pages, 0);
+			pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
 			/*
 			 * We clear the PTE but do not flush so potentially
 			 * a remote CPU could still be writing to the folio.
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 1/3] mm: add get_and_clear_ptes() and clear_ptes()
  2025-07-24  5:22 ` [PATCH v4 1/3] mm: add get_and_clear_ptes() and clear_ptes() Dev Jain
@ 2025-07-24  9:31   ` Barry Song
  2025-07-24 17:17   ` Lorenzo Stoakes
  2025-07-29 14:30   ` Zi Yan
  2 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2025-07-24  9:31 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, linux-mm, linux-kernel

On Thu, Jul 24, 2025 at 1:23 PM Dev Jain <dev.jain@arm.com> wrote:
>
> From: David Hildenbrand <david@redhat.com>
>
> Let's add variants to be used where "full" does not apply -- which will
> be the majority of cases in the future. "full" really only applies if
> we are about to tear down a full MM.
>
> Use get_and_clear_ptes() in existing code, clear_ptes() users will
> be added next.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Reviewed-by: Barry Song <baohua@kernel.org>

Thanks
Barry


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 1/3] mm: add get_and_clear_ptes() and clear_ptes()
  2025-07-24  5:22 ` [PATCH v4 1/3] mm: add get_and_clear_ptes() and clear_ptes() Dev Jain
  2025-07-24  9:31   ` Barry Song
@ 2025-07-24 17:17   ` Lorenzo Stoakes
  2025-07-29 14:30   ` Zi Yan
  2 siblings, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 17:17 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	baohua, linux-mm, linux-kernel

On Thu, Jul 24, 2025 at 10:52:59AM +0530, Dev Jain wrote:
> From: David Hildenbrand <david@redhat.com>
>
> Let's add variants to be used where "full" does not apply -- which will
> be the majority of cases in the future. "full" really only applies if
> we are about to tear down a full MM.
>
> Use get_and_clear_ptes() in existing code, clear_ptes() users will
> be added next.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  arch/arm64/mm/mmu.c     |  2 +-
>  include/linux/pgtable.h | 45 +++++++++++++++++++++++++++++++++++++++++
>  mm/mremap.c             |  2 +-
>  mm/rmap.c               |  2 +-
>  4 files changed, 48 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index abd9725796e9..20a89ab97dc5 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1528,7 +1528,7 @@ early_initcall(prevent_bootmem_remove_init);
>  pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
>  			     pte_t *ptep, unsigned int nr)
>  {
> -	pte_t pte = get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, /* full = */ 0);
> +	pte_t pte = get_and_clear_ptes(vma->vm_mm, addr, ptep, nr);
>
>  	if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
>  		/*
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e3b99920be05..4c035637eeb7 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -736,6 +736,29 @@ static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
>  }
>  #endif
>
> +/**
> + * get_and_clear_ptes - Clear present PTEs that map consecutive pages of
> + *			the same folio, collecting dirty/accessed bits.
> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + *
> + * Use this instead of get_and_clear_full_ptes() if it is known that we don't
> + * need to clear the full mm, which is mostly the case.
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline pte_t get_and_clear_ptes(struct mm_struct *mm, unsigned long addr,
> +		pte_t *ptep, unsigned int nr)
> +{
> +	return get_and_clear_full_ptes(mm, addr, ptep, nr, 0);
> +}
> +
>  #ifndef clear_full_ptes
>  /**
>   * clear_full_ptes - Clear present PTEs that map consecutive pages of the same
> @@ -768,6 +791,28 @@ static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>  }
>  #endif
>
> +/**
> + * clear_ptes - Clear present PTEs that map consecutive pages of the same folio.
> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + *
> + * Use this instead of clear_full_ptes() if it is known that we don't need to
> + * clear the full mm, which is mostly the case.
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline void clear_ptes(struct mm_struct *mm, unsigned long addr,
> +		pte_t *ptep, unsigned int nr)
> +{
> +	clear_full_ptes(mm, addr, ptep, nr, 0);
> +}
> +
>  /*
>   * If two threads concurrently fault at the same page, the thread that
>   * won the race updates the PTE and its local TLB/Cache. The other thread
> diff --git a/mm/mremap.c b/mm/mremap.c
> index ac39845e9718..677a4d744df9 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -280,7 +280,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  							 old_pte, max_nr_ptes);
>  			force_flush = true;
>  		}
> -		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
> +		pte = get_and_clear_ptes(mm, old_addr, old_ptep, nr_ptes);
>  		pte = move_pte(pte, old_addr, new_addr);
>  		pte = move_soft_dirty_pte(pte);
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f93ce27132ab..568198e9efc2 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2036,7 +2036,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			flush_cache_range(vma, address, end_addr);
>
>  			/* Nuke the page table entry. */
> -			pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr_pages, 0);
> +			pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
>  			/*
>  			 * We clear the PTE but do not flush so potentially
>  			 * a remote CPU could still be writing to the folio.
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 1/3] mm: add get_and_clear_ptes() and clear_ptes()
  2025-07-24  5:22 ` [PATCH v4 1/3] mm: add get_and_clear_ptes() and clear_ptes() Dev Jain
  2025-07-24  9:31   ` Barry Song
  2025-07-24 17:17   ` Lorenzo Stoakes
@ 2025-07-29 14:30   ` Zi Yan
  2 siblings, 0 replies; 16+ messages in thread
From: Zi Yan @ 2025-07-29 14:30 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel

On 24 Jul 2025, at 1:22, Dev Jain wrote:

> From: David Hildenbrand <david@redhat.com>
>
> Let's add variants to be used where "full" does not apply -- which will
> be the majority of cases in the future. "full" really only applies if
> we are about to tear down a full MM.
>
> Use get_and_clear_ptes() in existing code, clear_ptes() users will
> be added next.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  arch/arm64/mm/mmu.c     |  2 +-
>  include/linux/pgtable.h | 45 +++++++++++++++++++++++++++++++++++++++++
>  mm/mremap.c             |  2 +-
>  mm/rmap.c               |  2 +-
>  4 files changed, 48 insertions(+), 3 deletions(-)
>
Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching
  2025-07-24  5:22 [PATCH v4 0/3] Optimizations for khugepaged Dev Jain
  2025-07-24  5:22 ` [PATCH v4 1/3] mm: add get_and_clear_ptes() and clear_ptes() Dev Jain
@ 2025-07-24  5:23 ` Dev Jain
  2025-07-24 17:55   ` Lorenzo Stoakes
  2025-07-29 14:38   ` Zi Yan
  2025-07-24  5:23 ` [PATCH v4 3/3] khugepaged: Optimize collapse_pte_mapped_thp() " Dev Jain
  2 siblings, 2 replies; 16+ messages in thread
From: Dev Jain @ 2025-07-24  5:23 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel, Dev Jain

Use PTE batching to batch process PTEs mapping the same large folio. An
improvement is expected due to batching refcount-mapcount manipulation on
the folios, and for arm64 which supports contig mappings, the number of
TLB flushes is also reduced.

Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/khugepaged.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a55fb1dcd224..f23e943506bc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -700,12 +700,15 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 						spinlock_t *ptl,
 						struct list_head *compound_pagelist)
 {
+	unsigned long end = address + HPAGE_PMD_SIZE;
 	struct folio *src, *tmp;
-	pte_t *_pte;
 	pte_t pteval;
+	pte_t *_pte;
+	unsigned int nr_ptes;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, address += PAGE_SIZE) {
+	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
+	     address += nr_ptes * PAGE_SIZE) {
+		nr_ptes = 1;
 		pteval = ptep_get(_pte);
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
@@ -722,18 +725,26 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 			struct page *src_page = pte_page(pteval);
 
 			src = page_folio(src_page);
-			if (!folio_test_large(src))
+
+			if (folio_test_large(src)) {
+				unsigned int max_nr_ptes = (end - address) >> PAGE_SHIFT;
+
+				nr_ptes = folio_pte_batch(src, _pte, pteval, max_nr_ptes);
+			} else {
 				release_pte_folio(src);
+			}
+
 			/*
 			 * ptl mostly unnecessary, but preempt has to
 			 * be disabled to update the per-cpu stats
 			 * inside folio_remove_rmap_pte().
 			 */
 			spin_lock(ptl);
-			ptep_clear(vma->vm_mm, address, _pte);
-			folio_remove_rmap_pte(src, src_page, vma);
+			clear_ptes(vma->vm_mm, address, _pte, nr_ptes);
+			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
 			spin_unlock(ptl);
-			free_folio_and_swap_cache(src);
+			free_swap_cache(src);
+			folio_put_refs(src, nr_ptes);
 		}
 	}
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching
  2025-07-24  5:23 ` [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching Dev Jain
@ 2025-07-24 17:55   ` Lorenzo Stoakes
  2025-07-24 17:57     ` David Hildenbrand
  2025-07-24 18:02     ` Lorenzo Stoakes
  2025-07-29 14:38   ` Zi Yan
  1 sibling, 2 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 17:55 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	baohua, linux-mm, linux-kernel

Trying this again as my mail client apparently messed this up:


NIT: Please don't capitalise 'Optimize' here.

I think Andrew fixed this for you actually in the repo though :P

On Thu, Jul 24, 2025 at 10:53:00AM +0530, Dev Jain wrote:
> Use PTE batching to batch process PTEs mapping the same large folio. An
> improvement is expected due to batching refcount-mapcount manipulation on
> the folios, and for arm64 which supports contig mappings, the number of
> TLB flushes is also reduced.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/khugepaged.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a55fb1dcd224..f23e943506bc 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -700,12 +700,15 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  						spinlock_t *ptl,
>  						struct list_head *compound_pagelist)
>  {
> +	unsigned long end = address + HPAGE_PMD_SIZE;
>  	struct folio *src, *tmp;
> -	pte_t *_pte;
>  	pte_t pteval;
> +	pte_t *_pte;
> +	unsigned int nr_ptes;
>
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, address += PAGE_SIZE) {
> +	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> +	     address += nr_ptes * PAGE_SIZE) {
> +		nr_ptes = 1;
>  		pteval = ptep_get(_pte);
>  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>  			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> @@ -722,18 +725,26 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  			struct page *src_page = pte_page(pteval);
>
>  			src = page_folio(src_page);
> -			if (!folio_test_large(src))
> +
> +			if (folio_test_large(src)) {
> +				unsigned int max_nr_ptes = (end - address) >> PAGE_SHIFT;
> +
> +				nr_ptes = folio_pte_batch(src, _pte, pteval, max_nr_ptes);
> +			} else {
>  				release_pte_folio(src);
> +			}
> +
>  			/*
>  			 * ptl mostly unnecessary, but preempt has to
>  			 * be disabled to update the per-cpu stats
>  			 * inside folio_remove_rmap_pte().
>  			 */
>  			spin_lock(ptl);
> -			ptep_clear(vma->vm_mm, address, _pte);
> -			folio_remove_rmap_pte(src, src_page, vma);
> +			clear_ptes(vma->vm_mm, address, _pte, nr_ptes);
> +			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
>  			spin_unlock(ptl);
> -			free_folio_and_swap_cache(src);
> +			free_swap_cache(src);
> +			folio_put_refs(src, nr_ptes);

Hm one thing here though is the free_folio_and_swap_cache() does:

        free_swap_cache(folio);
        if (!is_huge_zero_folio(folio))
                folio_put(folio);

Whereas here you unconditionally reduce the reference count. Might this
cause issues with the shrinker version of the huge zero folio?

Should this be:

                        if (!is_huge_zero_folio(src))
                                folio_put_refs(src, nr_ptes);

Or do we otherwise avoid issues with this?


>  		}
>  	}
>
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching
  2025-07-24 17:55   ` Lorenzo Stoakes
@ 2025-07-24 17:57     ` David Hildenbrand
  2025-07-24 18:01       ` Lorenzo Stoakes
  2025-07-24 18:02     ` Lorenzo Stoakes
  1 sibling, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2025-07-24 17:57 UTC (permalink / raw)
  To: Lorenzo Stoakes, Dev Jain
  Cc: akpm, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	baohua, linux-mm, linux-kernel


>> +			if (folio_test_large(src)) {
>> +				unsigned int max_nr_ptes = (end - address) >> PAGE_SHIFT;
>> +
>> +				nr_ptes = folio_pte_batch(src, _pte, pteval, max_nr_ptes);
>> +			} else {
>>   				release_pte_folio(src);
>> +			}
>> +
>>   			/*
>>   			 * ptl mostly unnecessary, but preempt has to
>>   			 * be disabled to update the per-cpu stats
>>   			 * inside folio_remove_rmap_pte().
>>   			 */
>>   			spin_lock(ptl);
>> -			ptep_clear(vma->vm_mm, address, _pte);
>> -			folio_remove_rmap_pte(src, src_page, vma);
>> +			clear_ptes(vma->vm_mm, address, _pte, nr_ptes);
>> +			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
>>   			spin_unlock(ptl);
>> -			free_folio_and_swap_cache(src);
>> +			free_swap_cache(src);
>> +			folio_put_refs(src, nr_ptes);
> 
> Hm one thing here though is the free_folio_and_swap_cache() does:
> 
>          free_swap_cache(folio);
>          if (!is_huge_zero_folio(folio))
>                  folio_put(folio);
> 
> Whereas here you unconditionally reduce the reference count. Might this
> cause issues with the shrinker version of the huge zero folio?
> 
> Should this be:
> 
>                          if (!is_huge_zero_folio(src))
>                                  folio_put_refs(src, nr_ptes);
> 
> Or do we otherwise avoid issues with this?

(resending my reply)

The huge zero folio is never PTE-mapped.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching
  2025-07-24 17:57     ` David Hildenbrand
@ 2025-07-24 18:01       ` Lorenzo Stoakes
  0 siblings, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 18:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dev Jain, akpm, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel

On Thu, Jul 24, 2025 at 07:57:22PM +0200, David Hildenbrand wrote:
>
> > > +			if (folio_test_large(src)) {
> > > +				unsigned int max_nr_ptes = (end - address) >> PAGE_SHIFT;
> > > +
> > > +				nr_ptes = folio_pte_batch(src, _pte, pteval, max_nr_ptes);
> > > +			} else {
> > >   				release_pte_folio(src);
> > > +			}
> > > +
> > >   			/*
> > >   			 * ptl mostly unnecessary, but preempt has to
> > >   			 * be disabled to update the per-cpu stats
> > >   			 * inside folio_remove_rmap_pte().
> > >   			 */
> > >   			spin_lock(ptl);
> > > -			ptep_clear(vma->vm_mm, address, _pte);
> > > -			folio_remove_rmap_pte(src, src_page, vma);
> > > +			clear_ptes(vma->vm_mm, address, _pte, nr_ptes);
> > > +			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
> > >   			spin_unlock(ptl);
> > > -			free_folio_and_swap_cache(src);
> > > +			free_swap_cache(src);
> > > +			folio_put_refs(src, nr_ptes);
> >
> > Hm one thing here though is the free_folio_and_swap_cache() does:
> >
> >          free_swap_cache(folio);
> >          if (!is_huge_zero_folio(folio))
> >                  folio_put(folio);
> >
> > Whereas here you unconditionally reduce the reference count. Might this
> > cause issues with the shrinker version of the huge zero folio?
> >
> > Should this be:
> >
> >                          if (!is_huge_zero_folio(src))
> >                                  folio_put_refs(src, nr_ptes);
> >
> > Or do we otherwise avoid issues with this?
>
> (resending my reply)
>
> The huge zero folio is never PTE-mapped.

OK fine, as mentioned off-list I hate this kind of 'implicit' knowledge, and you
pointed out that really we should be using vm_normal_page() or equivalent in
this code. One to address at some point :)

Anyway with this concern addressed, the patch is fine, will send tag...

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching
  2025-07-24 17:55   ` Lorenzo Stoakes
  2025-07-24 17:57     ` David Hildenbrand
@ 2025-07-24 18:02     ` Lorenzo Stoakes
  1 sibling, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 18:02 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	baohua, linux-mm, linux-kernel

On Thu, Jul 24, 2025 at 06:55:56PM +0100, Lorenzo Stoakes wrote:
> Trying this again as my mail client apparently messed this up:
>
>
> NIT: Please don't capitalise 'Optimize' here.
>
> I think Andrew fixed this for you actually in the repo though :P
>
> On Thu, Jul 24, 2025 at 10:53:00AM +0530, Dev Jain wrote:
> > Use PTE batching to batch process PTEs mapping the same large folio. An
> > improvement is expected due to batching refcount-mapcount manipulation on
> > the folios, and for arm64 which supports contig mappings, the number of
> > TLB flushes is also reduced.
> >
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Dev Jain <dev.jain@arm.com>

With the concern I raised addressed by David, this LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> > ---
> >  mm/khugepaged.c | 25 ++++++++++++++++++-------
> >  1 file changed, 18 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index a55fb1dcd224..f23e943506bc 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -700,12 +700,15 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >  						spinlock_t *ptl,
> >  						struct list_head *compound_pagelist)
> >  {
> > +	unsigned long end = address + HPAGE_PMD_SIZE;
> >  	struct folio *src, *tmp;
> > -	pte_t *_pte;
> >  	pte_t pteval;
> > +	pte_t *_pte;
> > +	unsigned int nr_ptes;
> >
> > -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -	     _pte++, address += PAGE_SIZE) {
> > +	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> > +	     address += nr_ptes * PAGE_SIZE) {
> > +		nr_ptes = 1;
> >  		pteval = ptep_get(_pte);
> >  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> >  			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > @@ -722,18 +725,26 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >  			struct page *src_page = pte_page(pteval);
> >
> >  			src = page_folio(src_page);
> > -			if (!folio_test_large(src))
> > +
> > +			if (folio_test_large(src)) {
> > +				unsigned int max_nr_ptes = (end - address) >> PAGE_SHIFT;
> > +
> > +				nr_ptes = folio_pte_batch(src, _pte, pteval, max_nr_ptes);
> > +			} else {
> >  				release_pte_folio(src);
> > +			}
> > +
> >  			/*
> >  			 * ptl mostly unnecessary, but preempt has to
> >  			 * be disabled to update the per-cpu stats
> >  			 * inside folio_remove_rmap_pte().
> >  			 */
> >  			spin_lock(ptl);
> > -			ptep_clear(vma->vm_mm, address, _pte);
> > -			folio_remove_rmap_pte(src, src_page, vma);
> > +			clear_ptes(vma->vm_mm, address, _pte, nr_ptes);
> > +			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
> >  			spin_unlock(ptl);
> > -			free_folio_and_swap_cache(src);
> > +			free_swap_cache(src);
> > +			folio_put_refs(src, nr_ptes);
>
> Hm one thing here though is the free_folio_and_swap_cache() does:
>
>         free_swap_cache(folio);
>         if (!is_huge_zero_folio(folio))
>                 folio_put(folio);
>
> Whereas here you unconditionally reduce the reference count. Might this
> cause issues with the shrinker version of the huge zero folio?
>
> Should this be:
>
>                         if (!is_huge_zero_folio(src))
>                                 folio_put_refs(src, nr_ptes);
>
> Or do we otherwise avoid issues with this?
>
>
> >  		}
> >  	}
> >
> > --
> > 2.30.2
> >
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching
  2025-07-24  5:23 ` [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching Dev Jain
  2025-07-24 17:55   ` Lorenzo Stoakes
@ 2025-07-29 14:38   ` Zi Yan
  1 sibling, 0 replies; 16+ messages in thread
From: Zi Yan @ 2025-07-29 14:38 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel

On 24 Jul 2025, at 1:23, Dev Jain wrote:

> Use PTE batching to batch process PTEs mapping the same large folio. An
> improvement is expected due to batching refcount-mapcount manipulation on
> the folios, and for arm64 which supports contig mappings, the number of
> TLB flushes is also reduced.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/khugepaged.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)
>
Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v4 3/3] khugepaged: Optimize collapse_pte_mapped_thp() by PTE batching
  2025-07-24  5:22 [PATCH v4 0/3] Optimizations for khugepaged Dev Jain
  2025-07-24  5:22 ` [PATCH v4 1/3] mm: add get_and_clear_ptes() and clear_ptes() Dev Jain
  2025-07-24  5:23 ` [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching Dev Jain
@ 2025-07-24  5:23 ` Dev Jain
  2025-07-24 18:07   ` Lorenzo Stoakes
  2025-07-29 14:41   ` Zi Yan
  2 siblings, 2 replies; 16+ messages in thread
From: Dev Jain @ 2025-07-24  5:23 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel, Dev Jain

Use PTE batching to batch process PTEs mapping the same large folio. An
improvement is expected due to batching mapcount manipulation on the
folios, and for arm64 which supports contig mappings, the number of
TLB flushes is also reduced.

Note that we do not need to make a change to the check
"if (folio_page(folio, i) != page)"; if i'th page of the folio is equal
to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1
pages of the folio will be equal to the corresponding pages of our
batch mapping consecutive pages.

Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/khugepaged.c | 33 +++++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f23e943506bc..374a6a5193a7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1503,15 +1503,17 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 			    bool install_pmd)
 {
+	int nr_mapped_ptes = 0, result = SCAN_FAIL;
+	unsigned int nr_batch_ptes;
 	struct mmu_notifier_range range;
 	bool notified = false;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	unsigned long end = haddr + HPAGE_PMD_SIZE;
 	struct vm_area_struct *vma = vma_lookup(mm, haddr);
 	struct folio *folio;
 	pte_t *start_pte, *pte;
 	pmd_t *pmd, pgt_pmd;
 	spinlock_t *pml = NULL, *ptl;
-	int nr_ptes = 0, result = SCAN_FAIL;
 	int i;
 
 	mmap_assert_locked(mm);
@@ -1625,11 +1627,15 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		goto abort;
 
 	/* step 2: clear page table and adjust rmap */
-	for (i = 0, addr = haddr, pte = start_pte;
-	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
+	for (i = 0, addr = haddr, pte = start_pte; i < HPAGE_PMD_NR;
+	     i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
+	     pte += nr_batch_ptes) {
+		unsigned int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
 		struct page *page;
 		pte_t ptent = ptep_get(pte);
 
+		nr_batch_ptes = 1;
+
 		if (pte_none(ptent))
 			continue;
 		/*
@@ -1643,26 +1649,29 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 			goto abort;
 		}
 		page = vm_normal_page(vma, addr, ptent);
+
 		if (folio_page(folio, i) != page)
 			goto abort;
 
+		nr_batch_ptes = folio_pte_batch(folio, pte, ptent, max_nr_batch_ptes);
+
 		/*
 		 * Must clear entry, or a racing truncate may re-remove it.
 		 * TLB flush can be left until pmdp_collapse_flush() does it.
 		 * PTE dirty? Shmem page is already dirty; file is read-only.
 		 */
-		ptep_clear(mm, addr, pte);
-		folio_remove_rmap_pte(folio, page, vma);
-		nr_ptes++;
+		clear_ptes(mm, addr, pte, nr_batch_ptes);
+		folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
+		nr_mapped_ptes += nr_batch_ptes;
 	}
 
 	if (!pml)
 		spin_unlock(ptl);
 
 	/* step 3: set proper refcount and mm_counters. */
-	if (nr_ptes) {
-		folio_ref_sub(folio, nr_ptes);
-		add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
+	if (nr_mapped_ptes) {
+		folio_ref_sub(folio, nr_mapped_ptes);
+		add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
 	}
 
 	/* step 4: remove empty page table */
@@ -1695,10 +1704,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 			: SCAN_SUCCEED;
 	goto drop_folio;
 abort:
-	if (nr_ptes) {
+	if (nr_mapped_ptes) {
 		flush_tlb_mm(mm);
-		folio_ref_sub(folio, nr_ptes);
-		add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
+		folio_ref_sub(folio, nr_mapped_ptes);
+		add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
 	}
 unlock:
 	if (start_pte)
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 3/3] khugepaged: Optimize collapse_pte_mapped_thp() by PTE batching
  2025-07-24  5:23 ` [PATCH v4 3/3] khugepaged: Optimize collapse_pte_mapped_thp() " Dev Jain
@ 2025-07-24 18:07   ` Lorenzo Stoakes
  2025-07-29 14:41   ` Zi Yan
  1 sibling, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 18:07 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	baohua, linux-mm, linux-kernel

On Thu, Jul 24, 2025 at 10:53:01AM +0530, Dev Jain wrote:
> Use PTE batching to batch process PTEs mapping the same large folio. An
> improvement is expected due to batching mapcount manipulation on the
> folios, and for arm64 which supports contig mappings, the number of
> TLB flushes is also reduced.
>
> Note that we do not need to make a change to the check
> "if (folio_page(folio, i) != page)"; if i'th page of the folio is equal
> to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1
> pages of the folio will be equal to the corresponding pages of our
> batch mapping consecutive pages.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/khugepaged.c | 33 +++++++++++++++++++++------------
>  1 file changed, 21 insertions(+), 12 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index f23e943506bc..374a6a5193a7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1503,15 +1503,17 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>  int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  			    bool install_pmd)
>  {
> +	int nr_mapped_ptes = 0, result = SCAN_FAIL;
> +	unsigned int nr_batch_ptes;
>  	struct mmu_notifier_range range;
>  	bool notified = false;
>  	unsigned long haddr = addr & HPAGE_PMD_MASK;
> +	unsigned long end = haddr + HPAGE_PMD_SIZE;
>  	struct vm_area_struct *vma = vma_lookup(mm, haddr);
>  	struct folio *folio;
>  	pte_t *start_pte, *pte;
>  	pmd_t *pmd, pgt_pmd;
>  	spinlock_t *pml = NULL, *ptl;
> -	int nr_ptes = 0, result = SCAN_FAIL;
>  	int i;
>
>  	mmap_assert_locked(mm);
> @@ -1625,11 +1627,15 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  		goto abort;
>
>  	/* step 2: clear page table and adjust rmap */
> -	for (i = 0, addr = haddr, pte = start_pte;
> -	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> +	for (i = 0, addr = haddr, pte = start_pte; i < HPAGE_PMD_NR;
> +	     i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
> +	     pte += nr_batch_ptes) {
> +		unsigned int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
>  		struct page *page;
>  		pte_t ptent = ptep_get(pte);
>
> +		nr_batch_ptes = 1;
> +
>  		if (pte_none(ptent))
>  			continue;
>  		/*
> @@ -1643,26 +1649,29 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  			goto abort;
>  		}
>  		page = vm_normal_page(vma, addr, ptent);
> +
>  		if (folio_page(folio, i) != page)
>  			goto abort;
>
> +		nr_batch_ptes = folio_pte_batch(folio, pte, ptent, max_nr_batch_ptes);
> +
>  		/*
>  		 * Must clear entry, or a racing truncate may re-remove it.
>  		 * TLB flush can be left until pmdp_collapse_flush() does it.
>  		 * PTE dirty? Shmem page is already dirty; file is read-only.
>  		 */
> -		ptep_clear(mm, addr, pte);
> -		folio_remove_rmap_pte(folio, page, vma);
> -		nr_ptes++;
> +		clear_ptes(mm, addr, pte, nr_batch_ptes);
> +		folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
> +		nr_mapped_ptes += nr_batch_ptes;
>  	}
>
>  	if (!pml)
>  		spin_unlock(ptl);
>
>  	/* step 3: set proper refcount and mm_counters. */
> -	if (nr_ptes) {
> -		folio_ref_sub(folio, nr_ptes);
> -		add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
> +	if (nr_mapped_ptes) {
> +		folio_ref_sub(folio, nr_mapped_ptes);
> +		add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>  	}
>
>  	/* step 4: remove empty page table */
> @@ -1695,10 +1704,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  			: SCAN_SUCCEED;
>  	goto drop_folio;
>  abort:
> -	if (nr_ptes) {
> +	if (nr_mapped_ptes) {
>  		flush_tlb_mm(mm);
> -		folio_ref_sub(folio, nr_ptes);
> -		add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
> +		folio_ref_sub(folio, nr_mapped_ptes);
> +		add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>  	}
>  unlock:
>  	if (start_pte)
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 3/3] khugepaged: Optimize collapse_pte_mapped_thp() by PTE batching
  2025-07-24  5:23 ` [PATCH v4 3/3] khugepaged: Optimize collapse_pte_mapped_thp() " Dev Jain
  2025-07-24 18:07   ` Lorenzo Stoakes
@ 2025-07-29 14:41   ` Zi Yan
  1 sibling, 0 replies; 16+ messages in thread
From: Zi Yan @ 2025-07-29 14:41 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel

On 24 Jul 2025, at 1:23, Dev Jain wrote:

> Use PTE batching to batch process PTEs mapping the same large folio. An
> improvement is expected due to batching mapcount manipulation on the
> folios, and for arm64 which supports contig mappings, the number of
> TLB flushes is also reduced.
>
> Note that we do not need to make a change to the check
> "if (folio_page(folio, i) != page)"; if i'th page of the folio is equal
> to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1
> pages of the folio will be equal to the corresponding pages of our
> batch mapping consecutive pages.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/khugepaged.c | 33 +++++++++++++++++++++------------
>  1 file changed, 21 insertions(+), 12 deletions(-)
>
Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching
@ 2025-07-24 17:32 Lorenzo Stoakes
  2025-07-24 17:40 ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 17:32 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	baohua, linux-mm, linux-kernel

Message-ID: <32843cfb-a70b-4dfb-965c-4e1b0623a1b4@lucifer.local>
Reply-To:
In-Reply-To: <20250724052301.23844-3-dev.jain@arm.com>

NIT: Please don't capitalise 'Optimize' here.

I think Andrew fixed this for you actually in the repo though :P

On Thu, Jul 24, 2025 at 10:53:00AM +0530, Dev Jain wrote:
> Use PTE batching to batch process PTEs mapping the same large folio. An
> improvement is expected due to batching refcount-mapcount manipulation on
> the folios, and for arm64 which supports contig mappings, the number of
> TLB flushes is also reduced.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/khugepaged.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a55fb1dcd224..f23e943506bc 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -700,12 +700,15 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  						spinlock_t *ptl,
>  						struct list_head *compound_pagelist)
>  {
> +	unsigned long end = address + HPAGE_PMD_SIZE;
>  	struct folio *src, *tmp;
> -	pte_t *_pte;
>  	pte_t pteval;
> +	pte_t *_pte;
> +	unsigned int nr_ptes;
>
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, address += PAGE_SIZE) {
> +	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> +	     address += nr_ptes * PAGE_SIZE) {
> +		nr_ptes = 1;
>  		pteval = ptep_get(_pte);
>  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>  			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> @@ -722,18 +725,26 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  			struct page *src_page = pte_page(pteval);
>
>  			src = page_folio(src_page);
> -			if (!folio_test_large(src))
> +
> +			if (folio_test_large(src)) {
> +				unsigned int max_nr_ptes = (end - address) >> PAGE_SHIFT;
> +
> +				nr_ptes = folio_pte_batch(src, _pte, pteval, max_nr_ptes);
> +			} else {
>  				release_pte_folio(src);
> +			}
> +
>  			/*
>  			 * ptl mostly unnecessary, but preempt has to
>  			 * be disabled to update the per-cpu stats
>  			 * inside folio_remove_rmap_pte().
>  			 */
>  			spin_lock(ptl);
> -			ptep_clear(vma->vm_mm, address, _pte);
> -			folio_remove_rmap_pte(src, src_page, vma);
> +			clear_ptes(vma->vm_mm, address, _pte, nr_ptes);
> +			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
>  			spin_unlock(ptl);
> -			free_folio_and_swap_cache(src);
> +			free_swap_cache(src);
> +			folio_put_refs(src, nr_ptes);

Hm one thing here though is the free_folio_and_swap_cache() does:

	free_swap_cache(folio);
	if (!is_huge_zero_folio(folio))
		folio_put(folio);

Whereas here you unconditionally reduce the reference count. Might this
cause issues with the shrinker version of the huge zero folio?

Should this be:

			if (!is_huge_zero_folio(src))
				folio_put_refs(src, nr_ptes);

Or do we otherwise avoid issues with this?

>  		}
>  	}
>
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching
  2025-07-24 17:32 [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() " Lorenzo Stoakes
@ 2025-07-24 17:40 ` David Hildenbrand
  0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2025-07-24 17:40 UTC (permalink / raw)
  To: Lorenzo Stoakes, Dev Jain
  Cc: akpm, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	baohua, linux-mm, linux-kernel

On 24.07.25 19:32, Lorenzo Stoakes wrote:
> Message-ID: <32843cfb-a70b-4dfb-965c-4e1b0623a1b4@lucifer.local>
> Reply-To:
> In-Reply-To: <20250724052301.23844-3-dev.jain@arm.com>
> 
> NIT: Please don't capitalise 'Optimize' here.
> 
> I think Andrew fixed this for you actually in the repo though :P
> 
> On Thu, Jul 24, 2025 at 10:53:00AM +0530, Dev Jain wrote:
>> Use PTE batching to batch process PTEs mapping the same large folio. An
>> improvement is expected due to batching refcount-mapcount manipulation on
>> the folios, and for arm64 which supports contig mappings, the number of
>> TLB flushes is also reduced.
>>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/khugepaged.c | 25 ++++++++++++++++++-------
>>   1 file changed, 18 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index a55fb1dcd224..f23e943506bc 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -700,12 +700,15 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>   						spinlock_t *ptl,
>>   						struct list_head *compound_pagelist)
>>   {
>> +	unsigned long end = address + HPAGE_PMD_SIZE;
>>   	struct folio *src, *tmp;
>> -	pte_t *_pte;
>>   	pte_t pteval;
>> +	pte_t *_pte;
>> +	unsigned int nr_ptes;
>>
>> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> -	     _pte++, address += PAGE_SIZE) {
>> +	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
>> +	     address += nr_ptes * PAGE_SIZE) {
>> +		nr_ptes = 1;
>>   		pteval = ptep_get(_pte);
>>   		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>   			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>> @@ -722,18 +725,26 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>   			struct page *src_page = pte_page(pteval);
>>
>>   			src = page_folio(src_page);
>> -			if (!folio_test_large(src))
>> +
>> +			if (folio_test_large(src)) {
>> +				unsigned int max_nr_ptes = (end - address) >> PAGE_SHIFT;
>> +
>> +				nr_ptes = folio_pte_batch(src, _pte, pteval, max_nr_ptes);
>> +			} else {
>>   				release_pte_folio(src);
>> +			}
>> +
>>   			/*
>>   			 * ptl mostly unnecessary, but preempt has to
>>   			 * be disabled to update the per-cpu stats
>>   			 * inside folio_remove_rmap_pte().
>>   			 */
>>   			spin_lock(ptl);
>> -			ptep_clear(vma->vm_mm, address, _pte);
>> -			folio_remove_rmap_pte(src, src_page, vma);
>> +			clear_ptes(vma->vm_mm, address, _pte, nr_ptes);
>> +			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
>>   			spin_unlock(ptl);
>> -			free_folio_and_swap_cache(src);
>> +			free_swap_cache(src);
>> +			folio_put_refs(src, nr_ptes);
> 
> Hm one thing here though is the free_folio_and_swap_cache() does:
> 
> 	free_swap_cache(folio);
> 	if (!is_huge_zero_folio(folio))
> 		folio_put(folio);
> 
> Whereas here you unconditionally reduce the reference count. Might this
> cause issues with the shrinker version of the huge zero folio?
> 
> Should this be:
> 
> 			if (!is_huge_zero_folio(src))
> 				folio_put_refs(src, nr_ptes);
> 
> Or do we otherwise avoid issues with this?

The huge zero folio is never PTE-mapped.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-07-29 14:41 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-24  5:22 [PATCH v4 0/3] Optimizations for khugepaged Dev Jain
2025-07-24  5:22 ` [PATCH v4 1/3] mm: add get_and_clear_ptes() and clear_ptes() Dev Jain
2025-07-24  9:31   ` Barry Song
2025-07-24 17:17   ` Lorenzo Stoakes
2025-07-29 14:30   ` Zi Yan
2025-07-24  5:23 ` [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching Dev Jain
2025-07-24 17:55   ` Lorenzo Stoakes
2025-07-24 17:57     ` David Hildenbrand
2025-07-24 18:01       ` Lorenzo Stoakes
2025-07-24 18:02     ` Lorenzo Stoakes
2025-07-29 14:38   ` Zi Yan
2025-07-24  5:23 ` [PATCH v4 3/3] khugepaged: Optimize collapse_pte_mapped_thp() " Dev Jain
2025-07-24 18:07   ` Lorenzo Stoakes
2025-07-29 14:41   ` Zi Yan
  -- strict thread matches above, loose matches on Subject: below --
2025-07-24 17:32 [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() " Lorenzo Stoakes
2025-07-24 17:40 ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).