[PATCH 0/3] Optimize mremap() by PTE-batching

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] Optimize mremap() by PTE-batching
@ 2025-05-06  5:00 Dev Jain
  2025-05-06  5:00 ` [PATCH 1/3] mm: Call pointers to ptes as ptep Dev Jain
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Dev Jain @ 2025-05-06  5:00 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, Dev Jain

Use PTE batching to optimize mremap().

Mapping 512K of memory, memsetting it, remapping it to src + 512K, and
munmapping it 10,000 times, the average execution time reduces from 1.9 to
1.2 seconds, giving a 37% performance optimization. (Apple M3)

Dev Jain (3):
  mm: Call pointers to ptes as ptep
  mm: Add generic helper to hint a large folio
  mm: Optimize mremap() by PTE batching

 include/linux/pgtable.h | 16 +++++++++++++++
 mm/mremap.c             | 44 +++++++++++++++++++++++++++--------------
 2 files changed, 45 insertions(+), 15 deletions(-)

-- 
2.30.2



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/3] mm: Call pointers to ptes as ptep
  2025-05-06  5:00 [PATCH 0/3] Optimize mremap() by PTE-batching Dev Jain
@ 2025-05-06  5:00 ` Dev Jain
  2025-05-06  8:50   ` Anshuman Khandual
  2025-05-06 10:52   ` Lorenzo Stoakes
  2025-05-06  5:00 ` [PATCH 2/3] mm: Add generic helper to hint a large folio Dev Jain
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 27+ messages in thread
From: Dev Jain @ 2025-05-06  5:00 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, Dev Jain

Avoid confusion between pte_t* and pte_t data types by appending pointer
type variables by p. No functional change.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mremap.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 7db9da609c84..1a08a7c3b92f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -176,7 +176,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	struct vm_area_struct *vma = pmc->old;
 	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *old_pte, *new_pte, pte;
+	pte_t *old_ptep, *new_ptep, pte;
 	pmd_t dummy_pmdval;
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
@@ -211,8 +211,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	 * We don't have to worry about the ordering of src and dst
 	 * pte locks because exclusive mmap_lock prevents deadlock.
 	 */
-	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
-	if (!old_pte) {
+	old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
+	if (!old_ptep) {
 		err = -EAGAIN;
 		goto out;
 	}
@@ -223,10 +223,10 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	 * mmap_lock, so this new_pte page is stable, so there is no need to get
 	 * pmdval and do pmd_same() check.
 	 */
-	new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
+	new_ptep = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
 					   &new_ptl);
-	if (!new_pte) {
-		pte_unmap_unlock(old_pte, old_ptl);
+	if (!new_ptep) {
+		pte_unmap_unlock(old_ptep, old_ptl);
 		err = -EAGAIN;
 		goto out;
 	}
@@ -235,12 +235,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
-	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
-				   new_pte++, new_addr += PAGE_SIZE) {
-		if (pte_none(ptep_get(old_pte)))
+	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
+				   new_ptep++, new_addr += PAGE_SIZE) {
+		if (pte_none(ptep_get(old_ptep)))
 			continue;
 
-		pte = ptep_get_and_clear(mm, old_addr, old_pte);
+		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
 		/*
 		 * If we are remapping a valid PTE, make sure
 		 * to flush TLB before we drop the PTL for the
@@ -258,7 +258,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		pte = move_soft_dirty_pte(pte);
 
 		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
-			pte_clear(mm, new_addr, new_pte);
+			pte_clear(mm, new_addr, new_ptep);
 		else {
 			if (need_clear_uffd_wp) {
 				if (pte_present(pte))
@@ -266,7 +266,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 				else if (is_swap_pte(pte))
 					pte = pte_swp_clear_uffd_wp(pte);
 			}
-			set_pte_at(mm, new_addr, new_pte, pte);
+			set_pte_at(mm, new_addr, new_ptep, pte);
 		}
 	}
 
@@ -275,8 +275,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		flush_tlb_range(vma, old_end - len, old_end);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
-	pte_unmap(new_pte - 1);
-	pte_unmap_unlock(old_pte - 1, old_ptl);
+	pte_unmap(new_ptep - 1);
+	pte_unmap_unlock(old_ptep - 1, old_ptl);
 out:
 	if (pmc->need_rmap_locks)
 		drop_rmap_locks(vma);
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/3] mm: Add generic helper to hint a large folio
  2025-05-06  5:00 [PATCH 0/3] Optimize mremap() by PTE-batching Dev Jain
  2025-05-06  5:00 ` [PATCH 1/3] mm: Call pointers to ptes as ptep Dev Jain
@ 2025-05-06  5:00 ` Dev Jain
  2025-05-06  9:10   ` Anshuman Khandual
                     ` (2 more replies)
  2025-05-06  5:00 ` [PATCH 3/3] mm: Optimize mremap() by PTE batching Dev Jain
  2025-05-06  9:16 ` [PATCH 0/3] Optimize mremap() by PTE-batching Anshuman Khandual
  3 siblings, 3 replies; 27+ messages in thread
From: Dev Jain @ 2025-05-06  5:00 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, Dev Jain

To use PTE batching, we want to determine whether the folio mapped by
the PTE is large, thus requiring the use of vm_normal_folio(). We want
to avoid the cost of vm_normal_folio() if the code path doesn't already
require the folio. For arm64, pte_batch_hint() does the job. To generalize
this hint, add a helper which will determine whether two consecutive PTEs
point to consecutive PFNs, in which case there is a high probability that
the underlying folio is large.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/pgtable.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c92..28e21fcc7837 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -369,6 +369,22 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
 }
 #endif
 
+/* Caller must ensure that ptep + 1 exists */
+static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
+{
+	pte_t *next_ptep, next_pte;
+
+	if (pte_batch_hint(ptep, pte) != 1)
+		return true;
+
+	next_ptep = ptep + 1;
+	next_pte = ptep_get(next_ptep);
+	if (!pte_present(next_pte))
+		return false;
+
+	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == PAGE_SIZE);
+}
+
 #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address,
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/3] mm: Optimize mremap() by PTE batching
  2025-05-06  5:00 [PATCH 0/3] Optimize mremap() by PTE-batching Dev Jain
  2025-05-06  5:00 ` [PATCH 1/3] mm: Call pointers to ptes as ptep Dev Jain
  2025-05-06  5:00 ` [PATCH 2/3] mm: Add generic helper to hint a large folio Dev Jain
@ 2025-05-06  5:00 ` Dev Jain
  2025-05-06 10:10   ` Anshuman Khandual
  2025-05-06 13:49   ` Lorenzo Stoakes
  2025-05-06  9:16 ` [PATCH 0/3] Optimize mremap() by PTE-batching Anshuman Khandual
  3 siblings, 2 replies; 27+ messages in thread
From: Dev Jain @ 2025-05-06  5:00 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, Dev Jain

Use folio_pte_batch() to optimize move_ptes(). Use get_and_clear_full_ptes()
so as to elide TLBIs on each contig block, which was previously done by
ptep_get_and_clear().

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mremap.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 1a08a7c3b92f..3621c07d8eea 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -176,7 +176,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	struct vm_area_struct *vma = pmc->old;
 	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *old_ptep, *new_ptep, pte;
+	pte_t *old_ptep, *new_ptep, old_pte, pte;
 	pmd_t dummy_pmdval;
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
@@ -185,6 +185,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	unsigned long old_end = old_addr + extent;
 	unsigned long len = old_end - old_addr;
 	int err = 0;
+	int nr;
 
 	/*
 	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
@@ -237,10 +238,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
 
 	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
 				   new_ptep++, new_addr += PAGE_SIZE) {
-		if (pte_none(ptep_get(old_ptep)))
+		const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+		int max_nr = (old_end - old_addr) >> PAGE_SHIFT;
+
+		nr = 1;
+		old_pte = ptep_get(old_ptep);
+		if (pte_none(old_pte))
 			continue;
 
-		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
 		/*
 		 * If we are remapping a valid PTE, make sure
 		 * to flush TLB before we drop the PTL for the
@@ -252,8 +257,17 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		 * the TLB entry for the old mapping has been
 		 * flushed.
 		 */
-		if (pte_present(pte))
+		if (pte_present(old_pte)) {
+			if ((max_nr != 1) && maybe_contiguous_pte_pfns(old_ptep, old_pte)) {
+				struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);
+
+				if (folio && folio_test_large(folio))
+					nr = folio_pte_batch(folio, old_addr, old_ptep,
+					old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
+			}
 			force_flush = true;
+		}
+		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
 		pte = move_pte(pte, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);
 
@@ -266,7 +280,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 				else if (is_swap_pte(pte))
 					pte = pte_swp_clear_uffd_wp(pte);
 			}
-			set_pte_at(mm, new_addr, new_ptep, pte);
+			set_ptes(mm, new_addr, new_ptep, pte, nr);
 		}
 	}
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] mm: Call pointers to ptes as ptep
  2025-05-06  5:00 ` [PATCH 1/3] mm: Call pointers to ptes as ptep Dev Jain
@ 2025-05-06  8:50   ` Anshuman Khandual
  2025-05-06  9:05     ` Lorenzo Stoakes
  2025-05-06 10:52   ` Lorenzo Stoakes
  1 sibling, 1 reply; 27+ messages in thread
From: Anshuman Khandual @ 2025-05-06  8:50 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, willy, ioworker0, yang

On 5/6/25 10:30, Dev Jain wrote:
> Avoid confusion between pte_t* and pte_t data types by appending pointer
> type variables by p. No functional change.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mremap.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 7db9da609c84..1a08a7c3b92f 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -176,7 +176,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	struct vm_area_struct *vma = pmc->old;
>  	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>  	struct mm_struct *mm = vma->vm_mm;
> -	pte_t *old_pte, *new_pte, pte;
> +	pte_t *old_ptep, *new_ptep, pte;
>  	pmd_t dummy_pmdval;
>  	spinlock_t *old_ptl, *new_ptl;
>  	bool force_flush = false;
> @@ -211,8 +211,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	 * We don't have to worry about the ordering of src and dst
>  	 * pte locks because exclusive mmap_lock prevents deadlock.
>  	 */
> -	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> -	if (!old_pte) {
> +	old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> +	if (!old_ptep) {
>  		err = -EAGAIN;
>  		goto out;
>  	}
> @@ -223,10 +223,10 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	 * mmap_lock, so this new_pte page is stable, so there is no need to get
>  	 * pmdval and do pmd_same() check.
>  	 */
> -	new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
> +	new_ptep = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
>  					   &new_ptl);
> -	if (!new_pte) {
> -		pte_unmap_unlock(old_pte, old_ptl);
> +	if (!new_ptep) {
> +		pte_unmap_unlock(old_ptep, old_ptl);
>  		err = -EAGAIN;
>  		goto out;
>  	}
> @@ -235,12 +235,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>  
> -	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
> -				   new_pte++, new_addr += PAGE_SIZE) {
> -		if (pte_none(ptep_get(old_pte)))
> +	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> +				   new_ptep++, new_addr += PAGE_SIZE) {
> +		if (pte_none(ptep_get(old_ptep)))
>  			continue;
>  
> -		pte = ptep_get_and_clear(mm, old_addr, old_pte);
> +		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>  		/*
>  		 * If we are remapping a valid PTE, make sure
>  		 * to flush TLB before we drop the PTL for the
> @@ -258,7 +258,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		pte = move_soft_dirty_pte(pte);
>  
>  		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> -			pte_clear(mm, new_addr, new_pte);
> +			pte_clear(mm, new_addr, new_ptep);
>  		else {
>  			if (need_clear_uffd_wp) {
>  				if (pte_present(pte))
> @@ -266,7 +266,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  				else if (is_swap_pte(pte))
>  					pte = pte_swp_clear_uffd_wp(pte);
>  			}
> -			set_pte_at(mm, new_addr, new_pte, pte);
> +			set_pte_at(mm, new_addr, new_ptep, pte);
>  		}
>  	}
>  
> @@ -275,8 +275,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		flush_tlb_range(vma, old_end - len, old_end);
>  	if (new_ptl != old_ptl)
>  		spin_unlock(new_ptl);
> -	pte_unmap(new_pte - 1);
> -	pte_unmap_unlock(old_pte - 1, old_ptl);
> +	pte_unmap(new_ptep - 1);
> +	pte_unmap_unlock(old_ptep - 1, old_ptl);
>  out:
>  	if (pmc->need_rmap_locks)
>  		drop_rmap_locks(vma);

Although this makes sense, mm/mremap.c might not be the only place with such
naming confusions regarding page table entry and page table pointers. Rather
than fixing this in just a single file, I guess generic MM wide change will
be better.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] mm: Call pointers to ptes as ptep
  2025-05-06  8:50   ` Anshuman Khandual
@ 2025-05-06  9:05     ` Lorenzo Stoakes
  0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-05-06  9:05 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Dev Jain, akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, willy, ioworker0, yang

On Tue, May 06, 2025 at 02:20:01PM +0530, Anshuman Khandual wrote:
> On 5/6/25 10:30, Dev Jain wrote:
> > Avoid confusion between pte_t* and pte_t data types by appending pointer
> > type variables by p. No functional change.
> >
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > ---
[snip]
> Although this makes sense, mm/mremap.c might not be the only place with such
> naming confusions regarding page table entry and page table pointers. Rather
> than fixing this in just a single file, I guess generic MM wide change will
> be better.

No, let's not please. that will be a ton of churn. This confusion literally
already exists all over the place so it's not egregious.

I am sympathetic to this kind of change (I really _hate_ this as an issue), and
since I recently refactored all of mremap I am likely to be ok with this here,
but let's do any such change iteratively, please.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
  2025-05-06  5:00 ` [PATCH 2/3] mm: Add generic helper to hint a large folio Dev Jain
@ 2025-05-06  9:10   ` Anshuman Khandual
  2025-05-06 13:34     ` Lorenzo Stoakes
  2025-05-06 15:46   ` Matthew Wilcox
  2025-05-07 10:03   ` David Hildenbrand
  2 siblings, 1 reply; 27+ messages in thread
From: Anshuman Khandual @ 2025-05-06  9:10 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, willy, ioworker0, yang

On 5/6/25 10:30, Dev Jain wrote:
> To use PTE batching, we want to determine whether the folio mapped by
> the PTE is large, thus requiring the use of vm_normal_folio(). We want
> to avoid the cost of vm_normal_folio() if the code path doesn't already
> require the folio. For arm64, pte_batch_hint() does the job. To generalize
> this hint, add a helper which will determine whether two consecutive PTEs
> point to consecutive PFNs, in which case there is a high probability that
> the underlying folio is large.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pgtable.h | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..28e21fcc7837 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -369,6 +369,22 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>  }
>  #endif
>  
> +/* Caller must ensure that ptep + 1 exists */
> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
> +{
> +	pte_t *next_ptep, next_pte;
> +
> +	if (pte_batch_hint(ptep, pte) != 1)
> +		return true;
> +
> +	next_ptep = ptep + 1;
> +	next_pte = ptep_get(next_ptep);
> +	if (!pte_present(next_pte))
> +		return false;
> +
> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == PAGE_SIZE);
> +}
> +
>  #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  					    unsigned long address,

As mentioned earlier, this new helper maybe_contiguous_pte_pfns() does not
have a proposed caller. Hence please do consider folding this forward with
the next patch where move_ptes() starts using the helper. Besides, it is
also difficult to review this patch without a proper caller context.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] Optimize mremap() by PTE-batching
  2025-05-06  5:00 [PATCH 0/3] Optimize mremap() by PTE-batching Dev Jain
                   ` (2 preceding siblings ...)
  2025-05-06  5:00 ` [PATCH 3/3] mm: Optimize mremap() by PTE batching Dev Jain
@ 2025-05-06  9:16 ` Anshuman Khandual
  2025-05-06 10:22   ` Dev Jain
  3 siblings, 1 reply; 27+ messages in thread
From: Anshuman Khandual @ 2025-05-06  9:16 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, willy, ioworker0, yang

On 5/6/25 10:30, Dev Jain wrote:
> Use PTE batching to optimize mremap().
> 
> Mapping 512K of memory, memsetting it, remapping it to src + 512K, and
> munmapping it 10,000 times, the average execution time reduces from 1.9 to
> 1.2 seconds, giving a 37% performance optimization. (Apple M3)

That's impressive improvement. But could you please re-organize the test
description into a pseudo code format or better provide the test program
itself (which should be compact anyways) just to be more clear about the
scenario where this helps.

> 
> Dev Jain (3):
>   mm: Call pointers to ptes as ptep
>   mm: Add generic helper to hint a large folio
>   mm: Optimize mremap() by PTE batching
> 
>  include/linux/pgtable.h | 16 +++++++++++++++
>  mm/mremap.c             | 44 +++++++++++++++++++++++++++--------------
>  2 files changed, 45 insertions(+), 15 deletions(-)
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] mm: Optimize mremap() by PTE batching
  2025-05-06  5:00 ` [PATCH 3/3] mm: Optimize mremap() by PTE batching Dev Jain
@ 2025-05-06 10:10   ` Anshuman Khandual
  2025-05-06 10:20     ` Dev Jain
  2025-05-06 13:49   ` Lorenzo Stoakes
  1 sibling, 1 reply; 27+ messages in thread
From: Anshuman Khandual @ 2025-05-06 10:10 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, willy, ioworker0, yang

On 5/6/25 10:30, Dev Jain wrote:
> Use folio_pte_batch() to optimize move_ptes(). Use get_and_clear_full_ptes()
> so as to elide TLBIs on each contig block, which was previously done by
> ptep_get_and_clear().
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mremap.c | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 1a08a7c3b92f..3621c07d8eea 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -176,7 +176,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	struct vm_area_struct *vma = pmc->old;
>  	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>  	struct mm_struct *mm = vma->vm_mm;
> -	pte_t *old_ptep, *new_ptep, pte;
> +	pte_t *old_ptep, *new_ptep, old_pte, pte;
>  	pmd_t dummy_pmdval;
>  	spinlock_t *old_ptl, *new_ptl;
>  	bool force_flush = false;
> @@ -185,6 +185,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	unsigned long old_end = old_addr + extent;
>  	unsigned long len = old_end - old_addr;
>  	int err = 0;
> +	int nr;
>  
>  	/*
>  	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
> @@ -237,10 +238,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  
>  	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>  				   new_ptep++, new_addr += PAGE_SIZE) {
> -		if (pte_none(ptep_get(old_ptep)))
> +		const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +		int max_nr = (old_end - old_addr) >> PAGE_SHIFT;
> +
> +		nr = 1;
> +		old_pte = ptep_get(old_ptep);
> +		if (pte_none(old_pte))
>  			continue;
>  
> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>  		/*
>  		 * If we are remapping a valid PTE, make sure
>  		 * to flush TLB before we drop the PTL for the
> @@ -252,8 +257,17 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		 * the TLB entry for the old mapping has been
>  		 * flushed.
>  		 */
> -		if (pte_present(pte))
> +		if (pte_present(old_pte)) {
> +			if ((max_nr != 1) && maybe_contiguous_pte_pfns(old_ptep, old_pte)) {

maybe_contiguous_pte_pfns() cost will be applicable for memory
areas greater than a single PAGE_SIZE (i.e max_nr != 1) ? This
helper extracts an additional consecutive pte, ensures that it
is valid mapped and extracts pfn before comparing for the span.

There is some cost associated with the above code sequence which
looks justified for sequential access of memory buffers that has
consecutive physical memory backing. But what happens when such
buffers are less probable, will those buffers take a performance
hit for all the comparisons that just turn out to be negative ?

> +				struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);
> +
> +				if (folio && folio_test_large(folio))
> +					nr = folio_pte_batch(folio, old_addr, old_ptep,
> +					old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
> +			}
>  			force_flush = true;
> +		}
> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
>  		pte = move_pte(pte, old_addr, new_addr);
>  		pte = move_soft_dirty_pte(pte);
>  
> @@ -266,7 +280,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  				else if (is_swap_pte(pte))
>  					pte = pte_swp_clear_uffd_wp(pte);
>  			}
> -			set_pte_at(mm, new_addr, new_ptep, pte);
> +			set_ptes(mm, new_addr, new_ptep, pte, nr);
>  		}
>  	}
>  


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] mm: Optimize mremap() by PTE batching
  2025-05-06 10:10   ` Anshuman Khandual
@ 2025-05-06 10:20     ` Dev Jain
  0 siblings, 0 replies; 27+ messages in thread
From: Dev Jain @ 2025-05-06 10:20 UTC (permalink / raw)
  To: Anshuman Khandual, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, willy, ioworker0, yang



On 06/05/25 3:40 pm, Anshuman Khandual wrote:
> On 5/6/25 10:30, Dev Jain wrote:
>> Use folio_pte_batch() to optimize move_ptes(). Use get_and_clear_full_ptes()
>> so as to elide TLBIs on each contig block, which was previously done by
>> ptep_get_and_clear().
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mremap.c | 24 +++++++++++++++++++-----
>>   1 file changed, 19 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 1a08a7c3b92f..3621c07d8eea 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -176,7 +176,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	struct vm_area_struct *vma = pmc->old;
>>   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>   	struct mm_struct *mm = vma->vm_mm;
>> -	pte_t *old_ptep, *new_ptep, pte;
>> +	pte_t *old_ptep, *new_ptep, old_pte, pte;
>>   	pmd_t dummy_pmdval;
>>   	spinlock_t *old_ptl, *new_ptl;
>>   	bool force_flush = false;
>> @@ -185,6 +185,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	unsigned long old_end = old_addr + extent;
>>   	unsigned long len = old_end - old_addr;
>>   	int err = 0;
>> +	int nr;
>>   
>>   	/*
>>   	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
>> @@ -237,10 +238,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   
>>   	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>>   				   new_ptep++, new_addr += PAGE_SIZE) {
>> -		if (pte_none(ptep_get(old_ptep)))
>> +		const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +		int max_nr = (old_end - old_addr) >> PAGE_SHIFT;
>> +
>> +		nr = 1;
>> +		old_pte = ptep_get(old_ptep);
>> +		if (pte_none(old_pte))
>>   			continue;
>>   
>> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>   		/*
>>   		 * If we are remapping a valid PTE, make sure
>>   		 * to flush TLB before we drop the PTL for the
>> @@ -252,8 +257,17 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   		 * the TLB entry for the old mapping has been
>>   		 * flushed.
>>   		 */
>> -		if (pte_present(pte))
>> +		if (pte_present(old_pte)) {
>> +			if ((max_nr != 1) && maybe_contiguous_pte_pfns(old_ptep, old_pte)) {
> 
> maybe_contiguous_pte_pfns() cost will be applicable for memory
> areas greater than a single PAGE_SIZE (i.e max_nr != 1) ? This
> helper extracts an additional consecutive pte, ensures that it
> is valid mapped and extracts pfn before comparing for the span.
> 
> There is some cost associated with the above code sequence which
> looks justified for sequential access of memory buffers that has
> consecutive physical memory backing.

I did not see any regression for the simple case of mremapping base pages.

> But what happens when such
> buffers are less probable, will those buffers take a performance
> hit for all the comparisons that just turn out to be negative ?

When would that be the case? We are remapping consecutive ptes to 
consecutive ptes.

> 
>> +				struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);
>> +
>> +				if (folio && folio_test_large(folio))
>> +					nr = folio_pte_batch(folio, old_addr, old_ptep,
>> +					old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
>> +			}
>>   			force_flush = true;
>> +		}
>> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
>>   		pte = move_pte(pte, old_addr, new_addr);
>>   		pte = move_soft_dirty_pte(pte);
>>   
>> @@ -266,7 +280,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   				else if (is_swap_pte(pte))
>>   					pte = pte_swp_clear_uffd_wp(pte);
>>   			}
>> -			set_pte_at(mm, new_addr, new_ptep, pte);
>> +			set_ptes(mm, new_addr, new_ptep, pte, nr);
>>   		}
>>   	}
>>   



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] Optimize mremap() by PTE-batching
  2025-05-06  9:16 ` [PATCH 0/3] Optimize mremap() by PTE-batching Anshuman Khandual
@ 2025-05-06 10:22   ` Dev Jain
  2025-05-06 10:44     ` Lorenzo Stoakes
  0 siblings, 1 reply; 27+ messages in thread
From: Dev Jain @ 2025-05-06 10:22 UTC (permalink / raw)
  To: Anshuman Khandual, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, willy, ioworker0, yang



On 06/05/25 2:46 pm, Anshuman Khandual wrote:
> On 5/6/25 10:30, Dev Jain wrote:
>> Use PTE batching to optimize mremap().
>>
>> Mapping 512K of memory, memsetting it, remapping it to src + 512K, and
>> munmapping it 10,000 times, the average execution time reduces from 1.9 to
>> 1.2 seconds, giving a 37% performance optimization. (Apple M3)
> 
> That's impressive improvement. But could you please re-organize the test
> description into a pseudo code format or better provide the test program
> itself (which should be compact anyways) just to be more clear about the
> scenario where this helps.

Sure.

> 
>>
>> Dev Jain (3):
>>    mm: Call pointers to ptes as ptep
>>    mm: Add generic helper to hint a large folio
>>    mm: Optimize mremap() by PTE batching
>>
>>   include/linux/pgtable.h | 16 +++++++++++++++
>>   mm/mremap.c             | 44 +++++++++++++++++++++++++++--------------
>>   2 files changed, 45 insertions(+), 15 deletions(-)
>>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] Optimize mremap() by PTE-batching
  2025-05-06 10:22   ` Dev Jain
@ 2025-05-06 10:44     ` Lorenzo Stoakes
  2025-05-06 11:53       ` Dev Jain
  0 siblings, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-05-06 10:44 UTC (permalink / raw)
  To: Dev Jain
  Cc: Anshuman Khandual, akpm, Liam.Howlett, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, david, peterx, ryan.roberts, mingo,
	libang.li, maobibo, zhengqi.arch, baohua, willy, ioworker0, yang

On Tue, May 06, 2025 at 03:52:39PM +0530, Dev Jain wrote:
>
>
> On 06/05/25 2:46 pm, Anshuman Khandual wrote:
> > On 5/6/25 10:30, Dev Jain wrote:
> > > Use PTE batching to optimize mremap().
> > >
> > > Mapping 512K of memory, memsetting it, remapping it to src + 512K, and
> > > munmapping it 10,000 times, the average execution time reduces from 1.9 to
> > > 1.2 seconds, giving a 37% performance optimization. (Apple M3)
> >
> > That's impressive improvement. But could you please re-organize the test
> > description into a pseudo code format or better provide the test program
> > itself (which should be compact anyways) just to be more clear about the
> > scenario where this helps.
>
> Sure.

I echo Ashuman's comment, definitely would like to see that.

And wrt to perf improvement, whether it's a microbenchmark or not, that's a
great result so well done :) I echo this also!

However, it'd be good to see some more detail here also, you're kind of missing
out - everything - about why this improvement happens - what the intent of the
series is, anything about large folios, under what circumstances you'll see an
improvement, etc. etc.

While this might duplicate comments you've made elsewhere, it's mandatory for a
series, and Andrew is unlikely to take this without it.

In mm we place the contents of the cover letter in the first commit in the
series, so it gets stored for posterity also!

Cheers, Lorenzo

>
> >
> > >
> > > Dev Jain (3):
> > >    mm: Call pointers to ptes as ptep
> > >    mm: Add generic helper to hint a large folio
> > >    mm: Optimize mremap() by PTE batching
> > >
> > >   include/linux/pgtable.h | 16 +++++++++++++++
> > >   mm/mremap.c             | 44 +++++++++++++++++++++++++++--------------
> > >   2 files changed, 45 insertions(+), 15 deletions(-)
> > >
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] mm: Call pointers to ptes as ptep
  2025-05-06  5:00 ` [PATCH 1/3] mm: Call pointers to ptes as ptep Dev Jain
  2025-05-06  8:50   ` Anshuman Khandual
@ 2025-05-06 10:52   ` Lorenzo Stoakes
  2025-05-06 11:52     ` Dev Jain
  1 sibling, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-05-06 10:52 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang

On Tue, May 06, 2025 at 10:30:54AM +0530, Dev Jain wrote:
> Avoid confusion between pte_t* and pte_t data types by appending pointer
> type variables by p. No functional change.

NIT: 'appending'->'suffixing' and 'by p' -> with p'.

>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

This looks generally fine, could you fix the nit below however... sorry to
be a pain!

Thanks!

> ---
>  mm/mremap.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 7db9da609c84..1a08a7c3b92f 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -176,7 +176,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	struct vm_area_struct *vma = pmc->old;
>  	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>  	struct mm_struct *mm = vma->vm_mm;
> -	pte_t *old_pte, *new_pte, pte;
> +	pte_t *old_ptep, *new_ptep, pte;

While we're at it, can we please move the pte decl to a new line? Mixing
pointers and non-pointers is not great (I refactored it but mremap still
has a bunch of less-than-ideal stuff in it :)

>  	pmd_t dummy_pmdval;
>  	spinlock_t *old_ptl, *new_ptl;
>  	bool force_flush = false;
> @@ -211,8 +211,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	 * We don't have to worry about the ordering of src and dst
>  	 * pte locks because exclusive mmap_lock prevents deadlock.
>  	 */
> -	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> -	if (!old_pte) {
> +	old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> +	if (!old_ptep) {
>  		err = -EAGAIN;
>  		goto out;
>  	}
> @@ -223,10 +223,10 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	 * mmap_lock, so this new_pte page is stable, so there is no need to get
>  	 * pmdval and do pmd_same() check.
>  	 */
> -	new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
> +	new_ptep = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
>  					   &new_ptl);
> -	if (!new_pte) {
> -		pte_unmap_unlock(old_pte, old_ptl);
> +	if (!new_ptep) {
> +		pte_unmap_unlock(old_ptep, old_ptl);
>  		err = -EAGAIN;
>  		goto out;
>  	}
> @@ -235,12 +235,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>
> -	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
> -				   new_pte++, new_addr += PAGE_SIZE) {
> -		if (pte_none(ptep_get(old_pte)))
> +	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> +				   new_ptep++, new_addr += PAGE_SIZE) {
> +		if (pte_none(ptep_get(old_ptep)))
>  			continue;
>
> -		pte = ptep_get_and_clear(mm, old_addr, old_pte);
> +		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>  		/*
>  		 * If we are remapping a valid PTE, make sure
>  		 * to flush TLB before we drop the PTL for the
> @@ -258,7 +258,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		pte = move_soft_dirty_pte(pte);
>
>  		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> -			pte_clear(mm, new_addr, new_pte);
> +			pte_clear(mm, new_addr, new_ptep);
>  		else {
>  			if (need_clear_uffd_wp) {
>  				if (pte_present(pte))
> @@ -266,7 +266,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  				else if (is_swap_pte(pte))
>  					pte = pte_swp_clear_uffd_wp(pte);
>  			}
> -			set_pte_at(mm, new_addr, new_pte, pte);
> +			set_pte_at(mm, new_addr, new_ptep, pte);
>  		}
>  	}
>
> @@ -275,8 +275,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		flush_tlb_range(vma, old_end - len, old_end);
>  	if (new_ptl != old_ptl)
>  		spin_unlock(new_ptl);
> -	pte_unmap(new_pte - 1);
> -	pte_unmap_unlock(old_pte - 1, old_ptl);
> +	pte_unmap(new_ptep - 1);
> +	pte_unmap_unlock(old_ptep - 1, old_ptl);
>  out:
>  	if (pmc->need_rmap_locks)
>  		drop_rmap_locks(vma);
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] mm: Call pointers to ptes as ptep
  2025-05-06 10:52   ` Lorenzo Stoakes
@ 2025-05-06 11:52     ` Dev Jain
  0 siblings, 0 replies; 27+ messages in thread
From: Dev Jain @ 2025-05-06 11:52 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang



On 06/05/25 4:22 pm, Lorenzo Stoakes wrote:
> On Tue, May 06, 2025 at 10:30:54AM +0530, Dev Jain wrote:
>> Avoid confusion between pte_t* and pte_t data types by appending pointer
>> type variables by p. No functional change.
> 
> NIT: 'appending'->'suffixing' and 'by p' -> with p'.

Thanks.

> 
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> 
> This looks generally fine, could you fix the nit below however... sorry to
> be a pain!
> 
> Thanks!
> 
>> ---
>>   mm/mremap.c | 28 ++++++++++++++--------------
>>   1 file changed, 14 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 7db9da609c84..1a08a7c3b92f 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -176,7 +176,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	struct vm_area_struct *vma = pmc->old;
>>   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>   	struct mm_struct *mm = vma->vm_mm;
>> -	pte_t *old_pte, *new_pte, pte;
>> +	pte_t *old_ptep, *new_ptep, pte;
> 
> While we're at it, can we please move the pte decl to a new line? Mixing
> pointers and non-pointers is not great (I refactored it but mremap still
> has a bunch of less-than-ideal stuff in it :)

Sure.

> 
>>   	pmd_t dummy_pmdval;
>>   	spinlock_t *old_ptl, *new_ptl;
>>   	bool force_flush = false;
>> @@ -211,8 +211,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	 * We don't have to worry about the ordering of src and dst
>>   	 * pte locks because exclusive mmap_lock prevents deadlock.
>>   	 */
>> -	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
>> -	if (!old_pte) {
>> +	old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
>> +	if (!old_ptep) {
>>   		err = -EAGAIN;
>>   		goto out;
>>   	}
>> @@ -223,10 +223,10 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	 * mmap_lock, so this new_pte page is stable, so there is no need to get
>>   	 * pmdval and do pmd_same() check.
>>   	 */
>> -	new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
>> +	new_ptep = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
>>   					   &new_ptl);
>> -	if (!new_pte) {
>> -		pte_unmap_unlock(old_pte, old_ptl);
>> +	if (!new_ptep) {
>> +		pte_unmap_unlock(old_ptep, old_ptl);
>>   		err = -EAGAIN;
>>   		goto out;
>>   	}
>> @@ -235,12 +235,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	flush_tlb_batched_pending(vma->vm_mm);
>>   	arch_enter_lazy_mmu_mode();
>>
>> -	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
>> -				   new_pte++, new_addr += PAGE_SIZE) {
>> -		if (pte_none(ptep_get(old_pte)))
>> +	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>> +				   new_ptep++, new_addr += PAGE_SIZE) {
>> +		if (pte_none(ptep_get(old_ptep)))
>>   			continue;
>>
>> -		pte = ptep_get_and_clear(mm, old_addr, old_pte);
>> +		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>   		/*
>>   		 * If we are remapping a valid PTE, make sure
>>   		 * to flush TLB before we drop the PTL for the
>> @@ -258,7 +258,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   		pte = move_soft_dirty_pte(pte);
>>
>>   		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>> -			pte_clear(mm, new_addr, new_pte);
>> +			pte_clear(mm, new_addr, new_ptep);
>>   		else {
>>   			if (need_clear_uffd_wp) {
>>   				if (pte_present(pte))
>> @@ -266,7 +266,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   				else if (is_swap_pte(pte))
>>   					pte = pte_swp_clear_uffd_wp(pte);
>>   			}
>> -			set_pte_at(mm, new_addr, new_pte, pte);
>> +			set_pte_at(mm, new_addr, new_ptep, pte);
>>   		}
>>   	}
>>
>> @@ -275,8 +275,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   		flush_tlb_range(vma, old_end - len, old_end);
>>   	if (new_ptl != old_ptl)
>>   		spin_unlock(new_ptl);
>> -	pte_unmap(new_pte - 1);
>> -	pte_unmap_unlock(old_pte - 1, old_ptl);
>> +	pte_unmap(new_ptep - 1);
>> +	pte_unmap_unlock(old_ptep - 1, old_ptl);
>>   out:
>>   	if (pmc->need_rmap_locks)
>>   		drop_rmap_locks(vma);
>> --
>> 2.30.2
>>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] Optimize mremap() by PTE-batching
  2025-05-06 10:44     ` Lorenzo Stoakes
@ 2025-05-06 11:53       ` Dev Jain
  0 siblings, 0 replies; 27+ messages in thread
From: Dev Jain @ 2025-05-06 11:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Anshuman Khandual, akpm, Liam.Howlett, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, david, peterx, ryan.roberts, mingo,
	libang.li, maobibo, zhengqi.arch, baohua, willy, ioworker0, yang



On 06/05/25 4:14 pm, Lorenzo Stoakes wrote:
> On Tue, May 06, 2025 at 03:52:39PM +0530, Dev Jain wrote:
>>
>>
>> On 06/05/25 2:46 pm, Anshuman Khandual wrote:
>>> On 5/6/25 10:30, Dev Jain wrote:
>>>> Use PTE batching to optimize mremap().
>>>>
>>>> Mapping 512K of memory, memsetting it, remapping it to src + 512K, and
>>>> munmapping it 10,000 times, the average execution time reduces from 1.9 to
>>>> 1.2 seconds, giving a 37% performance optimization. (Apple M3)
>>>
>>> That's impressive improvement. But could you please re-organize the test
>>> description into a pseudo code format or better provide the test program
>>> itself (which should be compact anyways) just to be more clear about the
>>> scenario where this helps.
>>
>> Sure.
> 
> I echo Ashuman's comment, definitely would like to see that.
> 
> And wrt to perf improvement, whether it's a microbenchmark or not, that's a
> great result so well done :) I echo this also!
> 
> However, it'd be good to see some more detail here also, you're kind of missing
> out - everything - about why this improvement happens - what the intent of the
> series is, anything about large folios, under what circumstances you'll see an
> improvement, etc. etc.
> 
> While this might duplicate comments you've made elsewhere, it's mandatory for a
> series, and Andrew is unlikely to take this without it.
> 
> In mm we place the contents of the cover letter in the first commit in the
> series, so it gets stored for posterity also!
> 
> Cheers, Lorenzo

Thanks for your feedback. I'll make the required changes.

> 
>>
>>>
>>>>
>>>> Dev Jain (3):
>>>>     mm: Call pointers to ptes as ptep
>>>>     mm: Add generic helper to hint a large folio
>>>>     mm: Optimize mremap() by PTE batching
>>>>
>>>>    include/linux/pgtable.h | 16 +++++++++++++++
>>>>    mm/mremap.c             | 44 +++++++++++++++++++++++++++--------------
>>>>    2 files changed, 45 insertions(+), 15 deletions(-)
>>>>
>>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
  2025-05-06  9:10   ` Anshuman Khandual
@ 2025-05-06 13:34     ` Lorenzo Stoakes
  0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-05-06 13:34 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Dev Jain, akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, willy, ioworker0, yang

On Tue, May 06, 2025 at 02:40:44PM +0530, Anshuman Khandual wrote:
> On 5/6/25 10:30, Dev Jain wrote:
> > To use PTE batching, we want to determine whether the folio mapped by
> > the PTE is large, thus requiring the use of vm_normal_folio(). We want
> > to avoid the cost of vm_normal_folio() if the code path doesn't already
> > require the folio. For arm64, pte_batch_hint() does the job. To generalize
> > this hint, add a helper which will determine whether two consecutive PTEs
> > point to consecutive PFNs, in which case there is a high probability that
> > the underlying folio is large.
> >
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > ---
> >  include/linux/pgtable.h | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index b50447ef1c92..28e21fcc7837 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -369,6 +369,22 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> >  }
> >  #endif
> >
> > +/* Caller must ensure that ptep + 1 exists */
> > +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
> > +{
> > +	pte_t *next_ptep, next_pte;
> > +
> > +	if (pte_batch_hint(ptep, pte) != 1)
> > +		return true;
> > +
> > +	next_ptep = ptep + 1;
> > +	next_pte = ptep_get(next_ptep);
> > +	if (!pte_present(next_pte))
> > +		return false;
> > +
> > +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == PAGE_SIZE);
> > +}
> > +
> >  #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> >  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> >  					    unsigned long address,
>
> As mentioned earlier, this new helper maybe_contiguous_pte_pfns() does not
> have a proposed caller. Hence please do consider folding this forward with
> the next patch where move_ptes() starts using the helper. Besides, it is
> also difficult to review this patch without a proper caller context.

Agreed, please combine the two.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] mm: Optimize mremap() by PTE batching
  2025-05-06  5:00 ` [PATCH 3/3] mm: Optimize mremap() by PTE batching Dev Jain
  2025-05-06 10:10   ` Anshuman Khandual
@ 2025-05-06 13:49   ` Lorenzo Stoakes
  2025-05-06 14:03     ` Lorenzo Stoakes
  2025-05-06 14:10     ` Dev Jain
  1 sibling, 2 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-05-06 13:49 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang

On Tue, May 06, 2025 at 10:30:56AM +0530, Dev Jain wrote:
> Use folio_pte_batch() to optimize move_ptes(). Use get_and_clear_full_ptes()
> so as to elide TLBIs on each contig block, which was previously done by
> ptep_get_and_clear().

No mention of large folios

>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mremap.c | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 1a08a7c3b92f..3621c07d8eea 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -176,7 +176,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	struct vm_area_struct *vma = pmc->old;
>  	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>  	struct mm_struct *mm = vma->vm_mm;
> -	pte_t *old_ptep, *new_ptep, pte;
> +	pte_t *old_ptep, *new_ptep, old_pte, pte;

Obviously given previous comment you know what I'm going to say here :) let's
put old_pte, pte in a new decl.

>  	pmd_t dummy_pmdval;
>  	spinlock_t *old_ptl, *new_ptl;
>  	bool force_flush = false;
> @@ -185,6 +185,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	unsigned long old_end = old_addr + extent;
>  	unsigned long len = old_end - old_addr;
>  	int err = 0;
> +	int nr;
>
>  	/*
>  	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
> @@ -237,10 +238,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
>
>  	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>  				   new_ptep++, new_addr += PAGE_SIZE) {

Hm this just seems wrong, even if we're dealing with a large folio we're still
offsetting by PAGE_SIZE each time and iterating through further sub-pages?

Shouldn't we be doing something like += nr and += PAGE_SIZE * nr?

Then it'd make sense to initialise nr to 1.

Honestly I'd prefer us though to refactor move_ptes() to something like:

	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
				   new_ptep++, new_addr += PAGE_SIZE) {
		pte_t old_pte = ptep_get(old_ptep);

		if (pte_none(old_pte))
			continue;

		move_pte(pmc, vma, old_ptep, old_pte);
	}

Declaring this new move_pte() where you can put the rest of the stuff.

I'd much rather we do this than add to the mess as-is.



> -		if (pte_none(ptep_get(old_ptep)))
> +		const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +		int max_nr = (old_end - old_addr) >> PAGE_SHIFT;
> +
> +		nr = 1;
> +		old_pte = ptep_get(old_ptep);

You can declare this in the for loop, no need for us to contaminate whole
function scope with it.

Same with 'nr' in this implementation (though I'd rather you changed it up, see
above).

> +		if (pte_none(old_pte))
>  			continue;
>
> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>  		/*
>  		 * If we are remapping a valid PTE, make sure
>  		 * to flush TLB before we drop the PTL for the
> @@ -252,8 +257,17 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		 * the TLB entry for the old mapping has been
>  		 * flushed.
>  		 */
> -		if (pte_present(pte))
> +		if (pte_present(old_pte)) {
> +			if ((max_nr != 1) && maybe_contiguous_pte_pfns(old_ptep, old_pte)) {
> +				struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);
> +
> +				if (folio && folio_test_large(folio))
> +					nr = folio_pte_batch(folio, old_addr, old_ptep,
> +					old_pte, max_nr, fpb_flags, NULL, NULL, NULL);

Indentation seems completely broken here? I also hate that we're nesting to this
degree? Can we please find a way not to?

This function is already a bit of a clogged mess, I'd rather we clean up then
add to it.

(See above again :)


> +			}
>  			force_flush = true;
> +		}
> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
>  		pte = move_pte(pte, old_addr, new_addr);
>  		pte = move_soft_dirty_pte(pte);
>
> @@ -266,7 +280,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  				else if (is_swap_pte(pte))
>  					pte = pte_swp_clear_uffd_wp(pte);
>  			}
> -			set_pte_at(mm, new_addr, new_ptep, pte);
> +			set_ptes(mm, new_addr, new_ptep, pte, nr);
>  		}
>  	}
>
> --
> 2.30.2
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] mm: Optimize mremap() by PTE batching
  2025-05-06 13:49   ` Lorenzo Stoakes
@ 2025-05-06 14:03     ` Lorenzo Stoakes
  2025-05-06 14:10     ` Dev Jain
  1 sibling, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-05-06 14:03 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang

On Tue, May 06, 2025 at 02:49:04PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 06, 2025 at 10:30:56AM +0530, Dev Jain wrote:
> > Use folio_pte_batch() to optimize move_ptes(). Use get_and_clear_full_ptes()
> > so as to elide TLBIs on each contig block, which was previously done by
> > ptep_get_and_clear().
>
> No mention of large folios

Sorry didn't finish my sentence here. I suggest you add more detail
here. Again the 'why' and what is this for etc. etc.

Equally I don't think having code that seemingly randomly invokes batched
functions is a good idea, I think the relevant logic should be separated
into functions that explicit reference large folios or should have comments
explaining what you're doing.

You can see some of how I separated out such things for my
MREMAP_RELOCATE_ANON series at [0] for instance.

[0]: https://lore.kernel.org/linux-mm/8ca7c8219a2a0e67e1c5c277d0c0d070052de401.1746305604.git.lorenzo.stoakes@oracle.com/

Thanks!

>
> >
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > ---
> >  mm/mremap.c | 24 +++++++++++++++++++-----
> >  1 file changed, 19 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 1a08a7c3b92f..3621c07d8eea 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -176,7 +176,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >  	struct vm_area_struct *vma = pmc->old;
> >  	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
> >  	struct mm_struct *mm = vma->vm_mm;
> > -	pte_t *old_ptep, *new_ptep, pte;
> > +	pte_t *old_ptep, *new_ptep, old_pte, pte;
>
> Obviously given previous comment you know what I'm going to say here :) let's
> put old_pte, pte in a new decl.
>
> >  	pmd_t dummy_pmdval;
> >  	spinlock_t *old_ptl, *new_ptl;
> >  	bool force_flush = false;
> > @@ -185,6 +185,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >  	unsigned long old_end = old_addr + extent;
> >  	unsigned long len = old_end - old_addr;
> >  	int err = 0;
> > +	int nr;
> >
> >  	/*
> >  	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
> > @@ -237,10 +238,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >
> >  	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> >  				   new_ptep++, new_addr += PAGE_SIZE) {
>
> Hm this just seems wrong, even if we're dealing with a large folio we're still
> offsetting by PAGE_SIZE each time and iterating through further sub-pages?
>
> Shouldn't we be doing something like += nr and += PAGE_SIZE * nr?
>
> Then it'd make sense to initialise nr to 1.
>
> Honestly I'd prefer us though to refactor move_ptes() to something like:
>
> 	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> 				   new_ptep++, new_addr += PAGE_SIZE) {
> 		pte_t old_pte = ptep_get(old_ptep);
>
> 		if (pte_none(old_pte))
> 			continue;
>
> 		move_pte(pmc, vma, old_ptep, old_pte);
> 	}
>
> Declaring this new move_pte() where you can put the rest of the stuff.
>
> I'd much rather we do this than add to the mess as-is.
>
>
>
> > -		if (pte_none(ptep_get(old_ptep)))
> > +		const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> > +		int max_nr = (old_end - old_addr) >> PAGE_SHIFT;
> > +
> > +		nr = 1;
> > +		old_pte = ptep_get(old_ptep);
>
> You can declare this in the for loop, no need for us to contaminate whole
> function scope with it.
>
> Same with 'nr' in this implementation (though I'd rather you changed it up, see
> above).
>
> > +		if (pte_none(old_pte))
> >  			continue;
> >
> > -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
> >  		/*
> >  		 * If we are remapping a valid PTE, make sure
> >  		 * to flush TLB before we drop the PTL for the
> > @@ -252,8 +257,17 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >  		 * the TLB entry for the old mapping has been
> >  		 * flushed.
> >  		 */
> > -		if (pte_present(pte))
> > +		if (pte_present(old_pte)) {
> > +			if ((max_nr != 1) && maybe_contiguous_pte_pfns(old_ptep, old_pte)) {
> > +				struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);
> > +
> > +				if (folio && folio_test_large(folio))
> > +					nr = folio_pte_batch(folio, old_addr, old_ptep,
> > +					old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
>
> Indentation seems completely broken here? I also hate that we're nesting to this
> degree? Can we please find a way not to?
>
> This function is already a bit of a clogged mess, I'd rather we clean up then
> add to it.
>
> (See above again :)
>
>
> > +			}
> >  			force_flush = true;
> > +		}
> > +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
> >  		pte = move_pte(pte, old_addr, new_addr);
> >  		pte = move_soft_dirty_pte(pte);
> >
> > @@ -266,7 +280,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >  				else if (is_swap_pte(pte))
> >  					pte = pte_swp_clear_uffd_wp(pte);
> >  			}
> > -			set_pte_at(mm, new_addr, new_ptep, pte);
> > +			set_ptes(mm, new_addr, new_ptep, pte, nr);
> >  		}
> >  	}
> >
> > --
> > 2.30.2
> >
>
> Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] mm: Optimize mremap() by PTE batching
  2025-05-06 13:49   ` Lorenzo Stoakes
  2025-05-06 14:03     ` Lorenzo Stoakes
@ 2025-05-06 14:10     ` Dev Jain
  2025-05-06 14:14       ` Lorenzo Stoakes
  1 sibling, 1 reply; 27+ messages in thread
From: Dev Jain @ 2025-05-06 14:10 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang



On 06/05/25 7:19 pm, Lorenzo Stoakes wrote:
> On Tue, May 06, 2025 at 10:30:56AM +0530, Dev Jain wrote:
>> Use folio_pte_batch() to optimize move_ptes(). Use get_and_clear_full_ptes()
>> so as to elide TLBIs on each contig block, which was previously done by
>> ptep_get_and_clear().
> 
> No mention of large folios
> 
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mremap.c | 24 +++++++++++++++++++-----
>>   1 file changed, 19 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 1a08a7c3b92f..3621c07d8eea 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -176,7 +176,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	struct vm_area_struct *vma = pmc->old;
>>   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>   	struct mm_struct *mm = vma->vm_mm;
>> -	pte_t *old_ptep, *new_ptep, pte;
>> +	pte_t *old_ptep, *new_ptep, old_pte, pte;
> 
> Obviously given previous comment you know what I'm going to say here :) let's
> put old_pte, pte in a new decl.
> 
>>   	pmd_t dummy_pmdval;
>>   	spinlock_t *old_ptl, *new_ptl;
>>   	bool force_flush = false;
>> @@ -185,6 +185,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	unsigned long old_end = old_addr + extent;
>>   	unsigned long len = old_end - old_addr;
>>   	int err = 0;
>> +	int nr;
>>
>>   	/*
>>   	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
>> @@ -237,10 +238,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>
>>   	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>>   				   new_ptep++, new_addr += PAGE_SIZE) {
> 
> Hm this just seems wrong, even if we're dealing with a large folio we're still
> offsetting by PAGE_SIZE each time and iterating through further sub-pages?
> 
> Shouldn't we be doing something like += nr and += PAGE_SIZE * nr?

This is embarrassing *facepalm* . The crazy part is that I didn't even 
notice this because I got an optimization due to get_and_clear_full_ptes 
-> the number of TLB flushes reduced, and the loop continued due to 
pte_none().

> 
> Then it'd make sense to initialise nr to 1.
> 
> Honestly I'd prefer us though to refactor move_ptes() to something like:
> 
> 	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> 				   new_ptep++, new_addr += PAGE_SIZE) {
> 		pte_t old_pte = ptep_get(old_ptep);
> 
> 		if (pte_none(old_pte))
> 			continue;
> 
> 		move_pte(pmc, vma, old_ptep, old_pte);
> 	}
> 
> Declaring this new move_pte() where you can put the rest of the stuff.
> 
> I'd much rather we do this than add to the mess as-is.
> 
> 
> 
>> -		if (pte_none(ptep_get(old_ptep)))
>> +		const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +		int max_nr = (old_end - old_addr) >> PAGE_SHIFT;
>> +
>> +		nr = 1;
>> +		old_pte = ptep_get(old_ptep);
> 
> You can declare this in the for loop, no need for us to contaminate whole
> function scope with it.
> 
> Same with 'nr' in this implementation (though I'd rather you changed it up, see
> above).
> 
>> +		if (pte_none(old_pte))
>>   			continue;
>>
>> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>   		/*
>>   		 * If we are remapping a valid PTE, make sure
>>   		 * to flush TLB before we drop the PTL for the
>> @@ -252,8 +257,17 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   		 * the TLB entry for the old mapping has been
>>   		 * flushed.
>>   		 */
>> -		if (pte_present(pte))
>> +		if (pte_present(old_pte)) {
>> +			if ((max_nr != 1) && maybe_contiguous_pte_pfns(old_ptep, old_pte)) {
>> +				struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);
>> +
>> +				if (folio && folio_test_large(folio))
>> +					nr = folio_pte_batch(folio, old_addr, old_ptep,
>> +					old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
> 
> Indentation seems completely broken here? I also hate that we're nesting to this
> degree? Can we please find a way not to?
> 
> This function is already a bit of a clogged mess, I'd rather we clean up then
> add to it.
> 
> (See above again :)
> 
> 
>> +			}
>>   			force_flush = true;
>> +		}
>> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
>>   		pte = move_pte(pte, old_addr, new_addr);
>>   		pte = move_soft_dirty_pte(pte);
>>
>> @@ -266,7 +280,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   				else if (is_swap_pte(pte))
>>   					pte = pte_swp_clear_uffd_wp(pte);
>>   			}
>> -			set_pte_at(mm, new_addr, new_ptep, pte);
>> +			set_ptes(mm, new_addr, new_ptep, pte, nr);
>>   		}
>>   	}
>>
>> --
>> 2.30.2
>>
> 
> Cheers, Lorenzo



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] mm: Optimize mremap() by PTE batching
  2025-05-06 14:10     ` Dev Jain
@ 2025-05-06 14:14       ` Lorenzo Stoakes
  0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-05-06 14:14 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang

On Tue, May 06, 2025 at 07:40:49PM +0530, Dev Jain wrote:
>
>
> On 06/05/25 7:19 pm, Lorenzo Stoakes wrote:
> > On Tue, May 06, 2025 at 10:30:56AM +0530, Dev Jain wrote:
> > > Use folio_pte_batch() to optimize move_ptes(). Use get_and_clear_full_ptes()
> > > so as to elide TLBIs on each contig block, which was previously done by
> > > ptep_get_and_clear().
> >
> > No mention of large folios
> >
> > >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > >   mm/mremap.c | 24 +++++++++++++++++++-----
> > >   1 file changed, 19 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 1a08a7c3b92f..3621c07d8eea 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -176,7 +176,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	struct vm_area_struct *vma = pmc->old;
> > >   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
> > >   	struct mm_struct *mm = vma->vm_mm;
> > > -	pte_t *old_ptep, *new_ptep, pte;
> > > +	pte_t *old_ptep, *new_ptep, old_pte, pte;
> >
> > Obviously given previous comment you know what I'm going to say here :) let's
> > put old_pte, pte in a new decl.
> >
> > >   	pmd_t dummy_pmdval;
> > >   	spinlock_t *old_ptl, *new_ptl;
> > >   	bool force_flush = false;
> > > @@ -185,6 +185,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	unsigned long old_end = old_addr + extent;
> > >   	unsigned long len = old_end - old_addr;
> > >   	int err = 0;
> > > +	int nr;
> > >
> > >   	/*
> > >   	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
> > > @@ -237,10 +238,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >
> > >   	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> > >   				   new_ptep++, new_addr += PAGE_SIZE) {
> >
> > Hm this just seems wrong, even if we're dealing with a large folio we're still
> > offsetting by PAGE_SIZE each time and iterating through further sub-pages?
> >
> > Shouldn't we be doing something like += nr and += PAGE_SIZE * nr?
>
> This is embarrassing *facepalm* . The crazy part is that I didn't even
> notice this because I got an optimization due to get_and_clear_full_ptes ->
> the number of TLB flushes reduced, and the loop continued due to pte_none().

Haha don't worry, we've all missed things like this in the past (I know I
have...), this is what code review is for :>)

>
> >
> > Then it'd make sense to initialise nr to 1.
> >
> > Honestly I'd prefer us though to refactor move_ptes() to something like:
> >
> > 	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> > 				   new_ptep++, new_addr += PAGE_SIZE) {
> > 		pte_t old_pte = ptep_get(old_ptep);
> >
> > 		if (pte_none(old_pte))
> > 			continue;
> >
> > 		move_pte(pmc, vma, old_ptep, old_pte);
> > 	}
> >
> > Declaring this new move_pte() where you can put the rest of the stuff.
> >
> > I'd much rather we do this than add to the mess as-is.
> >
> >
> >
> > > -		if (pte_none(ptep_get(old_ptep)))
> > > +		const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> > > +		int max_nr = (old_end - old_addr) >> PAGE_SHIFT;
> > > +
> > > +		nr = 1;
> > > +		old_pte = ptep_get(old_ptep);
> >
> > You can declare this in the for loop, no need for us to contaminate whole
> > function scope with it.
> >
> > Same with 'nr' in this implementation (though I'd rather you changed it up, see
> > above).
> >
> > > +		if (pte_none(old_pte))
> > >   			continue;
> > >
> > > -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
> > >   		/*
> > >   		 * If we are remapping a valid PTE, make sure
> > >   		 * to flush TLB before we drop the PTL for the
> > > @@ -252,8 +257,17 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   		 * the TLB entry for the old mapping has been
> > >   		 * flushed.
> > >   		 */
> > > -		if (pte_present(pte))
> > > +		if (pte_present(old_pte)) {
> > > +			if ((max_nr != 1) && maybe_contiguous_pte_pfns(old_ptep, old_pte)) {
> > > +				struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);
> > > +
> > > +				if (folio && folio_test_large(folio))
> > > +					nr = folio_pte_batch(folio, old_addr, old_ptep,
> > > +					old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
> >
> > Indentation seems completely broken here? I also hate that we're nesting to this
> > degree? Can we please find a way not to?
> >
> > This function is already a bit of a clogged mess, I'd rather we clean up then
> > add to it.
> >
> > (See above again :)
> >
> >
> > > +			}
> > >   			force_flush = true;
> > > +		}
> > > +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
> > >   		pte = move_pte(pte, old_addr, new_addr);
> > >   		pte = move_soft_dirty_pte(pte);
> > >
> > > @@ -266,7 +280,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   				else if (is_swap_pte(pte))
> > >   					pte = pte_swp_clear_uffd_wp(pte);
> > >   			}
> > > -			set_pte_at(mm, new_addr, new_ptep, pte);
> > > +			set_ptes(mm, new_addr, new_ptep, pte, nr);
> > >   		}
> > >   	}
> > >
> > > --
> > > 2.30.2
> > >
> >
> > Cheers, Lorenzo
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
  2025-05-06  5:00 ` [PATCH 2/3] mm: Add generic helper to hint a large folio Dev Jain
  2025-05-06  9:10   ` Anshuman Khandual
@ 2025-05-06 15:46   ` Matthew Wilcox
  2025-05-07  3:43     ` Dev Jain
  2025-05-07 10:03   ` David Hildenbrand
  2 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2025-05-06 15:46 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, david, peterx, ryan.roberts, mingo,
	libang.li, maobibo, zhengqi.arch, baohua, anshuman.khandual,
	ioworker0, yang

On Tue, May 06, 2025 at 10:30:55AM +0530, Dev Jain wrote:
> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == PAGE_SIZE);

umm ... PFN is in units of PAGE_SIZE.  Did you mean == 1?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
  2025-05-06 15:46   ` Matthew Wilcox
@ 2025-05-07  3:43     ` Dev Jain
  0 siblings, 0 replies; 27+ messages in thread
From: Dev Jain @ 2025-05-07  3:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, david, peterx, ryan.roberts, mingo,
	libang.li, maobibo, zhengqi.arch, baohua, anshuman.khandual,
	ioworker0, yang



On 06/05/25 9:16 pm, Matthew Wilcox wrote:
> On Tue, May 06, 2025 at 10:30:55AM +0530, Dev Jain wrote:
>> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == PAGE_SIZE);
> 
> umm ... PFN is in units of PAGE_SIZE.  Did you mean == 1?

My bad, thanks for pointing out.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
  2025-05-06  5:00 ` [PATCH 2/3] mm: Add generic helper to hint a large folio Dev Jain
  2025-05-06  9:10   ` Anshuman Khandual
  2025-05-06 15:46   ` Matthew Wilcox
@ 2025-05-07 10:03   ` David Hildenbrand
  2025-05-08  5:02     ` Dev Jain
  2 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-05-07 10:03 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang

On 06.05.25 07:00, Dev Jain wrote:
> To use PTE batching, we want to determine whether the folio mapped by
> the PTE is large, thus requiring the use of vm_normal_folio(). We want
> to avoid the cost of vm_normal_folio() if the code path doesn't already
> require the folio. For arm64, pte_batch_hint() does the job. To generalize
> this hint, add a helper which will determine whether two consecutive PTEs
> point to consecutive PFNs, in which case there is a high probability that
> the underlying folio is large.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>   include/linux/pgtable.h | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..28e21fcc7837 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -369,6 +369,22 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>   }
>   #endif
>   
> +/* Caller must ensure that ptep + 1 exists */
> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
> +{
> +	pte_t *next_ptep, next_pte;
> +
> +	if (pte_batch_hint(ptep, pte) != 1)
> +		return true;
> +
> +	next_ptep = ptep + 1;
> +	next_pte = ptep_get(next_ptep);
> +	if (!pte_present(next_pte))
> +		return false;
> +
> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == PAGE_SIZE);
> +}

So, where we want to use that is:

if (pte_present(old_pte)) {
	if ((max_nr != 1) && maybe_contiguous_pte_pfns(old_ptep, old_pte)) {
		struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);

		if (folio && folio_test_large(folio))
			nr = folio_pte_batch(folio, old_addr, old_ptep,
					     old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
	}
}

where we won't need the folio later. But want it all part of the same folio?


And the simpler version would be


if (pte_present(old_pte)) {
	if (max_nr != 1) {
		struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);

		if (folio && folio_test_large(folio))
			nr = folio_pte_batch(folio, old_addr, old_ptep,
					     old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
	}
}


Two things come to mind:

(1) Do we *really* care about the vm_normal_folio() + folio_test_large() call that much, that you
have to add this optimization ahead of times ? :)

(2) Do we really need "must be part of the same folio", or could be just batch over present
ptes that map consecutive PFNs? In that case, a helper that avoids folio_pte_batch() completely
might be better.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
  2025-05-07 10:03   ` David Hildenbrand
@ 2025-05-08  5:02     ` Dev Jain
  2025-05-08 10:55       ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Dev Jain @ 2025-05-08  5:02 UTC (permalink / raw)
  To: David Hildenbrand, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang



On 07/05/25 3:33 pm, David Hildenbrand wrote:
> On 06.05.25 07:00, Dev Jain wrote:
>> To use PTE batching, we want to determine whether the folio mapped by
>> the PTE is large, thus requiring the use of vm_normal_folio(). We want
>> to avoid the cost of vm_normal_folio() if the code path doesn't already
>> require the folio. For arm64, pte_batch_hint() does the job. To 
>> generalize
>> this hint, add a helper which will determine whether two consecutive PTEs
>> point to consecutive PFNs, in which case there is a high probability that
>> the underlying folio is large.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pgtable.h | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..28e21fcc7837 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -369,6 +369,22 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>>   }
>>   #endif
>> +/* Caller must ensure that ptep + 1 exists */
>> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
>> +{
>> +    pte_t *next_ptep, next_pte;
>> +
>> +    if (pte_batch_hint(ptep, pte) != 1)
>> +        return true;
>> +
>> +    next_ptep = ptep + 1;
>> +    next_pte = ptep_get(next_ptep);
>> +    if (!pte_present(next_pte))
>> +        return false;
>> +
>> +    return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == PAGE_SIZE);
>> +}
> 
> So, where we want to use that is:
> 
> if (pte_present(old_pte)) {
>      if ((max_nr != 1) && maybe_contiguous_pte_pfns(old_ptep, old_pte)) {
>          struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);
> 
>          if (folio && folio_test_large(folio))
>              nr = folio_pte_batch(folio, old_addr, old_ptep,
>                           old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
>      }
> }
> 
> where we won't need the folio later. But want it all part of the same 
> folio?
> 
> 
> And the simpler version would be
> 
> 
> if (pte_present(old_pte)) {
>      if (max_nr != 1) {
>          struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);
> 
>          if (folio && folio_test_large(folio))
>              nr = folio_pte_batch(folio, old_addr, old_ptep,
>                           old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
>      }
> }
> 
> 
> Two things come to mind:
> 
> (1) Do we *really* care about the vm_normal_folio() + folio_test_large() 
> call that much, that you
> have to add this optimization ahead of times ? :)

For my mprotect series, I see a regression of almost (7.7 - 7.65)/7.7 = 
0.65% for the small folio case. I am happy to remove this 
micro-optimization if that is the preference.

> 
> (2) Do we really need "must be part of the same folio", or could be just 
> batch over present
> ptes that map consecutive PFNs? In that case, a helper that avoids 
> folio_pte_batch() completely
> might be better.
> 
I am not sure I get you here. folio_pte_batch() seems to be the simplest 
thing we can do as being done around in the code elsewhere, I am not 
aware of any alternate.





^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
  2025-05-08  5:02     ` Dev Jain
@ 2025-05-08 10:55       ` David Hildenbrand
  2025-05-09  5:25         ` Dev Jain
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-05-08 10:55 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang


>> (2) Do we really need "must be part of the same folio", or could be just
>> batch over present
>> ptes that map consecutive PFNs? In that case, a helper that avoids
>> folio_pte_batch() completely
>> might be better.
>>
> I am not sure I get you here. folio_pte_batch() seems to be the simplest
> thing we can do as being done around in the code elsewhere, I am not
> aware of any alternate.

If we don't need the folio, then we can have a batching function that
doesn't require the folio.

Likely, we could even factor that (non-folio batching) out from folio_pte_batch().
The recent fix [1] might make that easier. See below.


So my question is: is something relying on all of these PTEs to point at the same folio?

[1] https://lkml.kernel.org/r/20250502215019.822-2-arkamar@atlas.cz


Something like this: (would need kerneldoc, probably remove "addr" parameter from folio_pte_batch(),
and look into other related cleanups as discussed with Andrew)


 From f56f67ee5ae9879adb99a8da37fa7ec848c4d256 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Thu, 8 May 2025 12:53:52 +0200
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  mm/internal.h | 84 ++++++++++++++++++++++++++++-----------------------
  1 file changed, 46 insertions(+), 38 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 25a29872c634b..53ff8f8a7c8f9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -217,36 +217,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  	return pte_wrprotect(pte_mkold(pte));
  }
  
-/**
- * folio_pte_batch - detect a PTE batch for a large folio
- * @folio: The large folio to detect a PTE batch for.
- * @addr: The user virtual address the first page is mapped at.
- * @start_ptep: Page table pointer for the first entry.
- * @pte: Page table entry for the first page.
- * @max_nr: The maximum number of table entries to consider.
- * @flags: Flags to modify the PTE batch semantics.
- * @any_writable: Optional pointer to indicate whether any entry except the
- *		  first one is writable.
- * @any_young: Optional pointer to indicate whether any entry except the
- *		  first one is young.
- * @any_dirty: Optional pointer to indicate whether any entry except the
- *		  first one is dirty.
- *
- * Detect a PTE batch: consecutive (present) PTEs that map consecutive
- * pages of the same large folio.
- *
- * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
- * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
- * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
- *
- * start_ptep must map any page of the folio. max_nr must be at least one and
- * must be limited by the caller so scanning cannot exceed a single page table.
- *
- * Return: the number of table entries in the batch.
- */
-static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
-		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
-		bool *any_writable, bool *any_young, bool *any_dirty)
+static inline int pte_batch(pte_t *start_ptep, pte_t pte, int max_nr,
+		fpb_t flags, bool *any_writable, bool *any_young, bool *any_dirty)
  {
  	pte_t expected_pte, *ptep;
  	bool writable, young, dirty;
@@ -259,14 +231,6 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
  	if (any_dirty)
  		*any_dirty = false;
  
-	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
-	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
-	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
-
-	/* Limit max_nr to the actual remaining PFNs in the folio we could batch. */
-	max_nr = min_t(unsigned long, max_nr,
-		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
-
  	nr = pte_batch_hint(start_ptep, pte);
  	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
  	ptep = start_ptep + nr;
@@ -300,6 +264,50 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
  	return min(nr, max_nr);
  }
  
+/**
+ * folio_pte_batch - detect a PTE batch for a large folio
+ * @folio: The large folio to detect a PTE batch for.
+ * @addr: The user virtual address the first page is mapped at.
+ * @start_ptep: Page table pointer for the first entry.
+ * @pte: Page table entry for the first page.
+ * @max_nr: The maximum number of table entries to consider.
+ * @flags: Flags to modify the PTE batch semantics.
+ * @any_writable: Optional pointer to indicate whether any entry except the
+ *		  first one is writable.
+ * @any_young: Optional pointer to indicate whether any entry except the
+ *		  first one is young.
+ * @any_dirty: Optional pointer to indicate whether any entry except the
+ *		  first one is dirty.
+ *
+ * Detect a PTE batch: consecutive (present) PTEs that map consecutive
+ * pages of the same large folio.
+ *
+ * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
+ * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
+ * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
+ *
+ * start_ptep must map any page of the folio. max_nr must be at least one and
+ * must be limited by the caller so scanning cannot exceed a single page table.
+ *
+ * Return: the number of table entries in the batch.
+ */
+static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
+		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
+		bool *any_writable, bool *any_young, bool *any_dirty)
+{
+
+	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
+	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
+	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
+
+	/* Limit max_nr to the actual remaining PFNs in the folio we could batch. */
+	max_nr = min_t(unsigned long, max_nr,
+		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
+
+	return pte_batch(start_ptep, pte, max_nr, flags, any_writable, any_young,
+			 any_dirty);
+}
+
  /**
   * pte_move_swp_offset - Move the swap entry offset field of a swap pte
   *	 forward or backward by delta
-- 
2.49.0


-- 
Cheers,

David / dhildenb



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
  2025-05-08 10:55       ` David Hildenbrand
@ 2025-05-09  5:25         ` Dev Jain
  2025-05-09  9:16           ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Dev Jain @ 2025-05-09  5:25 UTC (permalink / raw)
  To: David Hildenbrand, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang



On 08/05/25 4:25 pm, David Hildenbrand wrote:
> 
>>> (2) Do we really need "must be part of the same folio", or could be just
>>> batch over present
>>> ptes that map consecutive PFNs? In that case, a helper that avoids
>>> folio_pte_batch() completely
>>> might be better.
>>>
>> I am not sure I get you here. folio_pte_batch() seems to be the simplest
>> thing we can do as being done around in the code elsewhere, I am not
>> aware of any alternate.
> 
> If we don't need the folio, then we can have a batching function that
> doesn't require the folio.
> 
> Likely, we could even factor that (non-folio batching) out from 
> folio_pte_batch().
> The recent fix [1] might make that easier. See below.
> 
> 
> So my question is: is something relying on all of these PTEs to point at 
> the same folio?

Hmm...get_and_clear_full_ptes, as you say in another mail, will require 
that...

> 
> [1] https://lkml.kernel.org/r/20250502215019.822-2-arkamar@atlas.cz
> 
> 
> Something like this: (would need kerneldoc, probably remove "addr" 
> parameter from folio_pte_batch(),
> and look into other related cleanups as discussed with Andrew)

I like this refactoring! Can you tell the commit hash on which you make 
the patch, I cannot apply it.

So we need to collect/not collect a/d bits according to whether the pte 
batch belongs to a large folio/small folios. Seems complicated :)

> 
> 
>  From f56f67ee5ae9879adb99a8da37fa7ec848c4d256 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Thu, 8 May 2025 12:53:52 +0200
> Subject: [PATCH] tmp
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   mm/internal.h | 84 ++++++++++++++++++++++++++++-----------------------
>   1 file changed, 46 insertions(+), 38 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 25a29872c634b..53ff8f8a7c8f9 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -217,36 +217,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t 
> pte, fpb_t flags)
>       return pte_wrprotect(pte_mkold(pte));
>   }
> 
> -/**
> - * folio_pte_batch - detect a PTE batch for a large folio
> - * @folio: The large folio to detect a PTE batch for.
> - * @addr: The user virtual address the first page is mapped at.
> - * @start_ptep: Page table pointer for the first entry.
> - * @pte: Page table entry for the first page.
> - * @max_nr: The maximum number of table entries to consider.
> - * @flags: Flags to modify the PTE batch semantics.
> - * @any_writable: Optional pointer to indicate whether any entry except 
> the
> - *          first one is writable.
> - * @any_young: Optional pointer to indicate whether any entry except the
> - *          first one is young.
> - * @any_dirty: Optional pointer to indicate whether any entry except the
> - *          first one is dirty.
> - *
> - * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> - * pages of the same large folio.
> - *
> - * All PTEs inside a PTE batch have the same PTE bits set, excluding 
> the PFN,
> - * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
> - * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
> - *
> - * start_ptep must map any page of the folio. max_nr must be at least 
> one and
> - * must be limited by the caller so scanning cannot exceed a single 
> page table.
> - *
> - * Return: the number of table entries in the batch.
> - */
> -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> -        pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> -        bool *any_writable, bool *any_young, bool *any_dirty)
> +static inline int pte_batch(pte_t *start_ptep, pte_t pte, int max_nr,
> +        fpb_t flags, bool *any_writable, bool *any_young, bool *any_dirty)
>   {
>       pte_t expected_pte, *ptep;
>       bool writable, young, dirty;
> @@ -259,14 +231,6 @@ static inline int folio_pte_batch(struct folio 
> *folio, unsigned long addr,
>       if (any_dirty)
>           *any_dirty = false;
> 
> -    VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> -    VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
> -    VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, 
> folio);
> -
> -    /* Limit max_nr to the actual remaining PFNs in the folio we could 
> batch. */
> -    max_nr = min_t(unsigned long, max_nr,
> -               folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
> -
>       nr = pte_batch_hint(start_ptep, pte);
>       expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), 
> flags);
>       ptep = start_ptep + nr;
> @@ -300,6 +264,50 @@ static inline int folio_pte_batch(struct folio 
> *folio, unsigned long addr,
>       return min(nr, max_nr);
>   }
> 
> +/**
> + * folio_pte_batch - detect a PTE batch for a large folio
> + * @folio: The large folio to detect a PTE batch for.
> + * @addr: The user virtual address the first page is mapped at.
> + * @start_ptep: Page table pointer for the first entry.
> + * @pte: Page table entry for the first page.
> + * @max_nr: The maximum number of table entries to consider.
> + * @flags: Flags to modify the PTE batch semantics.
> + * @any_writable: Optional pointer to indicate whether any entry except 
> the
> + *          first one is writable.
> + * @any_young: Optional pointer to indicate whether any entry except the
> + *          first one is young.
> + * @any_dirty: Optional pointer to indicate whether any entry except the
> + *          first one is dirty.
> + *
> + * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> + * pages of the same large folio.
> + *
> + * All PTEs inside a PTE batch have the same PTE bits set, excluding 
> the PFN,
> + * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
> + * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
> + *
> + * start_ptep must map any page of the folio. max_nr must be at least 
> one and
> + * must be limited by the caller so scanning cannot exceed a single 
> page table.
> + *
> + * Return: the number of table entries in the batch.
> + */
> +static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> +        pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> +        bool *any_writable, bool *any_young, bool *any_dirty)
> +{
> +
> +    VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> +    VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
> +    VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, 
> folio);
> +
> +    /* Limit max_nr to the actual remaining PFNs in the folio we could 
> batch. */
> +    max_nr = min_t(unsigned long, max_nr,
> +               folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
> +
> +    return pte_batch(start_ptep, pte, max_nr, flags, any_writable, 
> any_young,
> +             any_dirty);
> +}
> +
>   /**
>    * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>    *     forward or backward by delta



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
  2025-05-09  5:25         ` Dev Jain
@ 2025-05-09  9:16           ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2025-05-09  9:16 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang

On 09.05.25 07:25, Dev Jain wrote:
> 
> 
> On 08/05/25 4:25 pm, David Hildenbrand wrote:
>>
>>>> (2) Do we really need "must be part of the same folio", or could be just
>>>> batch over present
>>>> ptes that map consecutive PFNs? In that case, a helper that avoids
>>>> folio_pte_batch() completely
>>>> might be better.
>>>>
>>> I am not sure I get you here. folio_pte_batch() seems to be the simplest
>>> thing we can do as being done around in the code elsewhere, I am not
>>> aware of any alternate.
>>
>> If we don't need the folio, then we can have a batching function that
>> doesn't require the folio.
>>
>> Likely, we could even factor that (non-folio batching) out from
>> folio_pte_batch().
>> The recent fix [1] might make that easier. See below.
>>
>>
>> So my question is: is something relying on all of these PTEs to point at
>> the same folio?
> 
> Hmm...get_and_clear_full_ptes, as you say in another mail, will require
> that...
> 
>>
>> [1] https://lkml.kernel.org/r/20250502215019.822-2-arkamar@atlas.cz
>>
>>
>> Something like this: (would need kerneldoc, probably remove "addr"
>> parameter from folio_pte_batch(),
>> and look into other related cleanups as discussed with Andrew)
> 
> I like this refactoring! Can you tell the commit hash on which you make
> the patch, I cannot apply it.

Oh, it was just on top of my private version of [1]. It should now be in 
mm-new (or mm-unstable, did not check).

But as raised in my other mail, get_and_clear_full_ptes() might be 
problematic across folios.


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2025-05-09  9:16 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-06  5:00 [PATCH 0/3] Optimize mremap() by PTE-batching Dev Jain
2025-05-06  5:00 ` [PATCH 1/3] mm: Call pointers to ptes as ptep Dev Jain
2025-05-06  8:50   ` Anshuman Khandual
2025-05-06  9:05     ` Lorenzo Stoakes
2025-05-06 10:52   ` Lorenzo Stoakes
2025-05-06 11:52     ` Dev Jain
2025-05-06  5:00 ` [PATCH 2/3] mm: Add generic helper to hint a large folio Dev Jain
2025-05-06  9:10   ` Anshuman Khandual
2025-05-06 13:34     ` Lorenzo Stoakes
2025-05-06 15:46   ` Matthew Wilcox
2025-05-07  3:43     ` Dev Jain
2025-05-07 10:03   ` David Hildenbrand
2025-05-08  5:02     ` Dev Jain
2025-05-08 10:55       ` David Hildenbrand
2025-05-09  5:25         ` Dev Jain
2025-05-09  9:16           ` David Hildenbrand
2025-05-06  5:00 ` [PATCH 3/3] mm: Optimize mremap() by PTE batching Dev Jain
2025-05-06 10:10   ` Anshuman Khandual
2025-05-06 10:20     ` Dev Jain
2025-05-06 13:49   ` Lorenzo Stoakes
2025-05-06 14:03     ` Lorenzo Stoakes
2025-05-06 14:10     ` Dev Jain
2025-05-06 14:14       ` Lorenzo Stoakes
2025-05-06  9:16 ` [PATCH 0/3] Optimize mremap() by PTE-batching Anshuman Khandual
2025-05-06 10:22   ` Dev Jain
2025-05-06 10:44     ` Lorenzo Stoakes
2025-05-06 11:53       ` Dev Jain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).