linux-mm.kvack.org archive mirror
* [PATCH v3 0/2] Optimize mremap() for large folios
@ 2025-05-27  7:50 Dev Jain
  2025-05-27  7:50 ` [PATCH v3 1/2] mm: Call pointers to ptes as ptep Dev Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Dev Jain @ 2025-05-27  7:50 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd, Dev Jain

Currently move_ptes() iterates through ptes one by one. If the underlying
folio mapped by the ptes is large, we can process those ptes in a batch
using folio_pte_batch(), thus clearing and setting the PTEs in one go.
For arm64 specifically, this results in a 16x reduction in the number of
ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
through all 16 entries to collect a/d bits), and we also elide extra TLBIs
through get_and_clear_full_ptes, replacing ptep_get_and_clear.

Mapping 1M of memory with 64K folios, memsetting it, remapping it to
src + 1M, and munmapping it 10,000 times, the average execution time
reduces from 1.9 to 1.2 seconds, giving a 37% performance optimization,
on Apple M3 (arm64). No regression is observed for small folios.
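
(For reference, 37% here is the relative reduction in execution time: (1.9 - 1.2) / 1.9 ≈ 0.37.)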

The patchset is based on mm-unstable (6ebffe676fcf).

Test program for reference:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>

#define SIZE (1UL << 20) // 1M

int main(void) {
    void *new_addr, *addr;

    for (int i = 0; i < 10000; ++i) {
        addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(addr, 0xAA, SIZE);

        new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
        if (new_addr != (addr + SIZE)) {
                perror("mremap");
                return 1;
        }
        munmap(new_addr, SIZE);
    }

}
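
For reference, the cover letter does not show how the average runtime was
collected. A minimal sketch of one possible timing harness around the same
loop follows; the use of clock_gettime(CLOCK_MONOTONIC) and the ITERS macro
are illustrative assumptions, not the author's actual methodology:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define SIZE (1UL << 20) // 1M
#define ITERS 10000

int main(void) {
    struct timespec t0, t1;
    void *new_addr, *addr;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; ++i) {
        /* Same mmap/memset/mremap/munmap sequence as in the program above */
        addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(addr, 0xAA, SIZE);

        new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
        if (new_addr != (addr + SIZE)) {
                perror("mremap");
                return 1;
        }
        munmap(new_addr, SIZE);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Print the total wall-clock time for all iterations */
    printf("total: %.3f s\n", (t1.tv_sec - t0.tv_sec) +
                              (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}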

v2->v3:
 - Refactor mremap_folio_pte_batch, drop maybe_contiguous_pte_pfns, fix
   indentation (Lorenzo), fix cover letter description (512K -> 1M)

v1->v2:
 - Expand patch descriptions, move pte declarations to a new line,
   reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
   fix loop iteration (Lorenzo)
 - Merge patch 2 and 3 (Anshuman, Lorenzo)
 - Fix maybe_contiguous_pte_pfns (Willy)

Dev Jain (2):
  mm: Call pointers to ptes as ptep
  mm: Optimize mremap() by PTE batching

 mm/mremap.c | 57 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 42 insertions(+), 15 deletions(-)

-- 
2.30.2




* [PATCH v3 1/2] mm: Call pointers to ptes as ptep
  2025-05-27  7:50 [PATCH v3 0/2] Optimize mremap() for large folios Dev Jain
@ 2025-05-27  7:50 ` Dev Jain
  2025-05-27  7:50 ` [PATCH v3 2/2] mm: Optimize mremap() by PTE batching Dev Jain
  2025-05-27 10:50 ` [PATCH v3 0/2] Optimize mremap() for large folios Lorenzo Stoakes
  2 siblings, 0 replies; 14+ messages in thread
From: Dev Jain @ 2025-05-27  7:50 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd, Dev Jain

Avoid confusion between pte_t* and pte_t data types by suffixing pointer
type variables with p. No functional change.

Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mremap.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 7db9da609c84..0163e02e5aa8 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -176,7 +176,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	struct vm_area_struct *vma = pmc->old;
 	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *old_pte, *new_pte, pte;
+	pte_t *old_ptep, *new_ptep;
+	pte_t pte;
 	pmd_t dummy_pmdval;
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
@@ -211,8 +212,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	 * We don't have to worry about the ordering of src and dst
 	 * pte locks because exclusive mmap_lock prevents deadlock.
 	 */
-	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
-	if (!old_pte) {
+	old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
+	if (!old_ptep) {
 		err = -EAGAIN;
 		goto out;
 	}
@@ -223,10 +224,10 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	 * mmap_lock, so this new_pte page is stable, so there is no need to get
 	 * pmdval and do pmd_same() check.
 	 */
-	new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
+	new_ptep = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
 					   &new_ptl);
-	if (!new_pte) {
-		pte_unmap_unlock(old_pte, old_ptl);
+	if (!new_ptep) {
+		pte_unmap_unlock(old_ptep, old_ptl);
 		err = -EAGAIN;
 		goto out;
 	}
@@ -235,12 +236,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
-	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
-				   new_pte++, new_addr += PAGE_SIZE) {
-		if (pte_none(ptep_get(old_pte)))
+	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
+				   new_ptep++, new_addr += PAGE_SIZE) {
+		if (pte_none(ptep_get(old_ptep)))
 			continue;
 
-		pte = ptep_get_and_clear(mm, old_addr, old_pte);
+		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
 		/*
 		 * If we are remapping a valid PTE, make sure
 		 * to flush TLB before we drop the PTL for the
@@ -258,7 +259,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		pte = move_soft_dirty_pte(pte);
 
 		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
-			pte_clear(mm, new_addr, new_pte);
+			pte_clear(mm, new_addr, new_ptep);
 		else {
 			if (need_clear_uffd_wp) {
 				if (pte_present(pte))
@@ -266,7 +267,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 				else if (is_swap_pte(pte))
 					pte = pte_swp_clear_uffd_wp(pte);
 			}
-			set_pte_at(mm, new_addr, new_pte, pte);
+			set_pte_at(mm, new_addr, new_ptep, pte);
 		}
 	}
 
@@ -275,8 +276,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		flush_tlb_range(vma, old_end - len, old_end);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
-	pte_unmap(new_pte - 1);
-	pte_unmap_unlock(old_pte - 1, old_ptl);
+	pte_unmap(new_ptep - 1);
+	pte_unmap_unlock(old_ptep - 1, old_ptl);
 out:
 	if (pmc->need_rmap_locks)
 		drop_rmap_locks(vma);
-- 
2.30.2




* [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
  2025-05-27  7:50 [PATCH v3 0/2] Optimize mremap() for large folios Dev Jain
  2025-05-27  7:50 ` [PATCH v3 1/2] mm: Call pointers to ptes as ptep Dev Jain
@ 2025-05-27  7:50 ` Dev Jain
  2025-05-27 10:45   ` Lorenzo Stoakes
  2025-05-27 10:50 ` [PATCH v3 0/2] Optimize mremap() for large folios Lorenzo Stoakes
  2 siblings, 1 reply; 14+ messages in thread
From: Dev Jain @ 2025-05-27  7:50 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd, Dev Jain

Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
are painted with the contig bit, then ptep_get() will iterate through all 16
entries to collect a/d bits. Hence this optimization will result in a 16x
reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
will eventually call contpte_try_unfold() on every contig block, thus
flushing the TLB for the complete large folio range. Instead, use
get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
do them on the starting and ending contig block.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mremap.c | 40 +++++++++++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 0163e02e5aa8..580b41f8d169 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -170,6 +170,24 @@ static pte_t move_soft_dirty_pte(pte_t pte)
 	return pte;
 }
 
+/* mremap a batch of PTEs mapping the same large folio */
+static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
+		pte_t *ptep, pte_t pte, int max_nr)
+{
+	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+	struct folio *folio;
+
+	if (max_nr == 1)
+		return 1;
+
+	folio = vm_normal_folio(vma, addr, pte);
+	if (!folio || !folio_test_large(folio))
+		return 1;
+
+	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
+			       NULL, NULL);
+}
+
 static int move_ptes(struct pagetable_move_control *pmc,
 		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
 {
@@ -177,7 +195,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_ptep, *new_ptep;
-	pte_t pte;
+	pte_t old_pte, pte;
 	pmd_t dummy_pmdval;
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
@@ -185,6 +203,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	unsigned long new_addr = pmc->new_addr;
 	unsigned long old_end = old_addr + extent;
 	unsigned long len = old_end - old_addr;
+	int max_nr_ptes;
+	int nr_ptes;
 	int err = 0;
 
 	/*
@@ -236,12 +256,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
-	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
-				   new_ptep++, new_addr += PAGE_SIZE) {
-		if (pte_none(ptep_get(old_ptep)))
+	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
+		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
+		nr_ptes = 1;
+		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
+		old_pte = ptep_get(old_ptep);
+		if (pte_none(old_pte))
 			continue;
 
-		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
 		/*
 		 * If we are remapping a valid PTE, make sure
 		 * to flush TLB before we drop the PTL for the
@@ -253,8 +275,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		 * the TLB entry for the old mapping has been
 		 * flushed.
 		 */
-		if (pte_present(pte))
+		if (pte_present(old_pte)) {
+			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
+							 old_pte, max_nr_ptes);
 			force_flush = true;
+		}
+		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
 		pte = move_pte(pte, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);
 
@@ -267,7 +293,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 				else if (is_swap_pte(pte))
 					pte = pte_swp_clear_uffd_wp(pte);
 			}
-			set_pte_at(mm, new_addr, new_ptep, pte);
+			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
 		}
 	}
 
-- 
2.30.2




* Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
  2025-05-27  7:50 ` [PATCH v3 2/2] mm: Optimize mremap() by PTE batching Dev Jain
@ 2025-05-27 10:45   ` Lorenzo Stoakes
  2025-05-27 16:22     ` Dev Jain
  0 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Stoakes @ 2025-05-27 10:45 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Tue, May 27, 2025 at 01:20:49PM +0530, Dev Jain wrote:
> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all 16
> entries to collect a/d bits. Hence this optimization will result in a 16x
> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> will eventually call contpte_try_unfold() on every contig block, thus
> flushing the TLB for the complete large folio range. Instead, use
> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> do them on the starting and ending contig block.

But you're also making this applicable to non-contpte cases?

See below, but the commit message should clearly point out this is general
for page table split large folios (unless I've missed something of course!
:)

>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mremap.c | 40 +++++++++++++++++++++++++++++++++-------
>  1 file changed, 33 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 0163e02e5aa8..580b41f8d169 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -170,6 +170,24 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>  	return pte;
>  }
>
> +/* mremap a batch of PTEs mapping the same large folio */

I think this comment is fairly useless, it basically spells out the function
name.

I'd prefer something like 'determine if a PTE contains physically contiguous
entries which map the same large folio'.

> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t pte, int max_nr)
> +{
> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +	struct folio *folio;
> +
> +	if (max_nr == 1)
> +		return 1;
> +
> +	folio = vm_normal_folio(vma, addr, pte);
> +	if (!folio || !folio_test_large(folio))
> +		return 1;
> +
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> +			       NULL, NULL);
> +}

The code is much better however! :)

> +
>  static int move_ptes(struct pagetable_move_control *pmc,
>  		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>  {
> @@ -177,7 +195,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>  	struct mm_struct *mm = vma->vm_mm;
>  	pte_t *old_ptep, *new_ptep;
> -	pte_t pte;
> +	pte_t old_pte, pte;
>  	pmd_t dummy_pmdval;
>  	spinlock_t *old_ptl, *new_ptl;
>  	bool force_flush = false;
> @@ -185,6 +203,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	unsigned long new_addr = pmc->new_addr;
>  	unsigned long old_end = old_addr + extent;
>  	unsigned long len = old_end - old_addr;
> +	int max_nr_ptes;
> +	int nr_ptes;
>  	int err = 0;
>
>  	/*
> @@ -236,12 +256,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>
> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> -				   new_ptep++, new_addr += PAGE_SIZE) {
> -		if (pte_none(ptep_get(old_ptep)))
> +	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
> +		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
> +		nr_ptes = 1;
> +		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
> +		old_pte = ptep_get(old_ptep);
> +		if (pte_none(old_pte))
>  			continue;
>
> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>  		/*
>  		 * If we are remapping a valid PTE, make sure
>  		 * to flush TLB before we drop the PTL for the
> @@ -253,8 +275,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		 * the TLB entry for the old mapping has been
>  		 * flushed.
>  		 */
> -		if (pte_present(pte))
> +		if (pte_present(old_pte)) {
> +			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> +							 old_pte, max_nr_ptes);
>  			force_flush = true;
> +		}
> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);

Just to clarify, in the previous revision you said:

"Split THPs won't be batched; you can use pte_batch() (from David's refactoring)
and figure the split THP batch out, but then get_and_clear_full_ptes() will be
gathering a/d bits and smearing them across the batch, which will be incorrect."

But... this will be triggered for page table split large folio no?

So is there something wrong here or not?

>  		pte = move_pte(pte, old_addr, new_addr);
>  		pte = move_soft_dirty_pte(pte);
>
> @@ -267,7 +293,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  				else if (is_swap_pte(pte))
>  					pte = pte_swp_clear_uffd_wp(pte);
>  			}
> -			set_pte_at(mm, new_addr, new_ptep, pte);
> +			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);

The code looks much better here after refactoring, however!

>  		}
>  	}
>
> --
> 2.30.2
>



* Re: [PATCH v3 0/2] Optimize mremap() for large folios
  2025-05-27  7:50 [PATCH v3 0/2] Optimize mremap() for large folios Dev Jain
  2025-05-27  7:50 ` [PATCH v3 1/2] mm: Call pointers to ptes as ptep Dev Jain
  2025-05-27  7:50 ` [PATCH v3 2/2] mm: Optimize mremap() by PTE batching Dev Jain
@ 2025-05-27 10:50 ` Lorenzo Stoakes
  2025-05-27 16:26   ` Dev Jain
  2 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Stoakes @ 2025-05-27 10:50 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

I seem to recall we agreed you'd hold off on this until the mprotect work
was done :>) I see a lot of review there and was expecting a respin, unless
I'm mistaken?

At any rate we're in the merge window now so it's maybe not quite as
important now :)

We're pretty close to this being done anyway, just need some feedback on
points raised (obviously David et al. may have further comments).

Thanks, Lorenzo

On Tue, May 27, 2025 at 01:20:47PM +0530, Dev Jain wrote:
> Currently move_ptes() iterates through ptes one by one. If the underlying
> folio mapped by the ptes is large, we can process those ptes in a batch
> using folio_pte_batch(), thus clearing and setting the PTEs in one go.
> For arm64 specifically, this results in a 16x reduction in the number of
> ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
> through all 16 entries to collect a/d bits), and we also elide extra TLBIs
> through get_and_clear_full_ptes, replacing ptep_get_and_clear.

OK this is more general than the stuff in 2/2, so you are doing this work
for page-table split large folios also.

I do think this _should_ be fine for that unless I've missed something. At
any rate I've commented on this in 2/2.

>
> Mapping 1M of memory with 64K folios, memsetting it, remapping it to
> src + 1M, and munmapping it 10,000 times, the average execution time
> reduces from 1.9 to 1.2 seconds, giving a 37% performance optimization,
> on Apple M3 (arm64). No regression is observed for small folios.
>
> The patchset is based on mm-unstable (6ebffe676fcf).
>
> Test program for reference:
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <errno.h>
>
> #define SIZE (1UL << 20) // 1M
>
> int main(void) {
>     void *new_addr, *addr;
>
>     for (int i = 0; i < 10000; ++i) {
>         addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
>                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (addr == MAP_FAILED) {
>                 perror("mmap");
>                 return 1;
>         }
>         memset(addr, 0xAA, SIZE);
>
>         new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
>         if (new_addr != (addr + SIZE)) {
>                 perror("mremap");
>                 return 1;
>         }
>         munmap(new_addr, SIZE);
>     }
>
> }
>
> v2->v3:
>  - Refactor mremap_folio_pte_batch, drop maybe_contiguous_pte_pfns, fix
>    indentation (Lorenzo), fix cover letter description (512K -> 1M)
>
> v1->v2:
>  - Expand patch descriptions, move pte declarations to a new line,
>    reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
>    fix loop iteration (Lorenzo)
>  - Merge patch 2 and 3 (Anshuman, Lorenzo)
>  - Fix maybe_contiguous_pte_pfns (Willy)
>
> Dev Jain (2):
>   mm: Call pointers to ptes as ptep
>   mm: Optimize mremap() by PTE batching
>
>  mm/mremap.c | 57 +++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 42 insertions(+), 15 deletions(-)
>
> --
> 2.30.2
>



* Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
  2025-05-27 10:45   ` Lorenzo Stoakes
@ 2025-05-27 16:22     ` Dev Jain
  2025-05-27 16:29       ` Lorenzo Stoakes
  0 siblings, 1 reply; 14+ messages in thread
From: Dev Jain @ 2025-05-27 16:22 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd


On 27/05/25 4:15 pm, Lorenzo Stoakes wrote:
> On Tue, May 27, 2025 at 01:20:49PM +0530, Dev Jain wrote:
>> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>> are painted with the contig bit, then ptep_get() will iterate through all 16
>> entries to collect a/d bits. Hence this optimization will result in a 16x
>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>> will eventually call contpte_try_unfold() on every contig block, thus
>> flushing the TLB for the complete large folio range. Instead, use
>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>> do them on the starting and ending contig block.
> But you're also making this applicable to non-contpte cases?
>
> See below, but the commit message should clearly point out this is general
> for page table split large folios (unless I've missed something of course!
> :)
>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mremap.c | 40 +++++++++++++++++++++++++++++++++-------
>>   1 file changed, 33 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 0163e02e5aa8..580b41f8d169 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -170,6 +170,24 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>   	return pte;
>>   }
>>
>> +/* mremap a batch of PTEs mapping the same large folio */
> I think this comment is fairly useless, it basically spells out the function
> name.
>
> I'd prefer something like 'determine if a PTE contains physically contiguous
> entries which map the same large folio'.

I'd rather prefer dropping the comment altogether, the function is fairly obvious : )


>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>> +		pte_t *ptep, pte_t pte, int max_nr)
>> +{
>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +	struct folio *folio;
>> +
>> +	if (max_nr == 1)
>> +		return 1;
>> +
>> +	folio = vm_normal_folio(vma, addr, pte);
>> +	if (!folio || !folio_test_large(folio))
>> +		return 1;
>> +
>> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
>> +			       NULL, NULL);
>> +}
> The code is much better however! :)
>
>> +
>>   static int move_ptes(struct pagetable_move_control *pmc,
>>   		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>>   {
>> @@ -177,7 +195,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>   	struct mm_struct *mm = vma->vm_mm;
>>   	pte_t *old_ptep, *new_ptep;
>> -	pte_t pte;
>> +	pte_t old_pte, pte;
>>   	pmd_t dummy_pmdval;
>>   	spinlock_t *old_ptl, *new_ptl;
>>   	bool force_flush = false;
>> @@ -185,6 +203,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	unsigned long new_addr = pmc->new_addr;
>>   	unsigned long old_end = old_addr + extent;
>>   	unsigned long len = old_end - old_addr;
>> +	int max_nr_ptes;
>> +	int nr_ptes;
>>   	int err = 0;
>>
>>   	/*
>> @@ -236,12 +256,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	flush_tlb_batched_pending(vma->vm_mm);
>>   	arch_enter_lazy_mmu_mode();
>>
>> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>> -				   new_ptep++, new_addr += PAGE_SIZE) {
>> -		if (pte_none(ptep_get(old_ptep)))
>> +	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
>> +		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
>> +		nr_ptes = 1;
>> +		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>> +		old_pte = ptep_get(old_ptep);
>> +		if (pte_none(old_pte))
>>   			continue;
>>
>> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>   		/*
>>   		 * If we are remapping a valid PTE, make sure
>>   		 * to flush TLB before we drop the PTL for the
>> @@ -253,8 +275,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   		 * the TLB entry for the old mapping has been
>>   		 * flushed.
>>   		 */
>> -		if (pte_present(pte))
>> +		if (pte_present(old_pte)) {
>> +			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>> +							 old_pte, max_nr_ptes);
>>   			force_flush = true;
>> +		}
>> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
> Just to clarify, in the previous revision you said:
>
> "Split THPs won't be batched; you can use pte_batch() (from David's refactoring)
> and figure the split THP batch out, but then get_and_clear_full_ptes() will be
> gathering a/d bits and smearing them across the batch, which will be incorrect."
>
> But... this will be triggered for page table split large folio no?
>
> So is there something wrong here or not?

Since I am using folio_pte_batch (and not the hypothetical pte_batch() I was
saying in the other email), the batch must belong to the same folio. Since split
THP means a small folio, nr_ptes will be 1.



>
>>   		pte = move_pte(pte, old_addr, new_addr);
>>   		pte = move_soft_dirty_pte(pte);
>>
>> @@ -267,7 +293,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   				else if (is_swap_pte(pte))
>>   					pte = pte_swp_clear_uffd_wp(pte);
>>   			}
>> -			set_pte_at(mm, new_addr, new_ptep, pte);
>> +			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
> The code looks much better here after refactoring, however!
>
>>   		}
>>   	}
>>
>> --
>> 2.30.2
>>



* Re: [PATCH v3 0/2] Optimize mremap() for large folios
  2025-05-27 10:50 ` [PATCH v3 0/2] Optimize mremap() for large folios Lorenzo Stoakes
@ 2025-05-27 16:26   ` Dev Jain
  2025-05-27 16:32     ` Lorenzo Stoakes
  0 siblings, 1 reply; 14+ messages in thread
From: Dev Jain @ 2025-05-27 16:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd


On 27/05/25 4:20 pm, Lorenzo Stoakes wrote:
> I seem to recall we agreed you'd hold off on this until the mprotect work
> was done :>) I see a lot of review there and was expecting a respin, unless


Oh, my interpretation was that you requested to hold this off for a bit to get
some review on the mprotect series first, apologies if you meant otherwise! I
posted that a week or so before, so I thought enough time has passed : )


> I'm mistaken?
>
> At any rate we're in the merge window now so it's maybe not quite as
> important now :)
>
> We're pretty close to this being done anyway, just need some feedback on
> points raised (obviously David et al. may have further comments).
>
> Thanks, Lorenzo
>
> On Tue, May 27, 2025 at 01:20:47PM +0530, Dev Jain wrote:
>> Currently move_ptes() iterates through ptes one by one. If the underlying
>> folio mapped by the ptes is large, we can process those ptes in a batch
>> using folio_pte_batch(), thus clearing and setting the PTEs in one go.
>> For arm64 specifically, this results in a 16x reduction in the number of
>> ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
>> through all 16 entries to collect a/d bits), and we also elide extra TLBIs
>> through get_and_clear_full_ptes, replacing ptep_get_and_clear.
> OK this is more general than the stuff in 2/2, so you are doing this work
> for page-table split large folios also.
>
> I do think this _should_ be fine for that unless I've missed something. At
> any rate I've commented on this in 2/2.
>
>> Mapping 1M of memory with 64K folios, memsetting it, remapping it to
>> src + 1M, and munmapping it 10,000 times, the average execution time
>> reduces from 1.9 to 1.2 seconds, giving a 37% performance optimization,
>> on Apple M3 (arm64). No regression is observed for small folios.
>>
>> The patchset is based on mm-unstable (6ebffe676fcf).
>>
>> Test program for reference:
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>> #include <string.h>
>> #include <errno.h>
>>
>> #define SIZE (1UL << 20) // 1M
>>
>> int main(void) {
>>      void *new_addr, *addr;
>>
>>      for (int i = 0; i < 10000; ++i) {
>>          addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
>>                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>          if (addr == MAP_FAILED) {
>>                  perror("mmap");
>>                  return 1;
>>          }
>>          memset(addr, 0xAA, SIZE);
>>
>>          new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
>>          if (new_addr != (addr + SIZE)) {
>>                  perror("mremap");
>>                  return 1;
>>          }
>>          munmap(new_addr, SIZE);
>>      }
>>
>> }
>>
>> v2->v3:
>>   - Refactor mremap_folio_pte_batch, drop maybe_contiguous_pte_pfns, fix
>>     indentation (Lorenzo), fix cover letter description (512K -> 1M)
>>
>> v1->v2:
>>   - Expand patch descriptions, move pte declarations to a new line,
>>     reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
>>     fix loop iteration (Lorenzo)
>>   - Merge patch 2 and 3 (Anshuman, Lorenzo)
>>   - Fix maybe_contiguous_pte_pfns (Willy)
>>
>> Dev Jain (2):
>>    mm: Call pointers to ptes as ptep
>>    mm: Optimize mremap() by PTE batching
>>
>>   mm/mremap.c | 57 +++++++++++++++++++++++++++++++++++++++--------------
>>   1 file changed, 42 insertions(+), 15 deletions(-)
>>
>> --
>> 2.30.2
>>



* Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
  2025-05-27 16:22     ` Dev Jain
@ 2025-05-27 16:29       ` Lorenzo Stoakes
  2025-05-27 16:38         ` Dev Jain
  0 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Stoakes @ 2025-05-27 16:29 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Tue, May 27, 2025 at 09:52:47PM +0530, Dev Jain wrote:
>
> On 27/05/25 4:15 pm, Lorenzo Stoakes wrote:
> > On Tue, May 27, 2025 at 01:20:49PM +0530, Dev Jain wrote:
> > > Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> > > are painted with the contig bit, then ptep_get() will iterate through all 16
> > > entries to collect a/d bits. Hence this optimization will result in a 16x
> > > reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> > > will eventually call contpte_try_unfold() on every contig block, thus
> > > flushing the TLB for the complete large folio range. Instead, use
> > > get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> > > do them on the starting and ending contig block.
> > But you're also making this applicable to non-contpte cases?
> >
> > See below, but the commit message should clearly point out this is general
> > for page table split large folios (unless I've missed something of course!
> > :)
> >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > >   mm/mremap.c | 40 +++++++++++++++++++++++++++++++++-------
> > >   1 file changed, 33 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 0163e02e5aa8..580b41f8d169 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -170,6 +170,24 @@ static pte_t move_soft_dirty_pte(pte_t pte)
> > >   	return pte;
> > >   }
> > >
> > > +/* mremap a batch of PTEs mapping the same large folio */
> > I think this comment is fairly useless, it basically spells out the function
> > name.
> >
> > I'd prefer something like 'determine if a PTE contains physically contiguous
> > entries which map the same large folio'.
>
> I'd rather prefer dropping the comment altogether, the function is fairly obvious : )

Sure fine.

>
>
> > > +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> > > +		pte_t *ptep, pte_t pte, int max_nr)
> > > +{
> > > +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> > > +	struct folio *folio;
> > > +
> > > +	if (max_nr == 1)
> > > +		return 1;
> > > +
> > > +	folio = vm_normal_folio(vma, addr, pte);
> > > +	if (!folio || !folio_test_large(folio))
> > > +		return 1;
> > > +
> > > +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> > > +			       NULL, NULL);
> > > +}
> > The code is much better however! :)
> >
> > > +
> > >   static int move_ptes(struct pagetable_move_control *pmc,
> > >   		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
> > >   {
> > > @@ -177,7 +195,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
> > >   	struct mm_struct *mm = vma->vm_mm;
> > >   	pte_t *old_ptep, *new_ptep;
> > > -	pte_t pte;
> > > +	pte_t old_pte, pte;
> > >   	pmd_t dummy_pmdval;
> > >   	spinlock_t *old_ptl, *new_ptl;
> > >   	bool force_flush = false;
> > > @@ -185,6 +203,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	unsigned long new_addr = pmc->new_addr;
> > >   	unsigned long old_end = old_addr + extent;
> > >   	unsigned long len = old_end - old_addr;
> > > +	int max_nr_ptes;
> > > +	int nr_ptes;
> > >   	int err = 0;
> > >
> > >   	/*
> > > @@ -236,12 +256,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	flush_tlb_batched_pending(vma->vm_mm);
> > >   	arch_enter_lazy_mmu_mode();
> > >
> > > -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> > > -				   new_ptep++, new_addr += PAGE_SIZE) {
> > > -		if (pte_none(ptep_get(old_ptep)))
> > > +	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
> > > +		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
> > > +		nr_ptes = 1;
> > > +		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
> > > +		old_pte = ptep_get(old_ptep);
> > > +		if (pte_none(old_pte))
> > >   			continue;
> > >
> > > -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
> > >   		/*
> > >   		 * If we are remapping a valid PTE, make sure
> > >   		 * to flush TLB before we drop the PTL for the
> > > @@ -253,8 +275,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   		 * the TLB entry for the old mapping has been
> > >   		 * flushed.
> > >   		 */
> > > -		if (pte_present(pte))
> > > +		if (pte_present(old_pte)) {
> > > +			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> > > +							 old_pte, max_nr_ptes);
> > >   			force_flush = true;
> > > +		}
> > > +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
> > Just to clarify, in the previous revision you said:
> >
> > "Split THPs won't be batched; you can use pte_batch() (from David's refactoring)
> > and figure the split THP batch out, but then get_and_clear_full_ptes() will be
> > gathering a/d bits and smearing them across the batch, which will be incorrect."
> >
> > But... this will be triggered for page table split large folio no?
> >
> > So is there something wrong here or not?
>
> Since I am using folio_pte_batch (and not the hypothetical pte_batch() I was
> saying in the other email), the batch must belong to the same folio. Since split
> THP means a small folio, nr_ptes will be 1.

I'm not sure I follow - keep in mind there's two kinds of splitting - folio
splitting and page table splitting.

If I invoke split_huge_pmd(), I end up with a bunch of PTEs mapping the same
large folio. The folio itself is not split, so nr_ptes surely will be equal to
something >1 here right?

I hit this in my MREMAP_RELOCATE_ANON work - where I had to take special care to
differentiate between these cases.

And the comment for folio_pte_batch() states 'Detect a PTE batch: consecutive
(present) PTEs that map consecutive pages of the same large folio.' - so I don't
see why this would not hit this case?

I may be missing something however!

>
>
>
> >
> > >   		pte = move_pte(pte, old_addr, new_addr);
> > >   		pte = move_soft_dirty_pte(pte);
> > >
> > > @@ -267,7 +293,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   				else if (is_swap_pte(pte))
> > >   					pte = pte_swp_clear_uffd_wp(pte);
> > >   			}
> > > -			set_pte_at(mm, new_addr, new_ptep, pte);
> > > +			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
> > The code looks much better here after refactoring, however!
> >
> > >   		}
> > >   	}
> > >
> > > --
> > > 2.30.2
> > >



* Re: [PATCH v3 0/2] Optimize mremap() for large folios
  2025-05-27 16:26   ` Dev Jain
@ 2025-05-27 16:32     ` Lorenzo Stoakes
  0 siblings, 0 replies; 14+ messages in thread
From: Lorenzo Stoakes @ 2025-05-27 16:32 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Tue, May 27, 2025 at 09:56:37PM +0530, Dev Jain wrote:
>
> On 27/05/25 4:20 pm, Lorenzo Stoakes wrote:
> > I seem to recall we agreed you'd hold off on this until the mprotect work
> > was done :>) I see a lot of review there and was expecting a respin, unless
>
>
> Oh, my interpretation was that you requested to hold this off for a bit to get
> some review on the mprotect series first, apologies if you meant otherwise! I
> posted that a week or so before, so I thought enough time has passed : )

Yeah sorry maybe I wasn't clear. At any rate, I don't think we're miles off here
once we resolve the questions, so doesn't matter too much... :)

>
>
> > I'm mistaken?
> >
> > At any rate we're in the merge window now so it's maybe not quite as
> > important now :)
> >
> > We're pretty close to this being done anyway, just need some feedback on
> > points raised (obviously David et al. may have further comments).
> >
> > Thanks, Lorenzo
> >
> > On Tue, May 27, 2025 at 01:20:47PM +0530, Dev Jain wrote:
> > > Currently move_ptes() iterates through ptes one by one. If the underlying
> > > folio mapped by the ptes is large, we can process those ptes in a batch
> > > using folio_pte_batch(), thus clearing and setting the PTEs in one go.
> > > For arm64 specifically, this results in a 16x reduction in the number of
> > > ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
> > > through all 16 entries to collect a/d bits), and we also elide extra TLBIs
> > > through get_and_clear_full_ptes, replacing ptep_get_and_clear.
> > OK this is more general than the stuff in 2/2, so you are doing this work
> > for page-table split large folios also.
> >
> > I do think this _should_ be fine for that unless I've missed something. At
> > any rate I've commented on this in 2/2.
> >
> > > Mapping 1M of memory with 64K folios, memsetting it, remapping it to
> > > src + 1M, and munmapping it 10,000 times, the average execution time
> > > reduces from 1.9 to 1.2 seconds, giving a 37% performance optimization,
> > > on Apple M3 (arm64). No regression is observed for small folios.
> > >
> > > The patchset is based on mm-unstable (6ebffe676fcf).
> > >
> > > Test program for reference:
> > >
> > > #define _GNU_SOURCE
> > > #include <stdio.h>
> > > #include <stdlib.h>
> > > #include <unistd.h>
> > > #include <sys/mman.h>
> > > #include <string.h>
> > > #include <errno.h>
> > >
> > > #define SIZE (1UL << 20) // 1M
> > >
> > > int main(void) {
> > >      void *new_addr, *addr;
> > >
> > >      for (int i = 0; i < 10000; ++i) {
> > >          addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
> > >                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > >          if (addr == MAP_FAILED) {
> > >                  perror("mmap");
> > >                  return 1;
> > >          }
> > >          memset(addr, 0xAA, SIZE);
> > >
> > >          new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
> > >          if (new_addr != (addr + SIZE)) {
> > >                  perror("mremap");
> > >                  return 1;
> > >          }
> > >          munmap(new_addr, SIZE);
> > >      }
> > >
> > > }
> > >
> > > v2->v3:
> > >   - Refactor mremap_folio_pte_batch, drop maybe_contiguous_pte_pfns, fix
> > >     indentation (Lorenzo), fix cover letter description (512K -> 1M)
> > >
> > > v1->v2:
> > >   - Expand patch descriptions, move pte declarations to a new line,
> > >     reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
> > >     fix loop iteration (Lorenzo)
> > >   - Merge patch 2 and 3 (Anshuman, Lorenzo)
> > >   - Fix maybe_contiguous_pte_pfns (Willy)
> > >
> > > Dev Jain (2):
> > >    mm: Call pointers to ptes as ptep
> > >    mm: Optimize mremap() by PTE batching
> > >
> > >   mm/mremap.c | 57 +++++++++++++++++++++++++++++++++++++++--------------
> > >   1 file changed, 42 insertions(+), 15 deletions(-)
> > >
> > > --
> > > 2.30.2
> > >



* Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
  2025-05-27 16:29       ` Lorenzo Stoakes
@ 2025-05-27 16:38         ` Dev Jain
  2025-05-27 16:46           ` Lorenzo Stoakes
  0 siblings, 1 reply; 14+ messages in thread
From: Dev Jain @ 2025-05-27 16:38 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd


On 27/05/25 9:59 pm, Lorenzo Stoakes wrote:
> On Tue, May 27, 2025 at 09:52:47PM +0530, Dev Jain wrote:
>> On 27/05/25 4:15 pm, Lorenzo Stoakes wrote:
>>> On Tue, May 27, 2025 at 01:20:49PM +0530, Dev Jain wrote:
>>>> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>>>> are painted with the contig bit, then ptep_get() will iterate through all 16
>>>> entries to collect a/d bits. Hence this optimization will result in a 16x
>>>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>>>> will eventually call contpte_try_unfold() on every contig block, thus
>>>> flushing the TLB for the complete large folio range. Instead, use
>>>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>>>> do them on the starting and ending contig block.
>>> But you're also making this applicable to non-contpte cases?
>>>
>>> See below, but the commit message should clearly point out this is general
>>> for page table split large folios (unless I've missed something of course!
>>> :)
>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>    mm/mremap.c | 40 +++++++++++++++++++++++++++++++++-------
>>>>    1 file changed, 33 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/mm/mremap.c b/mm/mremap.c
>>>> index 0163e02e5aa8..580b41f8d169 100644
>>>> --- a/mm/mremap.c
>>>> +++ b/mm/mremap.c
>>>> @@ -170,6 +170,24 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>>>    	return pte;
>>>>    }
>>>>
>>>> +/* mremap a batch of PTEs mapping the same large folio */
>>> I think this comment is fairly useless, it basically spells out the function
>>> name.
>>>
>>> I'd prefer something like 'determine if a PTE contains physically contiguous
>>> entries which map the same large folio'.
>> I'd rather prefer dropping the comment altogether, the function is fairly obvious : )
> Sure fine.
>
>>
>>>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>>>> +		pte_t *ptep, pte_t pte, int max_nr)
>>>> +{
>>>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>> +	struct folio *folio;
>>>> +
>>>> +	if (max_nr == 1)
>>>> +		return 1;
>>>> +
>>>> +	folio = vm_normal_folio(vma, addr, pte);
>>>> +	if (!folio || !folio_test_large(folio))
>>>> +		return 1;
>>>> +
>>>> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
>>>> +			       NULL, NULL);
>>>> +}
>>> The code is much better however! :)
>>>
>>>> +
>>>>    static int move_ptes(struct pagetable_move_control *pmc,
>>>>    		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>>>>    {
>>>> @@ -177,7 +195,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>>>    	struct mm_struct *mm = vma->vm_mm;
>>>>    	pte_t *old_ptep, *new_ptep;
>>>> -	pte_t pte;
>>>> +	pte_t old_pte, pte;
>>>>    	pmd_t dummy_pmdval;
>>>>    	spinlock_t *old_ptl, *new_ptl;
>>>>    	bool force_flush = false;
>>>> @@ -185,6 +203,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    	unsigned long new_addr = pmc->new_addr;
>>>>    	unsigned long old_end = old_addr + extent;
>>>>    	unsigned long len = old_end - old_addr;
>>>> +	int max_nr_ptes;
>>>> +	int nr_ptes;
>>>>    	int err = 0;
>>>>
>>>>    	/*
>>>> @@ -236,12 +256,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    	flush_tlb_batched_pending(vma->vm_mm);
>>>>    	arch_enter_lazy_mmu_mode();
>>>>
>>>> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>>>> -				   new_ptep++, new_addr += PAGE_SIZE) {
>>>> -		if (pte_none(ptep_get(old_ptep)))
>>>> +	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
>>>> +		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
>>>> +		nr_ptes = 1;
>>>> +		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>>>> +		old_pte = ptep_get(old_ptep);
>>>> +		if (pte_none(old_pte))
>>>>    			continue;
>>>>
>>>> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>>>    		/*
>>>>    		 * If we are remapping a valid PTE, make sure
>>>>    		 * to flush TLB before we drop the PTL for the
>>>> @@ -253,8 +275,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    		 * the TLB entry for the old mapping has been
>>>>    		 * flushed.
>>>>    		 */
>>>> -		if (pte_present(pte))
>>>> +		if (pte_present(old_pte)) {
>>>> +			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>>>> +							 old_pte, max_nr_ptes);
>>>>    			force_flush = true;
>>>> +		}
>>>> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
>>> Just to clarify, in the previous revision you said:
>>>
>>> "Split THPs won't be batched; you can use pte_batch() (from David's refactoring)
>>> and figure the split THP batch out, but then get_and_clear_full_ptes() will be
>>> gathering a/d bits and smearing them across the batch, which will be incorrect."
>>>
>>> But... this will be triggered for page table split large folio no?
>>>
>>> So is there something wrong here or not?
>> Since I am using folio_pte_batch (and not the hypothetical pte_batch() I was
>> saying in the other email), the batch must belong to the same folio. Since split
>> THP means a small folio, nr_ptes will be 1.
> I'm not sure I follow - keep in mind there's two kinds of splitting - folio
> splitting and page table splitting.
>
> If I invoke split_huge_pmd(), I end up with a bunch of PTEs mapping the same
> large folio. The folio itself is not split, so nr_ptes surely will be equal to
> something >1 here right?


Thanks for elaborating.

So,

Case 1: folio splitting => nr_ptes = 1 => the question of a/d bit smearing
disappears.

Case 2: page table splitting => consec PTEs point to the same large folio
=> nr_ptes > 1 => get_and_clear_full_ptes() will smear a/d bits on the
new ptes, which is correct because we are still pointing to the same large
folio.


>
> I hit this in my MREMAP_RELOCATE_ANON work - where I had to take special care to
> differentiate between these cases.
>
> And the comment for folio_pte_batch() states 'Detect a PTE batch: consecutive
> (present) PTEs that map consecutive pages of the same large folio.' - so I don't
> see why this would not hit this case?
>
> I may be missing something however!
>
>>
>>
>>>>    		pte = move_pte(pte, old_addr, new_addr);
>>>>    		pte = move_soft_dirty_pte(pte);
>>>>
>>>> @@ -267,7 +293,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    				else if (is_swap_pte(pte))
>>>>    					pte = pte_swp_clear_uffd_wp(pte);
>>>>    			}
>>>> -			set_pte_at(mm, new_addr, new_ptep, pte);
>>>> +			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>>> The code looks much better here after refactoring, however!
>>>
>>>>    		}
>>>>    	}
>>>>
>>>> --
>>>> 2.30.2
>>>>



* Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
  2025-05-27 16:38         ` Dev Jain
@ 2025-05-27 16:46           ` Lorenzo Stoakes
  2025-05-28  3:32             ` Dev Jain
  0 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Stoakes @ 2025-05-27 16:46 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Tue, May 27, 2025 at 10:08:59PM +0530, Dev Jain wrote:
>
> On 27/05/25 9:59 pm, Lorenzo Stoakes wrote:
[snip]
> > If I invoke split_huge_pmd(), I end up with a bunch of PTEs mapping the same
> > large folio. The folio itself is not split, so nr_ptes surely will be equal to
> > something >1 here right?
>
>
> Thanks for elaborating.
>
> So,
>
> Case 1: folio splitting => nr_ptes = 1 => the question of a/d bit smearing
> disappears.
>
> Case 2: page table splitting => consec PTEs point to the same large folio
> => nr_ptes > 1 => get_and_clear_full_ptes() will smear a/d bits on the
> new ptes, which is correct because we are still pointing to the same large
> folio.
>

OK awesome, I thought as much, just wanted to make sure :) we are good then.

The accessed/dirty bits really matter at a folio granularity (and especially
with respect to reclaim/writeback which both operate at folio level) so the
smearing as you say is fine.

This patch therefore looks fine, only the trivial comment fixup.

I ran the series on my x86-64 setup (fwiw) with no build/mm selftest errors.

Sorry to be a pain but could you respin with the commit message for this patch
updated to explicitly mention that the logic applies for the non-contPTE split
PTE case (and therefore also helps performance there)? That and the trivial
thing of dropping that comment.

Then we should be good for a tag unless somebody else spots something
egregious :)

Thanks for this! Good improvement.

[snip]

Cheers, Lorenzo



* Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
  2025-05-27 16:46           ` Lorenzo Stoakes
@ 2025-05-28  3:32             ` Dev Jain
  2025-05-28  4:49               ` Lorenzo Stoakes
  0 siblings, 1 reply; 14+ messages in thread
From: Dev Jain @ 2025-05-28  3:32 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd


On 27/05/25 10:16 pm, Lorenzo Stoakes wrote:
> On Tue, May 27, 2025 at 10:08:59PM +0530, Dev Jain wrote:
>> On 27/05/25 9:59 pm, Lorenzo Stoakes wrote:
> [snip]
>>> If I invoke split_huge_pmd(), I end up with a bunch of PTEs mapping the same
>>> large folio. The folio itself is not split, so nr_ptes surely will be equal to
>>> something >1 here right?
>>
>> Thanks for elaborating.
>>
>> So,
>>
>> Case 1: folio splitting => nr_ptes = 1 => the question of a/d bit smearing
>> disappears.
>>
>> Case 2: page table splitting => consec PTEs point to the same large folio
>> => nr_ptes > 1 => get_and_clear_full_ptes() will smear a/d bits on the
>> new ptes, which is correct because we are still pointing to the same large
>> folio.
>>
> OK awesome, I thought as much, just wanted to make sure :) we are good then.
>
> The accessed/dirty bits really matter at a folio granularity (and especially
> with respect to reclaim/writeback which both operate at folio level) so the
> smearing as you say is fine.
>
> This patch therefore looks fine, only the trivial comment fixup.
>
> I ran the series on my x86-64 setup (fwiw) with no build/mm selftest errors.

Thanks!


>
> Sorry to be a pain but could you respin with the commit message for this patch
> updated to explicitly mention that the logic applies for the non-contPTE split
> PTE case (and therefore also helps performance there)? That and the trivial
> thing of dropping that comment.

What do you mean by the non-contpte case? In that case the PTEs do not point
to the same folio or are misaligned, and there will be no optimization. This
patch is optimizing two things: 1) ptep_get() READ_ONCE accesses 2) reduction
in number of TLBIs for contig blocks, both of which happen in the contpte case.

In general, the patch should have a minor improvement on other arches because
we are detecting a batch and processing it together, thus saving on a few
function calls, but the main benefit is for arm64.

>
> Then we should be good for a tag unless somebody else spots something
> egregious :)
>
> Thanks for this! Good improvement.
>
> [snip]
>
> Cheers, Lorenzo



* Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
  2025-05-28  3:32             ` Dev Jain
@ 2025-05-28  4:49               ` Lorenzo Stoakes
  2025-05-28  6:15                 ` Dev Jain
  0 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Stoakes @ 2025-05-28  4:49 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Wed, May 28, 2025 at 09:02:26AM +0530, Dev Jain wrote:
>
> On 27/05/25 10:16 pm, Lorenzo Stoakes wrote:
> > On Tue, May 27, 2025 at 10:08:59PM +0530, Dev Jain wrote:
> > > On 27/05/25 9:59 pm, Lorenzo Stoakes wrote:
> > [snip]
> > > > If I invoke split_huge_pmd(), I end up with a bunch of PTEs mapping the same
> > > > large folio. The folio itself is not split, so nr_ptes surely will be equal to
> > > > something >1 here right?
> > >
> > > Thanks for elaborating.
> > >
> > > So,
> > >
> > > Case 1: folio splitting => nr_ptes = 1 => the question of a/d bit smearing
> > > disappears.
> > >
> > > Case 2: page table splitting => consec PTEs point to the same large folio
> > > => nr_ptes > 1 => get_and_clear_full_ptes() will smear a/d bits on the
> > > new ptes, which is correct because we are still pointing to the same large
> > > folio.
> > >
> > OK awesome, I thought as much, just wanted to make sure :) we are good then.
> >
> > The accessed/dirty bits really matter at a folio granularity (and especially
> > with respect to reclaim/writeback which both operate at folio level) so the
> > smearing as you say is fine.
> >
> > This patch therefore looks fine, only the trivial comment fixup.
> >
> > I ran the series on my x86-64 setup (fwiw) with no build/mm selftest errors.
>
> Thanks!
>
>
> >
> > Sorry to be a pain but could you respin with the commit message for this patch
> > updated to explicitly mention that the logic applies for the non-contPTE split
> > PTE case (and therefore also helps performance there)? That and the trivial
> > thing of dropping that comment.
>
> What do you mean by the non-contpte case? In that case the PTEs do not point
> to the same folio or are misaligned, and there will be no optimization. This

Split page table large folio.

> patch is optimizing two things: 1) ptep_get() READ_ONCE accesses 2) reduction
> in number of TLBIs for contig blocks, both of which happen in the contpte case.
>

But it impacts split huge pages. Your code changes this behaviour. We need to
make this clear :)

> In general, the patch should have a minor improvement on other arches because
> we are detecting a batch and processing it together, thus saving on a few
> function calls, but the main benefit is for arm64.

Ack, but you are changing this behaviour. The commit message doesn't make this
clear and seems to imply this only impacts contPTE cases. Or at least isn't
clear enough

A simple additional paragraph like:

'Transparent huge pages which have been split into PTEs will also be impacted,
however the performance gain in this case is expected to be modest'

Will sort this out.

Thanks!

>
> >
> > Then we should be good for a tag unless somebody else spots something
> > egregious :)
> >
> > Thanks for this! Good improvement.
> >
> > [snip]
> >
> > Cheers, Lorenzo



* Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
  2025-05-28  4:49               ` Lorenzo Stoakes
@ 2025-05-28  6:15                 ` Dev Jain
  0 siblings, 0 replies; 14+ messages in thread
From: Dev Jain @ 2025-05-28  6:15 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd


On 28/05/25 10:19 am, Lorenzo Stoakes wrote:
> On Wed, May 28, 2025 at 09:02:26AM +0530, Dev Jain wrote:
>> On 27/05/25 10:16 pm, Lorenzo Stoakes wrote:
>>> On Tue, May 27, 2025 at 10:08:59PM +0530, Dev Jain wrote:
>>>> On 27/05/25 9:59 pm, Lorenzo Stoakes wrote:
>>> [snip]
>>>>> If I invoke split_huge_pmd(), I end up with a bunch of PTEs mapping the same
>>>>> large folio. The folio itself is not split, so nr_ptes surely will be equal to
>>>>> something >1 here right?
>>>> Thanks for elaborating.
>>>>
>>>> So,
>>>>
>>>> Case 1: folio splitting => nr_ptes = 1 => the question of a/d bit smearing
>>>> disappears.
>>>>
>>>> Case 2: page table splitting => consec PTEs point to the same large folio
>>>> => nr_ptes > 1 => get_and_clear_full_ptes() will smear a/d bits on the
>>>> new ptes, which is correct because we are still pointing to the same large
>>>> folio.
>>>>
>>> OK awesome, I thought as much, just wanted to make sure :) we are good then.
>>>
>>> The accessed/dirty bits really matter at a folio granularity (and especially
>>> with respect to reclaim/writeback which both operate at folio level) so the
>>> smearing as you say is fine.
>>>
>>> This patch therefore looks fine, only the trivial comment fixup.
>>>
>>> I ran the series on my x86-64 setup (fwiw) with no build/mm selftest errors.
>> Thanks!
>>
>>
>>> Sorry to be a pain but could you respin with the commit message for this patch
>>> updated to explicitly mention that the logic applies for the non-contPTE split
>>> PTE case (and therefore also helps performance there)? That and the trivial
>>> thing of dropping that comment.
>> What do you mean by the non-contpte case? In that case the PTEs do not point
>> to the same folio or are misaligned, and there will be no optimization. This
> Split page table large folio.
>
>> patch is optimizing two things: 1) ptep_get() READ_ONCE accesses 2) reduction
>> in number of TLBIs for contig blocks, both of which happen in the contpte case.
>>
> But it impacts split huge pages. Your code changes this behaviour. We need to
> make this clear :)
>
>> In general, the patch should have a minor improvement on other arches because
>> we are detecting a batch and processing it together, thus saving on a few
>> function calls, but the main benefit is for arm64.
> Ack, but you are changing this behaviour. The commit message doesn't make this
> clear and seems to imply this only impacts contPTE cases. Or at least isn't
> clear enough
>
> A simple additional paragraph like:
>
> 'Transparent huge pages which have been split into PTEs will also be impacted,
> however the performance gain in this case is expected to be modest'

Ah okay, sounds good.

>
> Will sort this out.
>
> Thanks!
>
>>> Then we should be good for a tag unless somebody else spots something
>>> egregious :)
>>>
>>> Thanks for this! Good improvement.
>>>
>>> [snip]
>>>
>>> Cheers, Lorenzo



end of thread

Thread overview: 14+ messages
2025-05-27  7:50 [PATCH v3 0/2] Optimize mremap() for large folios Dev Jain
2025-05-27  7:50 ` [PATCH v3 1/2] mm: Call pointers to ptes as ptep Dev Jain
2025-05-27  7:50 ` [PATCH v3 2/2] mm: Optimize mremap() by PTE batching Dev Jain
2025-05-27 10:45   ` Lorenzo Stoakes
2025-05-27 16:22     ` Dev Jain
2025-05-27 16:29       ` Lorenzo Stoakes
2025-05-27 16:38         ` Dev Jain
2025-05-27 16:46           ` Lorenzo Stoakes
2025-05-28  3:32             ` Dev Jain
2025-05-28  4:49               ` Lorenzo Stoakes
2025-05-28  6:15                 ` Dev Jain
2025-05-27 10:50 ` [PATCH v3 0/2] Optimize mremap() for large folios Lorenzo Stoakes
2025-05-27 16:26   ` Dev Jain
2025-05-27 16:32     ` Lorenzo Stoakes
