[PATCH v2 0/2] Optimize mremap() for large folios

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] Optimize mremap() for large folios
@ 2025-05-07  6:02 Dev Jain
  2025-05-07  6:02 ` [PATCH v2 1/2] mm: Call pointers to ptes as ptep Dev Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Dev Jain @ 2025-05-07  6:02 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd, Dev Jain

Currently move_ptes() iterates through ptes one by one. If the underlying
folio mapped by the ptes is large, we can process those ptes in a batch
using folio_pte_batch(), thus clearing and setting the PTEs in one go.
For arm64 specifically, this results in a 16x reduction in the number of
ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
through all 16 entries to collect a/d bits), and we also elide extra TLBIs
through get_and_clear_full_ptes, replacing ptep_get_and_clear.

Mapping 512K of memory, memsetting it, remapping it to src + 512K, and
munmapping it 10,000 times, the average execution time reduces from 1.9 to
1.2 seconds, giving a 37% performance optimization, on Apple M3 (arm64).

Test program for reference:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>

#define SIZE (1UL << 20) // 512 KB

int main(void) {
    void *new_addr, *addr;

    for (int i = 0; i < 10000; ++i) {
        addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(addr, 0xAA, SIZE);

        new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
        if (new_addr != (addr + SIZE)) {
                perror("mremap");
                return 1;
        }
        munmap(new_addr, SIZE);
    }

}

v1->v2:
 - Expand patch descriptions, move pte declarations to a new line,
   reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
   fix loop iteration (Lorenzo)
 - Merge patch 2 and 3 (Anshuman, Lorenzo)
 - Fix maybe_contiguous_pte_pfns (Willy)

Dev Jain (2):
  mm: Call pointers to ptes as ptep
  mm: Optimize mremap() by PTE batching

 include/linux/pgtable.h | 29 ++++++++++++++++++++++
 mm/mremap.c             | 54 +++++++++++++++++++++++++++++------------
 2 files changed, 68 insertions(+), 15 deletions(-)

-- 
2.30.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 1/2] mm: Call pointers to ptes as ptep
  2025-05-07  6:02 [PATCH v2 0/2] Optimize mremap() for large folios Dev Jain
@ 2025-05-07  6:02 ` Dev Jain
  2025-05-08  1:05   ` Barry Song
                     ` (2 more replies)
  2025-05-07  6:02 ` [PATCH v2 2/2] mm: Optimize mremap() by PTE batching Dev Jain
  2025-05-08 18:35 ` [PATCH v2 0/2] Optimize mremap() for large folios Lorenzo Stoakes
  2 siblings, 3 replies; 23+ messages in thread
From: Dev Jain @ 2025-05-07  6:02 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd, Dev Jain

Avoid confusion between pte_t* and pte_t data types by suffixing pointer
type variables with p. No functional change.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mremap.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 7db9da609c84..0163e02e5aa8 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -176,7 +176,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	struct vm_area_struct *vma = pmc->old;
 	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *old_pte, *new_pte, pte;
+	pte_t *old_ptep, *new_ptep;
+	pte_t pte;
 	pmd_t dummy_pmdval;
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
@@ -211,8 +212,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	 * We don't have to worry about the ordering of src and dst
 	 * pte locks because exclusive mmap_lock prevents deadlock.
 	 */
-	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
-	if (!old_pte) {
+	old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
+	if (!old_ptep) {
 		err = -EAGAIN;
 		goto out;
 	}
@@ -223,10 +224,10 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	 * mmap_lock, so this new_pte page is stable, so there is no need to get
 	 * pmdval and do pmd_same() check.
 	 */
-	new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
+	new_ptep = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
 					   &new_ptl);
-	if (!new_pte) {
-		pte_unmap_unlock(old_pte, old_ptl);
+	if (!new_ptep) {
+		pte_unmap_unlock(old_ptep, old_ptl);
 		err = -EAGAIN;
 		goto out;
 	}
@@ -235,12 +236,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
-	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
-				   new_pte++, new_addr += PAGE_SIZE) {
-		if (pte_none(ptep_get(old_pte)))
+	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
+				   new_ptep++, new_addr += PAGE_SIZE) {
+		if (pte_none(ptep_get(old_ptep)))
 			continue;
 
-		pte = ptep_get_and_clear(mm, old_addr, old_pte);
+		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
 		/*
 		 * If we are remapping a valid PTE, make sure
 		 * to flush TLB before we drop the PTL for the
@@ -258,7 +259,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		pte = move_soft_dirty_pte(pte);
 
 		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
-			pte_clear(mm, new_addr, new_pte);
+			pte_clear(mm, new_addr, new_ptep);
 		else {
 			if (need_clear_uffd_wp) {
 				if (pte_present(pte))
@@ -266,7 +267,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 				else if (is_swap_pte(pte))
 					pte = pte_swp_clear_uffd_wp(pte);
 			}
-			set_pte_at(mm, new_addr, new_pte, pte);
+			set_pte_at(mm, new_addr, new_ptep, pte);
 		}
 	}
 
@@ -275,8 +276,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		flush_tlb_range(vma, old_end - len, old_end);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
-	pte_unmap(new_pte - 1);
-	pte_unmap_unlock(old_pte - 1, old_ptl);
+	pte_unmap(new_ptep - 1);
+	pte_unmap_unlock(old_ptep - 1, old_ptl);
 out:
 	if (pmc->need_rmap_locks)
 		drop_rmap_locks(vma);
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] mm: Call pointers to ptes as ptep
  2025-05-07  6:02 ` [PATCH v2 1/2] mm: Call pointers to ptes as ptep Dev Jain
@ 2025-05-08  1:05   ` Barry Song
  2025-05-08  6:21   ` Anshuman Khandual
  2025-05-08  9:21   ` Lorenzo Stoakes
  2 siblings, 0 replies; 23+ messages in thread
From: Barry Song @ 2025-05-08  1:05 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, david, peterx, ryan.roberts, mingo,
	libang.li, maobibo, zhengqi.arch, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Wed, May 7, 2025 at 6:03 PM Dev Jain <dev.jain@arm.com> wrote:
>
> Avoid confusion between pte_t* and pte_t data types by suffixing pointer
> type variables with p. No functional change.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Reviewed-by: Barry Song <baohua@kernel.org>

> ---
>  mm/mremap.c | 29 +++++++++++++++--------------
>  1 file changed, 15 insertions(+), 14 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 7db9da609c84..0163e02e5aa8 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -176,7 +176,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>         struct vm_area_struct *vma = pmc->old;
>         bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>         struct mm_struct *mm = vma->vm_mm;
> -       pte_t *old_pte, *new_pte, pte;
> +       pte_t *old_ptep, *new_ptep;
> +       pte_t pte;
>         pmd_t dummy_pmdval;
>         spinlock_t *old_ptl, *new_ptl;
>         bool force_flush = false;
> @@ -211,8 +212,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>          * We don't have to worry about the ordering of src and dst
>          * pte locks because exclusive mmap_lock prevents deadlock.
>          */
> -       old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> -       if (!old_pte) {
> +       old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> +       if (!old_ptep) {
>                 err = -EAGAIN;
>                 goto out;
>         }
> @@ -223,10 +224,10 @@ static int move_ptes(struct pagetable_move_control *pmc,
>          * mmap_lock, so this new_pte page is stable, so there is no need to get
>          * pmdval and do pmd_same() check.
>          */
> -       new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
> +       new_ptep = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
>                                            &new_ptl);
> -       if (!new_pte) {
> -               pte_unmap_unlock(old_pte, old_ptl);
> +       if (!new_ptep) {
> +               pte_unmap_unlock(old_ptep, old_ptl);
>                 err = -EAGAIN;
>                 goto out;
>         }
> @@ -235,12 +236,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>         flush_tlb_batched_pending(vma->vm_mm);
>         arch_enter_lazy_mmu_mode();
>
> -       for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
> -                                  new_pte++, new_addr += PAGE_SIZE) {
> -               if (pte_none(ptep_get(old_pte)))
> +       for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> +                                  new_ptep++, new_addr += PAGE_SIZE) {
> +               if (pte_none(ptep_get(old_ptep)))
>                         continue;
>
> -               pte = ptep_get_and_clear(mm, old_addr, old_pte);
> +               pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>                 /*
>                  * If we are remapping a valid PTE, make sure
>                  * to flush TLB before we drop the PTL for the
> @@ -258,7 +259,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>                 pte = move_soft_dirty_pte(pte);
>
>                 if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> -                       pte_clear(mm, new_addr, new_pte);
> +                       pte_clear(mm, new_addr, new_ptep);
>                 else {
>                         if (need_clear_uffd_wp) {
>                                 if (pte_present(pte))
> @@ -266,7 +267,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>                                 else if (is_swap_pte(pte))
>                                         pte = pte_swp_clear_uffd_wp(pte);
>                         }
> -                       set_pte_at(mm, new_addr, new_pte, pte);
> +                       set_pte_at(mm, new_addr, new_ptep, pte);
>                 }
>         }
>
> @@ -275,8 +276,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>                 flush_tlb_range(vma, old_end - len, old_end);
>         if (new_ptl != old_ptl)
>                 spin_unlock(new_ptl);
> -       pte_unmap(new_pte - 1);
> -       pte_unmap_unlock(old_pte - 1, old_ptl);
> +       pte_unmap(new_ptep - 1);
> +       pte_unmap_unlock(old_ptep - 1, old_ptl);
>  out:
>         if (pmc->need_rmap_locks)
>                 drop_rmap_locks(vma);
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] mm: Call pointers to ptes as ptep
  2025-05-07  6:02 ` [PATCH v2 1/2] mm: Call pointers to ptes as ptep Dev Jain
  2025-05-08  1:05   ` Barry Song
@ 2025-05-08  6:21   ` Anshuman Khandual
  2025-05-08  9:21   ` Lorenzo Stoakes
  2 siblings, 0 replies; 23+ messages in thread
From: Anshuman Khandual @ 2025-05-08  6:21 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, willy, ioworker0, yang,
	baolin.wang, ziy, hughd

On 5/7/25 11:32, Dev Jain wrote:
> Avoid confusion between pte_t* and pte_t data types by suffixing pointer
> type variables with p. No functional change.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mremap.c | 29 +++++++++++++++--------------
>  1 file changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 7db9da609c84..0163e02e5aa8 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -176,7 +176,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	struct vm_area_struct *vma = pmc->old;
>  	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>  	struct mm_struct *mm = vma->vm_mm;
> -	pte_t *old_pte, *new_pte, pte;
> +	pte_t *old_ptep, *new_ptep;
> +	pte_t pte;
>  	pmd_t dummy_pmdval;
>  	spinlock_t *old_ptl, *new_ptl;
>  	bool force_flush = false;
> @@ -211,8 +212,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	 * We don't have to worry about the ordering of src and dst
>  	 * pte locks because exclusive mmap_lock prevents deadlock.
>  	 */
> -	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> -	if (!old_pte) {
> +	old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> +	if (!old_ptep) {
>  		err = -EAGAIN;
>  		goto out;
>  	}
> @@ -223,10 +224,10 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	 * mmap_lock, so this new_pte page is stable, so there is no need to get
>  	 * pmdval and do pmd_same() check.
>  	 */
> -	new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
> +	new_ptep = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
>  					   &new_ptl);
> -	if (!new_pte) {
> -		pte_unmap_unlock(old_pte, old_ptl);
> +	if (!new_ptep) {
> +		pte_unmap_unlock(old_ptep, old_ptl);
>  		err = -EAGAIN;
>  		goto out;
>  	}
> @@ -235,12 +236,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>  
> -	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
> -				   new_pte++, new_addr += PAGE_SIZE) {
> -		if (pte_none(ptep_get(old_pte)))
> +	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> +				   new_ptep++, new_addr += PAGE_SIZE) {
> +		if (pte_none(ptep_get(old_ptep)))
>  			continue;
>  
> -		pte = ptep_get_and_clear(mm, old_addr, old_pte);
> +		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>  		/*
>  		 * If we are remapping a valid PTE, make sure
>  		 * to flush TLB before we drop the PTL for the
> @@ -258,7 +259,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		pte = move_soft_dirty_pte(pte);
>  
>  		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> -			pte_clear(mm, new_addr, new_pte);
> +			pte_clear(mm, new_addr, new_ptep);
>  		else {
>  			if (need_clear_uffd_wp) {
>  				if (pte_present(pte))
> @@ -266,7 +267,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  				else if (is_swap_pte(pte))
>  					pte = pte_swp_clear_uffd_wp(pte);
>  			}
> -			set_pte_at(mm, new_addr, new_pte, pte);
> +			set_pte_at(mm, new_addr, new_ptep, pte);
>  		}
>  	}
>  
> @@ -275,8 +276,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		flush_tlb_range(vma, old_end - len, old_end);
>  	if (new_ptl != old_ptl)
>  		spin_unlock(new_ptl);
> -	pte_unmap(new_pte - 1);
> -	pte_unmap_unlock(old_pte - 1, old_ptl);
> +	pte_unmap(new_ptep - 1);
> +	pte_unmap_unlock(old_ptep - 1, old_ptl);
>  out:
>  	if (pmc->need_rmap_locks)
>  		drop_rmap_locks(vma);

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] mm: Call pointers to ptes as ptep
  2025-05-07  6:02 ` [PATCH v2 1/2] mm: Call pointers to ptes as ptep Dev Jain
  2025-05-08  1:05   ` Barry Song
  2025-05-08  6:21   ` Anshuman Khandual
@ 2025-05-08  9:21   ` Lorenzo Stoakes
  2 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-08  9:21 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Wed, May 07, 2025 at 11:32:55AM +0530, Dev Jain wrote:
> Avoid confusion between pte_t* and pte_t data types by suffixing pointer
> type variables with p. No functional change.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

LGTM, thanks!

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/mremap.c | 29 +++++++++++++++--------------
>  1 file changed, 15 insertions(+), 14 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 7db9da609c84..0163e02e5aa8 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -176,7 +176,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	struct vm_area_struct *vma = pmc->old;
>  	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>  	struct mm_struct *mm = vma->vm_mm;
> -	pte_t *old_pte, *new_pte, pte;
> +	pte_t *old_ptep, *new_ptep;
> +	pte_t pte;
>  	pmd_t dummy_pmdval;
>  	spinlock_t *old_ptl, *new_ptl;
>  	bool force_flush = false;
> @@ -211,8 +212,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	 * We don't have to worry about the ordering of src and dst
>  	 * pte locks because exclusive mmap_lock prevents deadlock.
>  	 */
> -	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> -	if (!old_pte) {
> +	old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> +	if (!old_ptep) {
>  		err = -EAGAIN;
>  		goto out;
>  	}
> @@ -223,10 +224,10 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	 * mmap_lock, so this new_pte page is stable, so there is no need to get
>  	 * pmdval and do pmd_same() check.
>  	 */
> -	new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
> +	new_ptep = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
>  					   &new_ptl);
> -	if (!new_pte) {
> -		pte_unmap_unlock(old_pte, old_ptl);
> +	if (!new_ptep) {
> +		pte_unmap_unlock(old_ptep, old_ptl);
>  		err = -EAGAIN;
>  		goto out;
>  	}
> @@ -235,12 +236,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>
> -	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
> -				   new_pte++, new_addr += PAGE_SIZE) {
> -		if (pte_none(ptep_get(old_pte)))
> +	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> +				   new_ptep++, new_addr += PAGE_SIZE) {
> +		if (pte_none(ptep_get(old_ptep)))
>  			continue;
>
> -		pte = ptep_get_and_clear(mm, old_addr, old_pte);
> +		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>  		/*
>  		 * If we are remapping a valid PTE, make sure
>  		 * to flush TLB before we drop the PTL for the
> @@ -258,7 +259,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		pte = move_soft_dirty_pte(pte);
>
>  		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> -			pte_clear(mm, new_addr, new_pte);
> +			pte_clear(mm, new_addr, new_ptep);
>  		else {
>  			if (need_clear_uffd_wp) {
>  				if (pte_present(pte))
> @@ -266,7 +267,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  				else if (is_swap_pte(pte))
>  					pte = pte_swp_clear_uffd_wp(pte);
>  			}
> -			set_pte_at(mm, new_addr, new_pte, pte);
> +			set_pte_at(mm, new_addr, new_ptep, pte);
>  		}
>  	}
>
> @@ -275,8 +276,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		flush_tlb_range(vma, old_end - len, old_end);
>  	if (new_ptl != old_ptl)
>  		spin_unlock(new_ptl);
> -	pte_unmap(new_pte - 1);
> -	pte_unmap_unlock(old_pte - 1, old_ptl);
> +	pte_unmap(new_ptep - 1);
> +	pte_unmap_unlock(old_ptep - 1, old_ptl);
>  out:
>  	if (pmc->need_rmap_locks)
>  		drop_rmap_locks(vma);
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-07  6:02 [PATCH v2 0/2] Optimize mremap() for large folios Dev Jain
  2025-05-07  6:02 ` [PATCH v2 1/2] mm: Call pointers to ptes as ptep Dev Jain
@ 2025-05-07  6:02 ` Dev Jain
  2025-05-08  1:37   ` Andrew Morton
                     ` (3 more replies)
  2025-05-08 18:35 ` [PATCH v2 0/2] Optimize mremap() for large folios Lorenzo Stoakes
  2 siblings, 4 replies; 23+ messages in thread
From: Dev Jain @ 2025-05-07  6:02 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd, Dev Jain

To use PTE batching, we want to determine whether the folio mapped by
the PTE is large, thus requiring the use of vm_normal_folio(). We want
to avoid the cost of vm_normal_folio() if the code path doesn't already
require the folio. For arm64, pte_batch_hint() does the job. To generalize
this hint, add a helper which will determine whether two consecutive PTEs
point to consecutive PFNs, in which case there is a high probability that
the underlying folio is large.
Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
are painted with the contig bit, then ptep_get() will iterate through all 16
entries to collect a/d bits. Hence this optimization will result in a 16x
reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
will eventually call contpte_try_unfold() on every contig block, thus
flushing the TLB for the complete large folio range. Instead, use
get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
do them on the starting and ending contig block.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
 mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
 2 files changed, 59 insertions(+), 7 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c92..38dab1f562ed 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
 }
 #endif
 
+/**
+ * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
+ * to a large folio.
+ * @ptep: Pointer to the page table entry.
+ * @pte: The page table entry.
+ *
+ * This helper is invoked when the caller wants to batch over a set of ptes
+ * mapping a large folio, but the concerned code path does not already have
+ * the folio. We want to avoid the cost of vm_normal_folio() only to find that
+ * the underlying folio was small; i.e keep the small folio case as fast as
+ * possible.
+ *
+ * The caller must ensure that ptep + 1 exists.
+ */
+static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
+{
+	pte_t *next_ptep, next_pte;
+
+	if (pte_batch_hint(ptep, pte) != 1)
+		return true;
+
+	next_ptep = ptep + 1;
+	next_pte = ptep_get(next_ptep);
+	if (!pte_present(next_pte))
+		return false;
+
+	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);
+}
+
 #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address,
diff --git a/mm/mremap.c b/mm/mremap.c
index 0163e02e5aa8..9c88a276bec4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
 	return pte;
 }
 
+/* mremap a batch of PTEs mapping the same large folio */
+static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
+		pte_t *ptep, pte_t pte, int max_nr)
+{
+	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+	struct folio *folio;
+	int nr = 1;
+
+	if ((max_nr != 1) && maybe_contiguous_pte_pfns(ptep, pte)) {
+		folio = vm_normal_folio(vma, addr, pte);
+		if (folio && folio_test_large(folio))
+			nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
+					     flags, NULL, NULL, NULL);
+	}
+	return nr;
+}
+
 static int move_ptes(struct pagetable_move_control *pmc,
 		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
 {
@@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_ptep, *new_ptep;
-	pte_t pte;
+	pte_t old_pte, pte;
 	pmd_t dummy_pmdval;
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
@@ -186,6 +203,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	unsigned long old_end = old_addr + extent;
 	unsigned long len = old_end - old_addr;
 	int err = 0;
+	int max_nr;
 
 	/*
 	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
@@ -236,12 +254,13 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
-	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
-				   new_ptep++, new_addr += PAGE_SIZE) {
-		if (pte_none(ptep_get(old_ptep)))
+	for (int nr = 1; old_addr < old_end; old_ptep += nr, old_addr += nr * PAGE_SIZE,
+				   new_ptep += nr, new_addr += nr * PAGE_SIZE) {
+		max_nr = (old_end - old_addr) >> PAGE_SHIFT;
+		old_pte = ptep_get(old_ptep);
+		if (pte_none(old_pte))
 			continue;
 
-		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
 		/*
 		 * If we are remapping a valid PTE, make sure
 		 * to flush TLB before we drop the PTL for the
@@ -253,8 +272,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		 * the TLB entry for the old mapping has been
 		 * flushed.
 		 */
-		if (pte_present(pte))
+		if (pte_present(old_pte)) {
+			nr = mremap_folio_pte_batch(vma, old_addr, old_ptep,
+						    old_pte, max_nr);
 			force_flush = true;
+		}
+		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
 		pte = move_pte(pte, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);
 
@@ -267,7 +290,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 				else if (is_swap_pte(pte))
 					pte = pte_swp_clear_uffd_wp(pte);
 			}
-			set_pte_at(mm, new_addr, new_ptep, pte);
+			set_ptes(mm, new_addr, new_ptep, pte, nr);
 		}
 	}
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-07  6:02 ` [PATCH v2 2/2] mm: Optimize mremap() by PTE batching Dev Jain
@ 2025-05-08  1:37   ` Andrew Morton
  2025-05-08  4:53     ` Lorenzo Stoakes
  2025-05-08  2:00   ` Zi Yan
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2025-05-08  1:37 UTC (permalink / raw)
  To: Dev Jain
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Wed,  7 May 2025 11:32:56 +0530 Dev Jain <dev.jain@arm.com> wrote:

> To use PTE batching, we want to determine whether the folio mapped by
> the PTE is large, thus requiring the use of vm_normal_folio(). We want
> to avoid the cost of vm_normal_folio() if the code path doesn't already
> require the folio. For arm64, pte_batch_hint() does the job. To generalize
> this hint, add a helper which will determine whether two consecutive PTEs
> point to consecutive PFNs, in which case there is a high probability that
> the underlying folio is large.
> Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all 16
> entries to collect a/d bits. Hence this optimization will result in a 16x
> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> will eventually call contpte_try_unfold() on every contig block, thus
> flushing the TLB for the complete large folio range. Instead, use
> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> do them on the starting and ending contig block.
> 
> ...
>
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>  }
>  #endif
>  
> +/**
> + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
> + * to a large folio.
> + * @ptep: Pointer to the page table entry.
> + * @pte: The page table entry.
> + *
> + * This helper is invoked when the caller wants to batch over a set of ptes
> + * mapping a large folio, but the concerned code path does not already have
> + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
> + * the underlying folio was small; i.e keep the small folio case as fast as
> + * possible.
> + *
> + * The caller must ensure that ptep + 1 exists.
> + */

Gee, that isn't the prettiest interface we've ever invented.

Is there any prospect that this function will get another caller?  If
not then it's probably better to just open-code all this in the caller
and fill it up with comments.


Anyway, review of this series is scanty.  I'd normally enter
wait-and-see mode, but this:

> the average execution time reduces from 1.9 to
> 1.2 seconds, giving a 37% performance optimization, on Apple M3 (arm64).

prompts me into push-it-along mode.




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-08  1:37   ` Andrew Morton
@ 2025-05-08  4:53     ` Lorenzo Stoakes
  0 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-08  4:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dev Jain, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Wed, May 07, 2025 at 06:37:03PM -0700, Andrew Morton wrote:
> On Wed,  7 May 2025 11:32:56 +0530 Dev Jain <dev.jain@arm.com> wrote:
>
> > To use PTE batching, we want to determine whether the folio mapped by
> > the PTE is large, thus requiring the use of vm_normal_folio(). We want
> > to avoid the cost of vm_normal_folio() if the code path doesn't already
> > require the folio. For arm64, pte_batch_hint() does the job. To generalize
> > this hint, add a helper which will determine whether two consecutive PTEs
> > point to consecutive PFNs, in which case there is a high probability that
> > the underlying folio is large.
> > Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> > are painted with the contig bit, then ptep_get() will iterate through all 16
> > entries to collect a/d bits. Hence this optimization will result in a 16x
> > reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> > will eventually call contpte_try_unfold() on every contig block, thus
> > flushing the TLB for the complete large folio range. Instead, use
> > get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> > do them on the starting and ending contig block.
> >
> > ...
> >
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> >  }
> >  #endif
> >
> > +/**
> > + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
> > + * to a large folio.
> > + * @ptep: Pointer to the page table entry.
> > + * @pte: The page table entry.
> > + *
> > + * This helper is invoked when the caller wants to batch over a set of ptes
> > + * mapping a large folio, but the concerned code path does not already have
> > + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
> > + * the underlying folio was small; i.e keep the small folio case as fast as
> > + * possible.
> > + *
> > + * The caller must ensure that ptep + 1 exists.
> > + */
>
> Gee, that isn't the prettiest interface we've ever invented.
>
> Is there any prospect that this function will get another caller?  If
> not then it's probably better to just open-code all this in the caller
> and fill it up with comments.
>
>
> Anyway, review of this series is scanty.  I'd normally enter
> wait-and-see mode, but this:
>
> > the average execution time reduces from 1.9 to
> > 1.2 seconds, giving a 37% performance optimization, on Apple M3 (arm64).
>
> prompts me into push-it-along mode.
>
>

There's been a ton of review on the v1 (see [0]), I will get to this one
soon, but as you can see from the v1 review there's lots we need to check
for correctness.

So it's coming :)

[0]: https://lore.kernel.org/linux-mm/20250506050056.59250-1-dev.jain@arm.com/

Let's not be too hasty to merge this though until we're sure it's safe
(this is delicate code). But I'll be giving R-b tags on a finalised version
anyway.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-07  6:02 ` [PATCH v2 2/2] mm: Optimize mremap() by PTE batching Dev Jain
  2025-05-08  1:37   ` Andrew Morton
@ 2025-05-08  2:00   ` Zi Yan
  2025-05-08  4:01     ` Dev Jain
  2025-05-08  6:31     ` Anshuman Khandual
  2025-05-08  7:16   ` Anshuman Khandual
  2025-05-08 10:04   ` Lorenzo Stoakes
  3 siblings, 2 replies; 23+ messages in thread
From: Zi Yan @ 2025-05-08  2:00 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, david, peterx, ryan.roberts, mingo,
	libang.li, maobibo, zhengqi.arch, baohua, anshuman.khandual,
	willy, ioworker0, yang, baolin.wang, hughd

On 7 May 2025, at 2:02, Dev Jain wrote:

> To use PTE batching, we want to determine whether the folio mapped by
> the PTE is large, thus requiring the use of vm_normal_folio(). We want
> to avoid the cost of vm_normal_folio() if the code path doesn't already
> require the folio. For arm64, pte_batch_hint() does the job. To generalize
> this hint, add a helper which will determine whether two consecutive PTEs
> point to consecutive PFNs, in which case there is a high probability that
> the underlying folio is large.
> Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all 16
> entries to collect a/d bits. Hence this optimization will result in a 16x
> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> will eventually call contpte_try_unfold() on every contig block, thus
> flushing the TLB for the complete large folio range. Instead, use
> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> do them on the starting and ending contig block.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
>  mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
>  2 files changed, 59 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..38dab1f562ed 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>  }
>  #endif
>
> +/**
> + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
> + * to a large folio.
> + * @ptep: Pointer to the page table entry.
> + * @pte: The page table entry.
> + *
> + * This helper is invoked when the caller wants to batch over a set of ptes
> + * mapping a large folio, but the concerned code path does not already have
> + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
> + * the underlying folio was small; i.e keep the small folio case as fast as
> + * possible.
> + *
> + * The caller must ensure that ptep + 1 exists.

ptep points to an entry in a PTE page. As long as it is not pointing
to the last entry, ptep+1 should always exist. With PTRS_PER_PTE and
sizeof(pte_t), you can check ptep address to figure out whether it
is the last entry of a PTE page, right? Let me know if I misunderstand
anything.


--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-08  2:00   ` Zi Yan
@ 2025-05-08  4:01     ` Dev Jain
  2025-05-08  6:31     ` Anshuman Khandual
  1 sibling, 0 replies; 23+ messages in thread
From: Dev Jain @ 2025-05-08  4:01 UTC (permalink / raw)
  To: Zi Yan
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, david, peterx, ryan.roberts, mingo,
	libang.li, maobibo, zhengqi.arch, baohua, anshuman.khandual,
	willy, ioworker0, yang, baolin.wang, hughd



On 08/05/25 7:30 am, Zi Yan wrote:
> On 7 May 2025, at 2:02, Dev Jain wrote:
> 
>> To use PTE batching, we want to determine whether the folio mapped by
>> the PTE is large, thus requiring the use of vm_normal_folio(). We want
>> to avoid the cost of vm_normal_folio() if the code path doesn't already
>> require the folio. For arm64, pte_batch_hint() does the job. To generalize
>> this hint, add a helper which will determine whether two consecutive PTEs
>> point to consecutive PFNs, in which case there is a high probability that
>> the underlying folio is large.
>> Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>> are painted with the contig bit, then ptep_get() will iterate through all 16
>> entries to collect a/d bits. Hence this optimization will result in a 16x
>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>> will eventually call contpte_try_unfold() on every contig block, thus
>> flushing the TLB for the complete large folio range. Instead, use
>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>> do them on the starting and ending contig block.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
>>   mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
>>   2 files changed, 59 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..38dab1f562ed 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>>   }
>>   #endif
>>
>> +/**
>> + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
>> + * to a large folio.
>> + * @ptep: Pointer to the page table entry.
>> + * @pte: The page table entry.
>> + *
>> + * This helper is invoked when the caller wants to batch over a set of ptes
>> + * mapping a large folio, but the concerned code path does not already have
>> + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
>> + * the underlying folio was small; i.e keep the small folio case as fast as
>> + * possible.
>> + *
>> + * The caller must ensure that ptep + 1 exists.
> 
> ptep points to an entry in a PTE page. As long as it is not pointing
> to the last entry, ptep+1 should always exist. With PTRS_PER_PTE and
> sizeof(pte_t), you can check ptep address to figure out whether it
> is the last entry of a PTE page, right? Let me know if I misunderstand
> anything.

Sounds correct to me.

> 
> 
> --
> Best Regards,
> Yan, Zi



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-08  2:00   ` Zi Yan
  2025-05-08  4:01     ` Dev Jain
@ 2025-05-08  6:31     ` Anshuman Khandual
  1 sibling, 0 replies; 23+ messages in thread
From: Anshuman Khandual @ 2025-05-08  6:31 UTC (permalink / raw)
  To: Zi Yan, Dev Jain
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, david, peterx, ryan.roberts, mingo,
	libang.li, maobibo, zhengqi.arch, baohua, willy, ioworker0, yang,
	baolin.wang, hughd



On 5/8/25 07:30, Zi Yan wrote:
> On 7 May 2025, at 2:02, Dev Jain wrote:
> 
>> To use PTE batching, we want to determine whether the folio mapped by
>> the PTE is large, thus requiring the use of vm_normal_folio(). We want
>> to avoid the cost of vm_normal_folio() if the code path doesn't already
>> require the folio. For arm64, pte_batch_hint() does the job. To generalize
>> this hint, add a helper which will determine whether two consecutive PTEs
>> point to consecutive PFNs, in which case there is a high probability that
>> the underlying folio is large.
>> Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>> are painted with the contig bit, then ptep_get() will iterate through all 16
>> entries to collect a/d bits. Hence this optimization will result in a 16x
>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>> will eventually call contpte_try_unfold() on every contig block, thus
>> flushing the TLB for the complete large folio range. Instead, use
>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>> do them on the starting and ending contig block.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>  include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
>>  mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
>>  2 files changed, 59 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..38dab1f562ed 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>>  }
>>  #endif
>>
>> +/**
>> + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
>> + * to a large folio.
>> + * @ptep: Pointer to the page table entry.
>> + * @pte: The page table entry.
>> + *
>> + * This helper is invoked when the caller wants to batch over a set of ptes
>> + * mapping a large folio, but the concerned code path does not already have
>> + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
>> + * the underlying folio was small; i.e keep the small folio case as fast as
>> + * possible.
>> + *
>> + * The caller must ensure that ptep + 1 exists.
> 
> ptep points to an entry in a PTE page. As long as it is not pointing
> to the last entry, ptep+1 should always exist. With PTRS_PER_PTE and
> sizeof(pte_t), you can check ptep address to figure out whether it
> is the last entry of a PTE page, right? Let me know if I misunderstand
> anything.

Agreed, this not-the-last-pte-entry test is definitely required here just
to prevent a potential unmapped access crash. But I do agree with Andrew
that unless there are callers, this should be contained in the call site
itself (mm/mremap.c) with a good explanation.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-07  6:02 ` [PATCH v2 2/2] mm: Optimize mremap() by PTE batching Dev Jain
  2025-05-08  1:37   ` Andrew Morton
  2025-05-08  2:00   ` Zi Yan
@ 2025-05-08  7:16   ` Anshuman Khandual
  2025-05-08  8:05     ` Dev Jain
  2025-05-08  9:40     ` Lorenzo Stoakes
  2025-05-08 10:04   ` Lorenzo Stoakes
  3 siblings, 2 replies; 23+ messages in thread
From: Anshuman Khandual @ 2025-05-08  7:16 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, willy, ioworker0, yang,
	baolin.wang, ziy, hughd

On 5/7/25 11:32, Dev Jain wrote:
> To use PTE batching, we want to determine whether the folio mapped by
> the PTE is large, thus requiring the use of vm_normal_folio(). We want
> to avoid the cost of vm_normal_folio() if the code path doesn't already
> require the folio. For arm64, pte_batch_hint() does the job. To generalize
> this hint, add a helper which will determine whether two consecutive PTEs
> point to consecutive PFNs, in which case there is a high probability that
> the underlying folio is large.
> Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all 16
> entries to collect a/d bits. Hence this optimization will result in a 16x
> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> will eventually call contpte_try_unfold() on every contig block, thus
> flushing the TLB for the complete large folio range. Instead, use
> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> do them on the starting and ending contig block.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
>  mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
>  2 files changed, 59 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..38dab1f562ed 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>  }
>  #endif
>  
> +/**
> + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
> + * to a large folio.
> + * @ptep: Pointer to the page table entry.
> + * @pte: The page table entry.
> + *
> + * This helper is invoked when the caller wants to batch over a set of ptes
> + * mapping a large folio, but the concerned code path does not already have
> + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
> + * the underlying folio was small; i.e keep the small folio case as fast as
> + * possible.
> + *
> + * The caller must ensure that ptep + 1 exists.
> + */
> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
> +{
> +	pte_t *next_ptep, next_pte;
> +
> +	if (pte_batch_hint(ptep, pte) != 1)
> +		return true;
> +
> +	next_ptep = ptep + 1;
> +	next_pte = ptep_get(next_ptep);
> +	if (!pte_present(next_pte))
> +		return false;
> +
> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);
> +}
> +
>  #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  					    unsigned long address,
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 0163e02e5aa8..9c88a276bec4 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>  	return pte;
>  }
>  
> +/* mremap a batch of PTEs mapping the same large folio */
> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t pte, int max_nr)
> +{
> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +	struct folio *folio;
> +	int nr = 1;

A small nit - s/nr/nr_pages ?

> +
> +	if ((max_nr != 1) && maybe_contiguous_pte_pfns(ptep, pte)) {

Like mentioned earlier in v1, could maybe_contiguous_pte_pfns() here
add some additional cost for buffers that are actually not mapped to
contig physical pages.

The test case you have mentioned in the cover demonstrating performance
gains might have always been run just after boot, thus increasing the
probability of contiguous physical mapping, which will not be the case
on fragmented memory systems. In that case the proposed consecutive PFN
comparison will always happen unconditionally without any benefit ?

Just curious.

From V1

--------------------------------------------------------------------
maybe_contiguous_pte_pfns() cost will be applicable for memory
areas greater than a single PAGE_SIZE (i.e max_nr != 1) ? This
helper extracts an additional consecutive pte, ensures that it
is valid mapped and extracts pfn before comparing for the span.

There is some cost associated with the above code sequence which
looks justified for sequential access of memory buffers that has
consecutive physical memory backing. But what happens when such
buffers are less probable, will those buffers take a performance
hit for all the comparisons that just turn out to be negative ?
--------------------------------------------------------------------

> +		folio = vm_normal_folio(vma, addr, pte);
> +		if (folio && folio_test_large(folio))
> +			nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
> +					     flags, NULL, NULL, NULL);
> +	}
> +	return nr;
> +}
> +
>  static int move_ptes(struct pagetable_move_control *pmc,
>  		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>  {
> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>  	struct mm_struct *mm = vma->vm_mm;
>  	pte_t *old_ptep, *new_ptep;
> -	pte_t pte;
> +	pte_t old_pte, pte;
>  	pmd_t dummy_pmdval;
>  	spinlock_t *old_ptl, *new_ptl;
>  	bool force_flush = false;
> @@ -186,6 +203,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	unsigned long old_end = old_addr + extent;
>  	unsigned long len = old_end - old_addr;
>  	int err = 0;
> +	int max_nr;

A small nit - s/max_nr/max_nr_pages ?

>  
>  	/*
>  	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
> @@ -236,12 +254,13 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>  
> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> -				   new_ptep++, new_addr += PAGE_SIZE) {
> -		if (pte_none(ptep_get(old_ptep)))
> +	for (int nr = 1; old_addr < old_end; old_ptep += nr, old_addr += nr * PAGE_SIZE,
> +				   new_ptep += nr, new_addr += nr * PAGE_SIZE) {


> +		max_nr = (old_end - old_addr) >> PAGE_SHIFT;
> +		old_pte = ptep_get(old_ptep);
> +		if (pte_none(old_pte))
>  			continue;
>  
> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>  		/*
>  		 * If we are remapping a valid PTE, make sure
>  		 * to flush TLB before we drop the PTL for the
> @@ -253,8 +272,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		 * the TLB entry for the old mapping has been
>  		 * flushed.
>  		 */
> -		if (pte_present(pte))
> +		if (pte_present(old_pte)) {
> +			nr = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> +						    old_pte, max_nr);
>  			force_flush = true;
> +		}
> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
>  		pte = move_pte(pte, old_addr, new_addr);
>  		pte = move_soft_dirty_pte(pte);
>  
> @@ -267,7 +290,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  				else if (is_swap_pte(pte))
>  					pte = pte_swp_clear_uffd_wp(pte);
>  			}
> -			set_pte_at(mm, new_addr, new_ptep, pte);
> +			set_ptes(mm, new_addr, new_ptep, pte, nr);
>  		}
>  	}
>  


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-08  7:16   ` Anshuman Khandual
@ 2025-05-08  8:05     ` Dev Jain
  2025-05-08  9:40     ` Lorenzo Stoakes
  1 sibling, 0 replies; 23+ messages in thread
From: Dev Jain @ 2025-05-08  8:05 UTC (permalink / raw)
  To: Anshuman Khandual, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, willy, ioworker0, yang,
	baolin.wang, ziy, hughd



On 08/05/25 12:46 pm, Anshuman Khandual wrote:
> On 5/7/25 11:32, Dev Jain wrote:
>> To use PTE batching, we want to determine whether the folio mapped by
>> the PTE is large, thus requiring the use of vm_normal_folio(). We want
>> to avoid the cost of vm_normal_folio() if the code path doesn't already
>> require the folio. For arm64, pte_batch_hint() does the job. To generalize
>> this hint, add a helper which will determine whether two consecutive PTEs
>> point to consecutive PFNs, in which case there is a high probability that
>> the underlying folio is large.
>> Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>> are painted with the contig bit, then ptep_get() will iterate through all 16
>> entries to collect a/d bits. Hence this optimization will result in a 16x
>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>> will eventually call contpte_try_unfold() on every contig block, thus
>> flushing the TLB for the complete large folio range. Instead, use
>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>> do them on the starting and ending contig block.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
>>   mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
>>   2 files changed, 59 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..38dab1f562ed 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>>   }
>>   #endif
>>   
>> +/**
>> + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
>> + * to a large folio.
>> + * @ptep: Pointer to the page table entry.
>> + * @pte: The page table entry.
>> + *
>> + * This helper is invoked when the caller wants to batch over a set of ptes
>> + * mapping a large folio, but the concerned code path does not already have
>> + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
>> + * the underlying folio was small; i.e keep the small folio case as fast as
>> + * possible.
>> + *
>> + * The caller must ensure that ptep + 1 exists.
>> + */
>> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
>> +{
>> +	pte_t *next_ptep, next_pte;
>> +
>> +	if (pte_batch_hint(ptep, pte) != 1)
>> +		return true;
>> +
>> +	next_ptep = ptep + 1;
>> +	next_pte = ptep_get(next_ptep);
>> +	if (!pte_present(next_pte))
>> +		return false;
>> +
>> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);
>> +}
>> +
>>   #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>   static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>   					    unsigned long address,
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 0163e02e5aa8..9c88a276bec4 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>   	return pte;
>>   }
>>   
>> +/* mremap a batch of PTEs mapping the same large folio */
>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>> +		pte_t *ptep, pte_t pte, int max_nr)
>> +{
>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +	struct folio *folio;
>> +	int nr = 1;
> 
> A small nit - s/nr/nr_pages ?

Well, all other places nr is being used, so I would like to keep it 
simple and stick to convention :)

> 
>> +
>> +	if ((max_nr != 1) && maybe_contiguous_pte_pfns(ptep, pte)) {
> 
> Like mentioned earlier in v1, could maybe_contiguous_pte_pfns() here
> add some additional cost for buffers that are actually not mapped to
> contig physical pages.
> 
> The test case you have mentioned in the cover demonstrating performance
> gains might have always been run just after boot, thus increasing the
> probability of contiguous physical mapping, which will not be the case
> on fragmented memory systems. In that case the proposed consecutive PFN
> comparison will always happen unconditionally without any benefit ?

I think you mean to say that the underlying folio may not be actually 
large but the buddy allocator distributed consecutive physical memory.
Hmm...at this rate I am thinking that the overhead of vm_normal_folio() 
+ folio_test_large() is acceptable and is less churn :) Would like to 
hear your thoughts.

> 
> Just curious.
> 
>  From V1
> 
> --------------------------------------------------------------------
> maybe_contiguous_pte_pfns() cost will be applicable for memory
> areas greater than a single PAGE_SIZE (i.e max_nr != 1) ? This
> helper extracts an additional consecutive pte, ensures that it
> is valid mapped and extracts pfn before comparing for the span.
> 
> There is some cost associated with the above code sequence which
> looks justified for sequential access of memory buffers that has
> consecutive physical memory backing. But what happens when such
> buffers are less probable, will those buffers take a performance
> hit for all the comparisons that just turn out to be negative ?
> --------------------------------------------------------------------
> 
>> +		folio = vm_normal_folio(vma, addr, pte);
>> +		if (folio && folio_test_large(folio))
>> +			nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
>> +					     flags, NULL, NULL, NULL);
>> +	}
>> +	return nr;
>> +}
>> +
>>   static int move_ptes(struct pagetable_move_control *pmc,
>>   		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>>   {
>> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>   	struct mm_struct *mm = vma->vm_mm;
>>   	pte_t *old_ptep, *new_ptep;
>> -	pte_t pte;
>> +	pte_t old_pte, pte;
>>   	pmd_t dummy_pmdval;
>>   	spinlock_t *old_ptl, *new_ptl;
>>   	bool force_flush = false;
>> @@ -186,6 +203,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	unsigned long old_end = old_addr + extent;
>>   	unsigned long len = old_end - old_addr;
>>   	int err = 0;
>> +	int max_nr;
> 
> A small nit - s/max_nr/max_nr_pages ?
> 
>>   
>>   	/*
>>   	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
>> @@ -236,12 +254,13 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	flush_tlb_batched_pending(vma->vm_mm);
>>   	arch_enter_lazy_mmu_mode();
>>   
>> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>> -				   new_ptep++, new_addr += PAGE_SIZE) {
>> -		if (pte_none(ptep_get(old_ptep)))
>> +	for (int nr = 1; old_addr < old_end; old_ptep += nr, old_addr += nr * PAGE_SIZE,
>> +				   new_ptep += nr, new_addr += nr * PAGE_SIZE) {
> 
> 
>> +		max_nr = (old_end - old_addr) >> PAGE_SHIFT;
>> +		old_pte = ptep_get(old_ptep);
>> +		if (pte_none(old_pte))
>>   			continue;
>>   
>> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>   		/*
>>   		 * If we are remapping a valid PTE, make sure
>>   		 * to flush TLB before we drop the PTL for the
>> @@ -253,8 +272,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   		 * the TLB entry for the old mapping has been
>>   		 * flushed.
>>   		 */
>> -		if (pte_present(pte))
>> +		if (pte_present(old_pte)) {
>> +			nr = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>> +						    old_pte, max_nr);
>>   			force_flush = true;
>> +		}
>> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
>>   		pte = move_pte(pte, old_addr, new_addr);
>>   		pte = move_soft_dirty_pte(pte);
>>   
>> @@ -267,7 +290,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   				else if (is_swap_pte(pte))
>>   					pte = pte_swp_clear_uffd_wp(pte);
>>   			}
>> -			set_pte_at(mm, new_addr, new_ptep, pte);
>> +			set_ptes(mm, new_addr, new_ptep, pte, nr);
>>   		}
>>   	}
>>   



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-08  7:16   ` Anshuman Khandual
  2025-05-08  8:05     ` Dev Jain
@ 2025-05-08  9:40     ` Lorenzo Stoakes
  1 sibling, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-08  9:40 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Dev Jain, akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, willy, ioworker0, yang,
	baolin.wang, ziy, hughd

On Thu, May 08, 2025 at 12:46:41PM +0530, Anshuman Khandual wrote:
> On 5/7/25 11:32, Dev Jain wrote:
[snip]
> > +/* mremap a batch of PTEs mapping the same large folio */
> > +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> > +		pte_t *ptep, pte_t pte, int max_nr)
> > +{
> > +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> > +	struct folio *folio;
> > +	int nr = 1;
>
> A small nit - s/nr/nr_pages ?

Agree in principal, but I think nr_ptes is clearer. You might not even be
dealing with a page, if you hit a pte_none() pte for instance.

So max_nr_ptes, nr_ptes.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-07  6:02 ` [PATCH v2 2/2] mm: Optimize mremap() by PTE batching Dev Jain
                     ` (2 preceding siblings ...)
  2025-05-08  7:16   ` Anshuman Khandual
@ 2025-05-08 10:04   ` Lorenzo Stoakes
  2025-05-08 18:01     ` David Hildenbrand
  2025-05-18  8:17     ` Dev Jain
  3 siblings, 2 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-08 10:04 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

Before getting into the review, just to say thanks for refactoring as per
my (and of course other's) comments, much appreciated and big improvement!
:)

We're getting there...

On Wed, May 07, 2025 at 11:32:56AM +0530, Dev Jain wrote:
> To use PTE batching, we want to determine whether the folio mapped by
> the PTE is large, thus requiring the use of vm_normal_folio(). We want
> to avoid the cost of vm_normal_folio() if the code path doesn't already
> require the folio. For arm64, pte_batch_hint() does the job. To generalize
> this hint, add a helper which will determine whether two consecutive PTEs
> point to consecutive PFNs, in which case there is a high probability that
> the underlying folio is large.
> Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all 16
> entries to collect a/d bits. Hence this optimization will result in a 16x
> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> will eventually call contpte_try_unfold() on every contig block, thus
> flushing the TLB for the complete large folio range. Instead, use
> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> do them on the starting and ending contig block.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
>  mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
>  2 files changed, 59 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..38dab1f562ed 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>  }
>  #endif
>
> +/**
> + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
> + * to a large folio.
> + * @ptep: Pointer to the page table entry.
> + * @pte: The page table entry.
> + *
> + * This helper is invoked when the caller wants to batch over a set of ptes
> + * mapping a large folio, but the concerned code path does not already have
> + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
> + * the underlying folio was small; i.e keep the small folio case as fast as
> + * possible.
> + *
> + * The caller must ensure that ptep + 1 exists.
> + */
> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
> +{
> +	pte_t *next_ptep, next_pte;
> +
> +	if (pte_batch_hint(ptep, pte) != 1)
> +		return true;
> +
> +	next_ptep = ptep + 1;
> +	next_pte = ptep_get(next_ptep);
> +	if (!pte_present(next_pte))
> +		return false;
> +
> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);

Let's not do unlikely()'s unless we have data for them... it shouldn't mean
'what the programmer believes' :)

> +}

Yeah I'm with Andrew and Anshuman, I mean this is kind of a nasty interface
(I mean _perhaps_ unavoidably) and we've done the relevant check in
mremap_folio_pte_batch(), so let's just move it there with comments, as this

> +
>  #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  					    unsigned long address,
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 0163e02e5aa8..9c88a276bec4 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>  	return pte;
>  }
>
> +/* mremap a batch of PTEs mapping the same large folio */
> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t pte, int max_nr)
> +{
> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +	struct folio *folio;
> +	int nr = 1;
> +
> +	if ((max_nr != 1) && maybe_contiguous_pte_pfns(ptep, pte)) {
> +		folio = vm_normal_folio(vma, addr, pte);
> +		if (folio && folio_test_large(folio))
> +			nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
> +					     flags, NULL, NULL, NULL);
> +	}

This needs some refactoring, avoid nesting at all costs :)

We'll want to move the maybe_contiguous_pte_pfns() function over here, so
that'll change things, but in general let's use a guard clause.

So an if block like:

if (foo) {
	... bunch of logic ...
}

Is better replaced with a guard clause so you have:

if (!foo)
	return ...;

... bunch of logic ...

Here we could really expand things out to make things SUPER clear like:

static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
		pte_t *ptep, pte_t pte, int max_nr)
{
	const fpb_t flags;
	struct folio *folio;

	if (max_nr == 1)
		return 1;
	if (!maybe_contiguous_pte_pfns(ptep, pte)) // obviously replace with open code...
		return 1;

	folio = vm_normal_folio(vma, addr, pte);
	if (!folio)
		return 1;
	if (!folio_test_large(folio))
		return 1;

	flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
	return folio_pte_batch(folio, addr, ptep, pte, max_nr,
		flags, NULL, NULL, NULL);
}

I mean you could argue assign nr would be neater here, but you get the point!

David mentioned a point about this code over in v1 discussion (see
[0]). Trying to bring converastion here to avoid it being split across
old/new series. There he said:

David H:
> (2) Do we really need "must be part of the same folio", or could be just batch over present
> ptes that map consecutive PFNs? In that case, a helper that avoids folio_pte_batch() completely
> might be better.

Hm, if we didn't do the batch test, can we batch a split large folio here ok?
I'm guessing we can in which case this check is actually limiting...

Are we _explicitly_ only considering the cont pte case and ignoring the
split THP case?

[0]: https://lore.kernel.org/all/887fb371-409e-4dad-b4ff-38b85bfddf95@redhat.com/

And in what circumstances will the hint be set, with a present subsequent
PTE but !folio_test_large()?

I guess the hint might not be taken? But then isn't the valid check just
folio_test_large() and we don't need this batched check at all?

Is it only to avoid the split THP case?

We definitely need some clarity here, and a comment in the code explaining
what's going on as this is subtle stuff.

> +	return nr;
> +}
> +
>  static int move_ptes(struct pagetable_move_control *pmc,
>  		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>  {
> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>  	struct mm_struct *mm = vma->vm_mm;
>  	pte_t *old_ptep, *new_ptep;
> -	pte_t pte;
> +	pte_t old_pte, pte;
>  	pmd_t dummy_pmdval;
>  	spinlock_t *old_ptl, *new_ptl;
>  	bool force_flush = false;
> @@ -186,6 +203,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	unsigned long old_end = old_addr + extent;
>  	unsigned long len = old_end - old_addr;
>  	int err = 0;
> +	int max_nr;
>
>  	/*
>  	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
> @@ -236,12 +254,13 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>
> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> -				   new_ptep++, new_addr += PAGE_SIZE) {
> -		if (pte_none(ptep_get(old_ptep)))
> +	for (int nr = 1; old_addr < old_end; old_ptep += nr, old_addr += nr * PAGE_SIZE,
> +				   new_ptep += nr, new_addr += nr * PAGE_SIZE) {

Really nitty thing here but the indentation is all messed up here, I mean
nothing is going to be nice but maybe indent by two tabs below 'for'.

I'm not a fan of this declaration of nr, typically in a for loop a declaration
here would be the counter, so this is just confusing.

In the old implementation, declaring nr in the for loop would make sense,
but in the newly refactored one you should just declare it at the top.

Also as per Anshuman review, I think nr_ptes, max_nr_ptes would be better.

I don't think 'nr' needs to be initialised either, since the conditional is
'old_addr < old_end' and you _should_ only perform the

> +		max_nr = (old_end - old_addr) >> PAGE_SHIFT;
> +		old_pte = ptep_get(old_ptep);
> +		if (pte_none(old_pte))

This seems broken.

You're missing a nr assignment here, so you'll happen to offset by the
number of pages of the last folio you encountered?

Should be:

	if (pte_none(old_pte)) {
		nr_ptes = 1;
		continue;
	}

Or, alternatively, you can reset nr_ptes to 1 at the start of each loop.


>  			continue;
>
> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);

>  		/*
>  		 * If we are remapping a valid PTE, make sure
>  		 * to flush TLB before we drop the PTL for the
> @@ -253,8 +272,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		 * the TLB entry for the old mapping has been
>  		 * flushed.
>  		 */
> -		if (pte_present(pte))
> +		if (pte_present(old_pte)) {
> +			nr = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> +						    old_pte, max_nr);
>  			force_flush = true;
> +		}

Thanks this is much clearer compared to v1

> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);

Nit but...

Can we have a comment indicating what the last parameter refers to? I think
David maybe doens't like this so obviously if he prefers not that fine, but
I'm thinking something like:

pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, /*full=*/false);

I think we are good to just use 'false' here right? As it's only an int for
historical purposes...

>  		pte = move_pte(pte, old_addr, new_addr);
>  		pte = move_soft_dirty_pte(pte);
>
> @@ -267,7 +290,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  				else if (is_swap_pte(pte))
>  					pte = pte_swp_clear_uffd_wp(pte);
>  			}
> -			set_pte_at(mm, new_addr, new_ptep, pte);
> +			set_ptes(mm, new_addr, new_ptep, pte, nr);
>  		}
>  	}
>
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-08 10:04   ` Lorenzo Stoakes
@ 2025-05-08 18:01     ` David Hildenbrand
  2025-05-18  8:17     ` Dev Jain
  1 sibling, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2025-05-08 18:01 UTC (permalink / raw)
  To: Lorenzo Stoakes, Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang,
	baolin.wang, ziy, hughd

On 08.05.25 12:04, Lorenzo Stoakes wrote:
> Before getting into the review, just to say thanks for refactoring as per
> my (and of course other's) comments, much appreciated and big improvement!
> :)
> 
> We're getting there...
> 
> On Wed, May 07, 2025 at 11:32:56AM +0530, Dev Jain wrote:
>> To use PTE batching, we want to determine whether the folio mapped by
>> the PTE is large, thus requiring the use of vm_normal_folio(). We want
>> to avoid the cost of vm_normal_folio() if the code path doesn't already
>> require the folio. For arm64, pte_batch_hint() does the job. To generalize
>> this hint, add a helper which will determine whether two consecutive PTEs
>> point to consecutive PFNs, in which case there is a high probability that
>> the underlying folio is large.
>> Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>> are painted with the contig bit, then ptep_get() will iterate through all 16
>> entries to collect a/d bits. Hence this optimization will result in a 16x
>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>> will eventually call contpte_try_unfold() on every contig block, thus
>> flushing the TLB for the complete large folio range. Instead, use
>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>> do them on the starting and ending contig block.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
>>   mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
>>   2 files changed, 59 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..38dab1f562ed 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>>   }
>>   #endif
>>
>> +/**
>> + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
>> + * to a large folio.
>> + * @ptep: Pointer to the page table entry.
>> + * @pte: The page table entry.
>> + *
>> + * This helper is invoked when the caller wants to batch over a set of ptes
>> + * mapping a large folio, but the concerned code path does not already have
>> + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
>> + * the underlying folio was small; i.e keep the small folio case as fast as
>> + * possible.
>> + *
>> + * The caller must ensure that ptep + 1 exists.
>> + */
>> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
>> +{
>> +	pte_t *next_ptep, next_pte;
>> +
>> +	if (pte_batch_hint(ptep, pte) != 1)
>> +		return true;
>> +
>> +	next_ptep = ptep + 1;
>> +	next_pte = ptep_get(next_ptep);
>> +	if (!pte_present(next_pte))
>> +		return false;
>> +
>> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);
> 
> Let's not do unlikely()'s unless we have data for them... it shouldn't mean
> 'what the programmer believes' :)
> 
>> +}
> 
> Yeah I'm with Andrew and Anshuman, I mean this is kind of a nasty interface
> (I mean _perhaps_ unavoidably) and we've done the relevant check in
> mremap_folio_pte_batch(), so let's just move it there with comments, as this
> 
>> +
>>   #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>   static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>   					    unsigned long address,
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 0163e02e5aa8..9c88a276bec4 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>   	return pte;
>>   }
>>
>> +/* mremap a batch of PTEs mapping the same large folio */
>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>> +		pte_t *ptep, pte_t pte, int max_nr)
>> +{
>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +	struct folio *folio;
>> +	int nr = 1;
>> +
>> +	if ((max_nr != 1) && maybe_contiguous_pte_pfns(ptep, pte)) {
>> +		folio = vm_normal_folio(vma, addr, pte);
>> +		if (folio && folio_test_large(folio))
>> +			nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
>> +					     flags, NULL, NULL, NULL);
>> +	}
> 
> This needs some refactoring, avoid nesting at all costs :)
> 
> We'll want to move the maybe_contiguous_pte_pfns() function over here, so
> that'll change things, but in general let's use a guard clause.
> 
> So an if block like:
> 
> if (foo) {
> 	... bunch of logic ...
> }
> 
> Is better replaced with a guard clause so you have:
> 
> if (!foo)
> 	return ...;
> 
> ... bunch of logic ...
> 
> Here we could really expand things out to make things SUPER clear like:
> 
> static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> 		pte_t *ptep, pte_t pte, int max_nr)
> {
> 	const fpb_t flags;
> 	struct folio *folio;
> 
> 	if (max_nr == 1)
> 		return 1;
> 	if (!maybe_contiguous_pte_pfns(ptep, pte)) // obviously replace with open code...
> 		return 1;
> 
> 	folio = vm_normal_folio(vma, addr, pte);
> 	if (!folio)
> 		return 1;
> 	if (!folio_test_large(folio))
> 		return 1;
> 
> 	flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> 	return folio_pte_batch(folio, addr, ptep, pte, max_nr,
> 		flags, NULL, NULL, NULL);
> }
> 
> I mean you could argue assign nr would be neater here, but you get the point!
> 
> David mentioned a point about this code over in v1 discussion (see
> [0]). Trying to bring converastion here to avoid it being split across
> old/new series. There he said:
> 
> David H:
>> (2) Do we really need "must be part of the same folio", or could be just batch over present
>> ptes that map consecutive PFNs? In that case, a helper that avoids folio_pte_batch() completely
>> might be better.
> 
> Hm, if we didn't do the batch test, can we batch a split large folio here ok?
> I'm guessing we can in which case this check is actually limiting...
> 
> Are we _explicitly_ only considering the cont pte case and ignoring the
> split THP case?
> 
> [0]: https://lore.kernel.org/all/887fb371-409e-4dad-b4ff-38b85bfddf95@redhat.com/
> 
> And in what circumstances will the hint be set, with a present subsequent
> PTE but !folio_test_large()?
> 
> I guess the hint might not be taken? But then isn't the valid check just
> folio_test_large() and we don't need this batched check at all?
> 
> Is it only to avoid the split THP case?
> 
> We definitely need some clarity here, and a comment in the code explaining
> what's going on as this is subtle stuff.


The whole maybe_contiguous_pte_pfns() is really far from nice. Let's add 
something like that *only if absolutely required* for performance 
reasons and on top of this patch.

But let's clarify if we have to limit ourselves to a single folio at all.

Staring at it, I think the tricky bit is:

	pte = get_and_clear_full_ptes();


That means that

a) if any PTE is dirty, the resulting PTE will be dirty.
b) if any PTE is young, the resulting PTE will be young.

So we could be marking PTEs as dirty/young from unrelated folios.

That's problematic.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-08 10:04   ` Lorenzo Stoakes
  2025-05-08 18:01     ` David Hildenbrand
@ 2025-05-18  8:17     ` Dev Jain
  2025-05-19  9:04       ` Lorenzo Stoakes
  1 sibling, 1 reply; 23+ messages in thread
From: Dev Jain @ 2025-05-18  8:17 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd



On 08/05/25 3:34 pm, Lorenzo Stoakes wrote:
> Before getting into the review, just to say thanks for refactoring as per
> my (and of course other's) comments, much appreciated and big improvement!
> :)
> 
> We're getting there...
> 
> On Wed, May 07, 2025 at 11:32:56AM +0530, Dev Jain wrote:
>> To use PTE batching, we want to determine whether the folio mapped by
>> the PTE is large, thus requiring the use of vm_normal_folio(). We want
>> to avoid the cost of vm_normal_folio() if the code path doesn't already
>> require the folio. For arm64, pte_batch_hint() does the job. To generalize
>> this hint, add a helper which will determine whether two consecutive PTEs
>> point to consecutive PFNs, in which case there is a high probability that
>> the underlying folio is large.
>> Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>> are painted with the contig bit, then ptep_get() will iterate through all 16
>> entries to collect a/d bits. Hence this optimization will result in a 16x
>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>> will eventually call contpte_try_unfold() on every contig block, thus
>> flushing the TLB for the complete large folio range. Instead, use
>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>> do them on the starting and ending contig block.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
>>   mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
>>   2 files changed, 59 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..38dab1f562ed 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>>   }
>>   #endif
>>
>> +/**
>> + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
>> + * to a large folio.
>> + * @ptep: Pointer to the page table entry.
>> + * @pte: The page table entry.
>> + *
>> + * This helper is invoked when the caller wants to batch over a set of ptes
>> + * mapping a large folio, but the concerned code path does not already have
>> + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
>> + * the underlying folio was small; i.e keep the small folio case as fast as
>> + * possible.
>> + *
>> + * The caller must ensure that ptep + 1 exists.
>> + */
>> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
>> +{
>> +	pte_t *next_ptep, next_pte;
>> +
>> +	if (pte_batch_hint(ptep, pte) != 1)
>> +		return true;
>> +
>> +	next_ptep = ptep + 1;
>> +	next_pte = ptep_get(next_ptep);
>> +	if (!pte_present(next_pte))
>> +		return false;
>> +
>> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);
> 
> Let's not do unlikely()'s unless we have data for them... it shouldn't mean
> 'what the programmer believes' :)
> 
>> +}
> 
> Yeah I'm with Andrew and Anshuman, I mean this is kind of a nasty interface
> (I mean _perhaps_ unavoidably) and we've done the relevant check in
> mremap_folio_pte_batch(), so let's just move it there with comments, as this
> 
>> +
>>   #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>   static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>   					    unsigned long address,
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 0163e02e5aa8..9c88a276bec4 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>   	return pte;
>>   }
>>
>> +/* mremap a batch of PTEs mapping the same large folio */
>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>> +		pte_t *ptep, pte_t pte, int max_nr)
>> +{
>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +	struct folio *folio;
>> +	int nr = 1;
>> +
>> +	if ((max_nr != 1) && maybe_contiguous_pte_pfns(ptep, pte)) {
>> +		folio = vm_normal_folio(vma, addr, pte);
>> +		if (folio && folio_test_large(folio))
>> +			nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
>> +					     flags, NULL, NULL, NULL);
>> +	}
> 
> This needs some refactoring, avoid nesting at all costs :)
> 
> We'll want to move the maybe_contiguous_pte_pfns() function over here, so
> that'll change things, but in general let's use a guard clause.
> 
> So an if block like:
> 
> if (foo) {
> 	... bunch of logic ...
> }
> 
> Is better replaced with a guard clause so you have:
> 
> if (!foo)
> 	return ...;
> 
> ... bunch of logic ...
> 
> Here we could really expand things out to make things SUPER clear like:
> 
> static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> 		pte_t *ptep, pte_t pte, int max_nr)
> {
> 	const fpb_t flags;
> 	struct folio *folio;
> 
> 	if (max_nr == 1)
> 		return 1;
> 	if (!maybe_contiguous_pte_pfns(ptep, pte)) // obviously replace with open code...
> 		return 1;
> 
> 	folio = vm_normal_folio(vma, addr, pte);
> 	if (!folio)
> 		return 1;
> 	if (!folio_test_large(folio))
> 		return 1;
> 
> 	flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> 	return folio_pte_batch(folio, addr, ptep, pte, max_nr,
> 		flags, NULL, NULL, NULL);
> }
> 
> I mean you could argue assign nr would be neater here, but you get the point!
> 
> David mentioned a point about this code over in v1 discussion (see
> [0]). Trying to bring converastion here to avoid it being split across
> old/new series. There he said:
> 
> David H:
>> (2) Do we really need "must be part of the same folio", or could be just batch over present
>> ptes that map consecutive PFNs? In that case, a helper that avoids folio_pte_batch() completely
>> might be better.
> 
> Hm, if we didn't do the batch test, can we batch a split large folio here ok?
> I'm guessing we can in which case this check is actually limiting...
> 
> Are we _explicitly_ only considering the cont pte case and ignoring the
> split THP case?
> 
> [0]: https://lore.kernel.org/all/887fb371-409e-4dad-b4ff-38b85bfddf95@redhat.com/
> 
> And in what circumstances will the hint be set, with a present subsequent
> PTE but !folio_test_large()?
> 
> I guess the hint might not be taken? But then isn't the valid check just
> folio_test_large() and we don't need this batched check at all?
> 
> Is it only to avoid the split THP case?
> 
> We definitely need some clarity here, and a comment in the code explaining
> what's going on as this is subtle stuff.

I am focussed only on batching large folios. Split THPs won't be 
batched; you can use pte_batch() (from David's refactoring) and
figure the split THP batch out, but then get_and_clear_full_ptes()
will be gathering a/d bits and smearing them across the batch, which 
will be incorrect. Even if we introduce a new version of 
get_and_clear_full_ptes() which does not gather a/d bits, then if the 
pte_batch actually belongs to a folio, then we will *not* be smearing 
a/d bits, which is again wrong. So in any case we must know what the 
underlying folio looks like :) So my agenda for v3 is,

- Incorporate your refactoring comments
- Remove maybe_contiguous_pte_pfns and just use vm_normal_folio + 
folio_test_large
- Fix indentation

Sounds good?

> 
>> +	return nr;
>> +}
>> +
>>   static int move_ptes(struct pagetable_move_control *pmc,
>>   		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>>   {
>> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>   	struct mm_struct *mm = vma->vm_mm;
>>   	pte_t *old_ptep, *new_ptep;
>> -	pte_t pte;
>> +	pte_t old_pte, pte;
>>   	pmd_t dummy_pmdval;
>>   	spinlock_t *old_ptl, *new_ptl;
>>   	bool force_flush = false;
>> @@ -186,6 +203,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	unsigned long old_end = old_addr + extent;
>>   	unsigned long len = old_end - old_addr;
>>   	int err = 0;
>> +	int max_nr;
>>
>>   	/*
>>   	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
>> @@ -236,12 +254,13 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	flush_tlb_batched_pending(vma->vm_mm);
>>   	arch_enter_lazy_mmu_mode();
>>
>> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>> -				   new_ptep++, new_addr += PAGE_SIZE) {
>> -		if (pte_none(ptep_get(old_ptep)))
>> +	for (int nr = 1; old_addr < old_end; old_ptep += nr, old_addr += nr * PAGE_SIZE,
>> +				   new_ptep += nr, new_addr += nr * PAGE_SIZE) {
> 
> Really nitty thing here but the indentation is all messed up here, I mean
> nothing is going to be nice but maybe indent by two tabs below 'for'.
> 
> I'm not a fan of this declaration of nr, typically in a for loop a declaration
> here would be the counter, so this is just confusing.
> 
> In the old implementation, declaring nr in the for loop would make sense,
> but in the newly refactored one you should just declare it at the top.
> 
> Also as per Anshuman review, I think nr_ptes, max_nr_ptes would be better.
> 
> I don't think 'nr' needs to be initialised either, since the conditional is
> 'old_addr < old_end' and you _should_ only perform the
> 
>> +		max_nr = (old_end - old_addr) >> PAGE_SHIFT;
>> +		old_pte = ptep_get(old_ptep);
>> +		if (pte_none(old_pte))
> 
> This seems broken.
> 
> You're missing a nr assignment here, so you'll happen to offset by the
> number of pages of the last folio you encountered?
> 
> Should be:
> 
> 	if (pte_none(old_pte)) {
> 		nr_ptes = 1;
> 		continue;
> 	}
> 
> Or, alternatively, you can reset nr_ptes to 1 at the start of each loop.
> 
> 
>>   			continue;
>>
>> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
> 
>>   		/*
>>   		 * If we are remapping a valid PTE, make sure
>>   		 * to flush TLB before we drop the PTL for the
>> @@ -253,8 +272,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   		 * the TLB entry for the old mapping has been
>>   		 * flushed.
>>   		 */
>> -		if (pte_present(pte))
>> +		if (pte_present(old_pte)) {
>> +			nr = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>> +						    old_pte, max_nr);
>>   			force_flush = true;
>> +		}
> 
> Thanks this is much clearer compared to v1
> 
>> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
> 
> Nit but...
> 
> Can we have a comment indicating what the last parameter refers to? I think
> David maybe doens't like this so obviously if he prefers not that fine, but
> I'm thinking something like:
> 
> pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, /*full=*/false);
> 
> I think we are good to just use 'false' here right? As it's only an int for
> historical purposes...
> 
>>   		pte = move_pte(pte, old_addr, new_addr);
>>   		pte = move_soft_dirty_pte(pte);
>>
>> @@ -267,7 +290,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   				else if (is_swap_pte(pte))
>>   					pte = pte_swp_clear_uffd_wp(pte);
>>   			}
>> -			set_pte_at(mm, new_addr, new_ptep, pte);
>> +			set_ptes(mm, new_addr, new_ptep, pte, nr);
>>   		}
>>   	}
>>
>> --
>> 2.30.2
>>



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-18  8:17     ` Dev Jain
@ 2025-05-19  9:04       ` Lorenzo Stoakes
  2025-05-20  9:21         ` Dev Jain
  0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-19  9:04 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Sun, May 18, 2025 at 01:47:35PM +0530, Dev Jain wrote:
>
>
> On 08/05/25 3:34 pm, Lorenzo Stoakes wrote:
> > Before getting into the review, just to say thanks for refactoring as per
> > my (and of course other's) comments, much appreciated and big improvement!
> > :)
> >
> > We're getting there...
> >
> > On Wed, May 07, 2025 at 11:32:56AM +0530, Dev Jain wrote:
> > > To use PTE batching, we want to determine whether the folio mapped by
> > > the PTE is large, thus requiring the use of vm_normal_folio(). We want
> > > to avoid the cost of vm_normal_folio() if the code path doesn't already
> > > require the folio. For arm64, pte_batch_hint() does the job. To generalize
> > > this hint, add a helper which will determine whether two consecutive PTEs
> > > point to consecutive PFNs, in which case there is a high probability that
> > > the underlying folio is large.
> > > Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> > > are painted with the contig bit, then ptep_get() will iterate through all 16
> > > entries to collect a/d bits. Hence this optimization will result in a 16x
> > > reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> > > will eventually call contpte_try_unfold() on every contig block, thus
> > > flushing the TLB for the complete large folio range. Instead, use
> > > get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> > > do them on the starting and ending contig block.
> > >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > >   include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
> > >   mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
> > >   2 files changed, 59 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > index b50447ef1c92..38dab1f562ed 100644
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> > >   }
> > >   #endif
> > >
> > > +/**
> > > + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
> > > + * to a large folio.
> > > + * @ptep: Pointer to the page table entry.
> > > + * @pte: The page table entry.
> > > + *
> > > + * This helper is invoked when the caller wants to batch over a set of ptes
> > > + * mapping a large folio, but the concerned code path does not already have
> > > + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
> > > + * the underlying folio was small; i.e keep the small folio case as fast as
> > > + * possible.
> > > + *
> > > + * The caller must ensure that ptep + 1 exists.
> > > + */
> > > +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
> > > +{
> > > +	pte_t *next_ptep, next_pte;
> > > +
> > > +	if (pte_batch_hint(ptep, pte) != 1)
> > > +		return true;
> > > +
> > > +	next_ptep = ptep + 1;
> > > +	next_pte = ptep_get(next_ptep);
> > > +	if (!pte_present(next_pte))
> > > +		return false;
> > > +
> > > +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);
> >
> > Let's not do unlikely()'s unless we have data for them... it shouldn't mean
> > 'what the programmer believes' :)
> >
> > > +}
> >
> > Yeah I'm with Andrew and Anshuman, I mean this is kind of a nasty interface
> > (I mean _perhaps_ unavoidably) and we've done the relevant check in
> > mremap_folio_pte_batch(), so let's just move it there with comments, as this
> >
> > > +
> > >   #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> > >   static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> > >   					    unsigned long address,
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 0163e02e5aa8..9c88a276bec4 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
> > >   	return pte;
> > >   }
> > >
> > > +/* mremap a batch of PTEs mapping the same large folio */
> > > +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> > > +		pte_t *ptep, pte_t pte, int max_nr)
> > > +{
> > > +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> > > +	struct folio *folio;
> > > +	int nr = 1;
> > > +
> > > +	if ((max_nr != 1) && maybe_contiguous_pte_pfns(ptep, pte)) {
> > > +		folio = vm_normal_folio(vma, addr, pte);
> > > +		if (folio && folio_test_large(folio))
> > > +			nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
> > > +					     flags, NULL, NULL, NULL);
> > > +	}
> >
> > This needs some refactoring, avoid nesting at all costs :)
> >
> > We'll want to move the maybe_contiguous_pte_pfns() function over here, so
> > that'll change things, but in general let's use a guard clause.
> >
> > So an if block like:
> >
> > if (foo) {
> > 	... bunch of logic ...
> > }
> >
> > Is better replaced with a guard clause so you have:
> >
> > if (!foo)
> > 	return ...;
> >
> > ... bunch of logic ...
> >
> > Here we could really expand things out to make things SUPER clear like:
> >
> > static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> > 		pte_t *ptep, pte_t pte, int max_nr)
> > {
> > 	const fpb_t flags;
> > 	struct folio *folio;
> >
> > 	if (max_nr == 1)
> > 		return 1;
> > 	if (!maybe_contiguous_pte_pfns(ptep, pte)) // obviously replace with open code...
> > 		return 1;
> >
> > 	folio = vm_normal_folio(vma, addr, pte);
> > 	if (!folio)
> > 		return 1;
> > 	if (!folio_test_large(folio))
> > 		return 1;
> >
> > 	flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> > 	return folio_pte_batch(folio, addr, ptep, pte, max_nr,
> > 		flags, NULL, NULL, NULL);
> > }
> >
> > I mean you could argue assign nr would be neater here, but you get the point!
> >
> > David mentioned a point about this code over in v1 discussion (see
> > [0]). Trying to bring converastion here to avoid it being split across
> > old/new series. There he said:
> >
> > David H:
> > > (2) Do we really need "must be part of the same folio", or could be just batch over present
> > > ptes that map consecutive PFNs? In that case, a helper that avoids folio_pte_batch() completely
> > > might be better.
> >
> > Hm, if we didn't do the batch test, can we batch a split large folio here ok?
> > I'm guessing we can in which case this check is actually limiting...
> >
> > Are we _explicitly_ only considering the cont pte case and ignoring the
> > split THP case?
> >
> > [0]: https://lore.kernel.org/all/887fb371-409e-4dad-b4ff-38b85bfddf95@redhat.com/
> >
> > And in what circumstances will the hint be set, with a present subsequent
> > PTE but !folio_test_large()?
> >
> > I guess the hint might not be taken? But then isn't the valid check just
> > folio_test_large() and we don't need this batched check at all?
> >
> > Is it only to avoid the split THP case?
> >
> > We definitely need some clarity here, and a comment in the code explaining
> > what's going on as this is subtle stuff.
>
> I am focussed only on batching large folios. Split THPs won't be batched;
> you can use pte_batch() (from David's refactoring) and
> figure the split THP batch out, but then get_and_clear_full_ptes()
> will be gathering a/d bits and smearing them across the batch, which will be
> incorrect. Even if we introduce a new version of get_and_clear_full_ptes()
> which does not gather a/d bits, then if the pte_batch actually belongs to a
> folio, then we will *not* be smearing a/d bits, which is again wrong. So in
> any case we must know what the underlying folio looks like :) So my agenda
> for v3 is,

Right, ack, there being a large folio per se doesn't mean A/D collection is
appropriate in THP case.

I guess then this is really _only_ about the mTHP case, where essentially
the other PTEs are to be disregarded going forwards?

>
> - Incorporate your refactoring comments
> - Remove maybe_contiguous_pte_pfns and just use vm_normal_folio +
> folio_test_large
> - Fix indentation
>
> Sounds good?

Sure, but can we hold off until the mprotect() stuff is done first please?
I mean obviously you're free to do things as you like, but this will help
workload-wise on my side :>)

Thanks!

>
> >
> > > +	return nr;
> > > +}
> > > +
> > >   static int move_ptes(struct pagetable_move_control *pmc,
> > >   		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
> > >   {
> > > @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
> > >   	struct mm_struct *mm = vma->vm_mm;
> > >   	pte_t *old_ptep, *new_ptep;
> > > -	pte_t pte;
> > > +	pte_t old_pte, pte;
> > >   	pmd_t dummy_pmdval;
> > >   	spinlock_t *old_ptl, *new_ptl;
> > >   	bool force_flush = false;
> > > @@ -186,6 +203,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	unsigned long old_end = old_addr + extent;
> > >   	unsigned long len = old_end - old_addr;
> > >   	int err = 0;
> > > +	int max_nr;
> > >
> > >   	/*
> > >   	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
> > > @@ -236,12 +254,13 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	flush_tlb_batched_pending(vma->vm_mm);
> > >   	arch_enter_lazy_mmu_mode();
> > >
> > > -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> > > -				   new_ptep++, new_addr += PAGE_SIZE) {
> > > -		if (pte_none(ptep_get(old_ptep)))
> > > +	for (int nr = 1; old_addr < old_end; old_ptep += nr, old_addr += nr * PAGE_SIZE,
> > > +				   new_ptep += nr, new_addr += nr * PAGE_SIZE) {
> >
> > Really nitty thing here but the indentation is all messed up here, I mean
> > nothing is going to be nice but maybe indent by two tabs below 'for'.
> >
> > I'm not a fan of this declaration of nr, typically in a for loop a declaration
> > here would be the counter, so this is just confusing.
> >
> > In the old implementation, declaring nr in the for loop would make sense,
> > but in the newly refactored one you should just declare it at the top.
> >
> > Also as per Anshuman review, I think nr_ptes, max_nr_ptes would be better.
> >
> > I don't think 'nr' needs to be initialised either, since the conditional is
> > 'old_addr < old_end' and you _should_ only perform the
> >
> > > +		max_nr = (old_end - old_addr) >> PAGE_SHIFT;
> > > +		old_pte = ptep_get(old_ptep);
> > > +		if (pte_none(old_pte))
> >
> > This seems broken.
> >
> > You're missing a nr assignment here, so you'll happen to offset by the
> > number of pages of the last folio you encountered?
> >
> > Should be:
> >
> > 	if (pte_none(old_pte)) {
> > 		nr_ptes = 1;
> > 		continue;
> > 	}
> >
> > Or, alternatively, you can reset nr_ptes to 1 at the start of each loop.
> >
> >
> > >   			continue;
> > >
> > > -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
> >
> > >   		/*
> > >   		 * If we are remapping a valid PTE, make sure
> > >   		 * to flush TLB before we drop the PTL for the
> > > @@ -253,8 +272,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   		 * the TLB entry for the old mapping has been
> > >   		 * flushed.
> > >   		 */
> > > -		if (pte_present(pte))
> > > +		if (pte_present(old_pte)) {
> > > +			nr = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> > > +						    old_pte, max_nr);
> > >   			force_flush = true;
> > > +		}
> >
> > Thanks this is much clearer compared to v1
> >
> > > +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
> >
> > Nit but...
> >
> > Can we have a comment indicating what the last parameter refers to? I think
> > David maybe doens't like this so obviously if he prefers not that fine, but
> > I'm thinking something like:
> >
> > pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, /*full=*/false);
> >
> > I think we are good to just use 'false' here right? As it's only an int for
> > historical purposes...
> >
> > >   		pte = move_pte(pte, old_addr, new_addr);
> > >   		pte = move_soft_dirty_pte(pte);
> > >
> > > @@ -267,7 +290,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   				else if (is_swap_pte(pte))
> > >   					pte = pte_swp_clear_uffd_wp(pte);
> > >   			}
> > > -			set_pte_at(mm, new_addr, new_ptep, pte);
> > > +			set_ptes(mm, new_addr, new_ptep, pte, nr);
> > >   		}
> > >   	}
> > >
> > > --
> > > 2.30.2
> > >
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
  2025-05-19  9:04       ` Lorenzo Stoakes
@ 2025-05-20  9:21         ` Dev Jain
  0 siblings, 0 replies; 23+ messages in thread
From: Dev Jain @ 2025-05-20  9:21 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd



On 19/05/25 2:34 pm, Lorenzo Stoakes wrote:
> On Sun, May 18, 2025 at 01:47:35PM +0530, Dev Jain wrote:
>>
>>
>> On 08/05/25 3:34 pm, Lorenzo Stoakes wrote:
>>> Before getting into the review, just to say thanks for refactoring as per
>>> my (and of course other's) comments, much appreciated and big improvement!
>>> :)
>>>
>>> We're getting there...
>>>
>>> On Wed, May 07, 2025 at 11:32:56AM +0530, Dev Jain wrote:
>>>> To use PTE batching, we want to determine whether the folio mapped by
>>>> the PTE is large, thus requiring the use of vm_normal_folio(). We want
>>>> to avoid the cost of vm_normal_folio() if the code path doesn't already
>>>> require the folio. For arm64, pte_batch_hint() does the job. To generalize
>>>> this hint, add a helper which will determine whether two consecutive PTEs
>>>> point to consecutive PFNs, in which case there is a high probability that
>>>> the underlying folio is large.
>>>> Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>>>> are painted with the contig bit, then ptep_get() will iterate through all 16
>>>> entries to collect a/d bits. Hence this optimization will result in a 16x
>>>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>>>> will eventually call contpte_try_unfold() on every contig block, thus
>>>> flushing the TLB for the complete large folio range. Instead, use
>>>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>>>> do them on the starting and ending contig block.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>    include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
>>>>    mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
>>>>    2 files changed, 59 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index b50447ef1c92..38dab1f562ed 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>>>>    }
>>>>    #endif
>>>>
>>>> +/**
>>>> + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
>>>> + * to a large folio.
>>>> + * @ptep: Pointer to the page table entry.
>>>> + * @pte: The page table entry.
>>>> + *
>>>> + * This helper is invoked when the caller wants to batch over a set of ptes
>>>> + * mapping a large folio, but the concerned code path does not already have
>>>> + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
>>>> + * the underlying folio was small; i.e keep the small folio case as fast as
>>>> + * possible.
>>>> + *
>>>> + * The caller must ensure that ptep + 1 exists.
>>>> + */
>>>> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
>>>> +{
>>>> +	pte_t *next_ptep, next_pte;
>>>> +
>>>> +	if (pte_batch_hint(ptep, pte) != 1)
>>>> +		return true;
>>>> +
>>>> +	next_ptep = ptep + 1;
>>>> +	next_pte = ptep_get(next_ptep);
>>>> +	if (!pte_present(next_pte))
>>>> +		return false;
>>>> +
>>>> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);
>>>
>>> Let's not do unlikely()'s unless we have data for them... it shouldn't mean
>>> 'what the programmer believes' :)
>>>
>>>> +}
>>>
>>> Yeah I'm with Andrew and Anshuman, I mean this is kind of a nasty interface
>>> (I mean _perhaps_ unavoidably) and we've done the relevant check in
>>> mremap_folio_pte_batch(), so let's just move it there with comments, as this
>>>
>>>> +
>>>>    #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>>>    static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>    					    unsigned long address,
>>>> diff --git a/mm/mremap.c b/mm/mremap.c
>>>> index 0163e02e5aa8..9c88a276bec4 100644
>>>> --- a/mm/mremap.c
>>>> +++ b/mm/mremap.c
>>>> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>>>    	return pte;
>>>>    }
>>>>
>>>> +/* mremap a batch of PTEs mapping the same large folio */
>>>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>>>> +		pte_t *ptep, pte_t pte, int max_nr)
>>>> +{
>>>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>> +	struct folio *folio;
>>>> +	int nr = 1;
>>>> +
>>>> +	if ((max_nr != 1) && maybe_contiguous_pte_pfns(ptep, pte)) {
>>>> +		folio = vm_normal_folio(vma, addr, pte);
>>>> +		if (folio && folio_test_large(folio))
>>>> +			nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
>>>> +					     flags, NULL, NULL, NULL);
>>>> +	}
>>>
>>> This needs some refactoring, avoid nesting at all costs :)
>>>
>>> We'll want to move the maybe_contiguous_pte_pfns() function over here, so
>>> that'll change things, but in general let's use a guard clause.
>>>
>>> So an if block like:
>>>
>>> if (foo) {
>>> 	... bunch of logic ...
>>> }
>>>
>>> Is better replaced with a guard clause so you have:
>>>
>>> if (!foo)
>>> 	return ...;
>>>
>>> ... bunch of logic ...
>>>
>>> Here we could really expand things out to make things SUPER clear like:
>>>
>>> static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>>> 		pte_t *ptep, pte_t pte, int max_nr)
>>> {
>>> 	const fpb_t flags;
>>> 	struct folio *folio;
>>>
>>> 	if (max_nr == 1)
>>> 		return 1;
>>> 	if (!maybe_contiguous_pte_pfns(ptep, pte)) // obviously replace with open code...
>>> 		return 1;
>>>
>>> 	folio = vm_normal_folio(vma, addr, pte);
>>> 	if (!folio)
>>> 		return 1;
>>> 	if (!folio_test_large(folio))
>>> 		return 1;
>>>
>>> 	flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> 	return folio_pte_batch(folio, addr, ptep, pte, max_nr,
>>> 		flags, NULL, NULL, NULL);
>>> }
>>>
>>> I mean you could argue assign nr would be neater here, but you get the point!
>>>
>>> David mentioned a point about this code over in v1 discussion (see
>>> [0]). Trying to bring converastion here to avoid it being split across
>>> old/new series. There he said:
>>>
>>> David H:
>>>> (2) Do we really need "must be part of the same folio", or could be just batch over present
>>>> ptes that map consecutive PFNs? In that case, a helper that avoids folio_pte_batch() completely
>>>> might be better.
>>>
>>> Hm, if we didn't do the batch test, can we batch a split large folio here ok?
>>> I'm guessing we can in which case this check is actually limiting...
>>>
>>> Are we _explicitly_ only considering the cont pte case and ignoring the
>>> split THP case?
>>>
>>> [0]: https://lore.kernel.org/all/887fb371-409e-4dad-b4ff-38b85bfddf95@redhat.com/
>>>
>>> And in what circumstances will the hint be set, with a present subsequent
>>> PTE but !folio_test_large()?
>>>
>>> I guess the hint might not be taken? But then isn't the valid check just
>>> folio_test_large() and we don't need this batched check at all?
>>>
>>> Is it only to avoid the split THP case?
>>>
>>> We definitely need some clarity here, and a comment in the code explaining
>>> what's going on as this is subtle stuff.
>>
>> I am focussed only on batching large folios. Split THPs won't be batched;
>> you can use pte_batch() (from David's refactoring) and
>> figure the split THP batch out, but then get_and_clear_full_ptes()
>> will be gathering a/d bits and smearing them across the batch, which will be
>> incorrect. Even if we introduce a new version of get_and_clear_full_ptes()
>> which does not gather a/d bits, then if the pte_batch actually belongs to a
>> folio, then we will *not* be smearing a/d bits, which is again wrong. So in
>> any case we must know what the underlying folio looks like :) So my agenda
>> for v3 is,
> 
> Right, ack, there being a large folio per se doesn't mean A/D collection is
> appropriate in THP case.
> 
> I guess then this is really _only_ about the mTHP case, where essentially
> the other PTEs are to be disregarded going forwards?

Yes.

> 
>>
>> - Incorporate your refactoring comments
>> - Remove maybe_contiguous_pte_pfns and just use vm_normal_folio +
>> folio_test_large
>> - Fix indentation
>>
>> Sounds good?
> 
> Sure, but can we hold off until the mprotect() stuff is done first please?
> I mean obviously you're free to do things as you like, but this will help
> workload-wise on my side :>)
> 
> Thanks!

No problem, thanks for reviewing!

> 
>>
>>>
>>>> +	return nr;
>>>> +}
>>>> +
>>>>    static int move_ptes(struct pagetable_move_control *pmc,
>>>>    		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>>>>    {
>>>> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>>>    	struct mm_struct *mm = vma->vm_mm;
>>>>    	pte_t *old_ptep, *new_ptep;
>>>> -	pte_t pte;
>>>> +	pte_t old_pte, pte;
>>>>    	pmd_t dummy_pmdval;
>>>>    	spinlock_t *old_ptl, *new_ptl;
>>>>    	bool force_flush = false;
>>>> @@ -186,6 +203,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    	unsigned long old_end = old_addr + extent;
>>>>    	unsigned long len = old_end - old_addr;
>>>>    	int err = 0;
>>>> +	int max_nr;
>>>>
>>>>    	/*
>>>>    	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
>>>> @@ -236,12 +254,13 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    	flush_tlb_batched_pending(vma->vm_mm);
>>>>    	arch_enter_lazy_mmu_mode();
>>>>
>>>> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>>>> -				   new_ptep++, new_addr += PAGE_SIZE) {
>>>> -		if (pte_none(ptep_get(old_ptep)))
>>>> +	for (int nr = 1; old_addr < old_end; old_ptep += nr, old_addr += nr * PAGE_SIZE,
>>>> +				   new_ptep += nr, new_addr += nr * PAGE_SIZE) {
>>>
>>> Really nitty thing here but the indentation is all messed up here, I mean
>>> nothing is going to be nice but maybe indent by two tabs below 'for'.
>>>
>>> I'm not a fan of this declaration of nr, typically in a for loop a declaration
>>> here would be the counter, so this is just confusing.
>>>
>>> In the old implementation, declaring nr in the for loop would make sense,
>>> but in the newly refactored one you should just declare it at the top.
>>>
>>> Also as per Anshuman review, I think nr_ptes, max_nr_ptes would be better.
>>>
>>> I don't think 'nr' needs to be initialised either, since the conditional is
>>> 'old_addr < old_end' and you _should_ only perform the
>>>
>>>> +		max_nr = (old_end - old_addr) >> PAGE_SHIFT;
>>>> +		old_pte = ptep_get(old_ptep);
>>>> +		if (pte_none(old_pte))
>>>
>>> This seems broken.
>>>
>>> You're missing a nr assignment here, so you'll happen to offset by the
>>> number of pages of the last folio you encountered?
>>>
>>> Should be:
>>>
>>> 	if (pte_none(old_pte)) {
>>> 		nr_ptes = 1;
>>> 		continue;
>>> 	}
>>>
>>> Or, alternatively, you can reset nr_ptes to 1 at the start of each loop.
>>>
>>>
>>>>    			continue;
>>>>
>>>> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>>
>>>>    		/*
>>>>    		 * If we are remapping a valid PTE, make sure
>>>>    		 * to flush TLB before we drop the PTL for the
>>>> @@ -253,8 +272,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    		 * the TLB entry for the old mapping has been
>>>>    		 * flushed.
>>>>    		 */
>>>> -		if (pte_present(pte))
>>>> +		if (pte_present(old_pte)) {
>>>> +			nr = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>>>> +						    old_pte, max_nr);
>>>>    			force_flush = true;
>>>> +		}
>>>
>>> Thanks this is much clearer compared to v1
>>>
>>>> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
>>>
>>> Nit but...
>>>
>>> Can we have a comment indicating what the last parameter refers to? I think
>>> David maybe doens't like this so obviously if he prefers not that fine, but
>>> I'm thinking something like:
>>>
>>> pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, /*full=*/false);
>>>
>>> I think we are good to just use 'false' here right? As it's only an int for
>>> historical purposes...
>>>
>>>>    		pte = move_pte(pte, old_addr, new_addr);
>>>>    		pte = move_soft_dirty_pte(pte);
>>>>
>>>> @@ -267,7 +290,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    				else if (is_swap_pte(pte))
>>>>    					pte = pte_swp_clear_uffd_wp(pte);
>>>>    			}
>>>> -			set_pte_at(mm, new_addr, new_ptep, pte);
>>>> +			set_ptes(mm, new_addr, new_ptep, pte, nr);
>>>>    		}
>>>>    	}
>>>>
>>>> --
>>>> 2.30.2
>>>>
>>



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/2] Optimize mremap() for large folios
  2025-05-07  6:02 [PATCH v2 0/2] Optimize mremap() for large folios Dev Jain
  2025-05-07  6:02 ` [PATCH v2 1/2] mm: Call pointers to ptes as ptep Dev Jain
  2025-05-07  6:02 ` [PATCH v2 2/2] mm: Optimize mremap() by PTE batching Dev Jain
@ 2025-05-08 18:35 ` Lorenzo Stoakes
  2025-05-09  5:27   ` Dev Jain
  2 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-08 18:35 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

Dev - a general comment here - but let's slow things down a little please
:)

The mprotect() version of this is still outstanding fixes and likely will
need quite a bit of checking before we can ensure it's stabilised.

And now we have this mremap() series as well which also has had quite a few
quite significant issues that have needed addressing.

So can we try to focus on one at a time, and really try to nail down the
series before moving on to the next?

We also have outstanding review on the v1, which has now been split, which
does happen sometimes but perhaps suggests that it'd work better if you
waited a couple days or such to ensure things are settled before sending a
new version when there's quite a bit of feedback?

This isn't a criticism really, sorry I don't mean to sound negative or such
- but this is more a process thing so we reviewers can keep up with things,
keep things rolling, and ensure you get your changes merged asap :)

Thanks, Lorenzo

On Wed, May 07, 2025 at 11:32:54AM +0530, Dev Jain wrote:
> Currently move_ptes() iterates through ptes one by one. If the underlying
> folio mapped by the ptes is large, we can process those ptes in a batch
> using folio_pte_batch(), thus clearing and setting the PTEs in one go.
> For arm64 specifically, this results in a 16x reduction in the number of
> ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
> through all 16 entries to collect a/d bits), and we also elide extra TLBIs
> through get_and_clear_full_ptes, replacing ptep_get_and_clear.
>
> Mapping 512K of memory, memsetting it, remapping it to src + 512K, and
> munmapping it 10,000 times, the average execution time reduces from 1.9 to
> 1.2 seconds, giving a 37% performance optimization, on Apple M3 (arm64).
>
> Test program for reference:
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <errno.h>
>
> #define SIZE (1UL << 20) // 512 KB
>
> int main(void) {
>     void *new_addr, *addr;
>
>     for (int i = 0; i < 10000; ++i) {
>         addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
>                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (addr == MAP_FAILED) {
>                 perror("mmap");
>                 return 1;
>         }
>         memset(addr, 0xAA, SIZE);
>
>         new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
>         if (new_addr != (addr + SIZE)) {
>                 perror("mremap");
>                 return 1;
>         }
>         munmap(new_addr, SIZE);
>     }
>
> }
>
> v1->v2:
>  - Expand patch descriptions, move pte declarations to a new line,
>    reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
>    fix loop iteration (Lorenzo)
>  - Merge patch 2 and 3 (Anshuman, Lorenzo)
>  - Fix maybe_contiguous_pte_pfns (Willy)
>
> Dev Jain (2):
>   mm: Call pointers to ptes as ptep
>   mm: Optimize mremap() by PTE batching
>
>  include/linux/pgtable.h | 29 ++++++++++++++++++++++
>  mm/mremap.c             | 54 +++++++++++++++++++++++++++++------------
>  2 files changed, 68 insertions(+), 15 deletions(-)
>
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/2] Optimize mremap() for large folios
  2025-05-08 18:35 ` [PATCH v2 0/2] Optimize mremap() for large folios Lorenzo Stoakes
@ 2025-05-09  5:27   ` Dev Jain
  2025-05-09  8:44     ` Lorenzo Stoakes
  0 siblings, 1 reply; 23+ messages in thread
From: Dev Jain @ 2025-05-09  5:27 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd



On 09/05/25 12:05 am, Lorenzo Stoakes wrote:
> Dev - a general comment here - but let's slow things down a little please
> :)
> 
> The mprotect() version of this is still outstanding fixes and likely will
> need quite a bit of checking before we can ensure it's stabilised.
> 
> And now we have this mremap() series as well which also has had quite a few
> quite significant issues that have needed addressing.
> 
> So can we try to focus on one at a time, and really try to nail down the
> series before moving on to the next?
> 
> We also have outstanding review on the v1, which has now been split, which
> does happen sometimes but perhaps suggests that it'd work better if you
> waited a couple days or such to ensure things are settled before sending a
> new version when there's quite a bit of feedback?

Sure, I should have waited my bad, I usually do, this time I was in a 
haste with both series for no reason :( thanks for your detailed replies 
btw!

> 
> This isn't a criticism really, sorry I don't mean to sound negative or such
> - but this is more a process thing so we reviewers can keep up with things,
> keep things rolling, and ensure you get your changes merged asap :)
> 
> Thanks, Lorenzo
> 
> On Wed, May 07, 2025 at 11:32:54AM +0530, Dev Jain wrote:
>> Currently move_ptes() iterates through ptes one by one. If the underlying
>> folio mapped by the ptes is large, we can process those ptes in a batch
>> using folio_pte_batch(), thus clearing and setting the PTEs in one go.
>> For arm64 specifically, this results in a 16x reduction in the number of
>> ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
>> through all 16 entries to collect a/d bits), and we also elide extra TLBIs
>> through get_and_clear_full_ptes, replacing ptep_get_and_clear.
>>
>> Mapping 512K of memory, memsetting it, remapping it to src + 512K, and
>> munmapping it 10,000 times, the average execution time reduces from 1.9 to
>> 1.2 seconds, giving a 37% performance optimization, on Apple M3 (arm64).
>>
>> Test program for reference:
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>> #include <string.h>
>> #include <errno.h>
>>
>> #define SIZE (1UL << 20) // 512 KB
>>
>> int main(void) {
>>      void *new_addr, *addr;
>>
>>      for (int i = 0; i < 10000; ++i) {
>>          addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
>>                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>          if (addr == MAP_FAILED) {
>>                  perror("mmap");
>>                  return 1;
>>          }
>>          memset(addr, 0xAA, SIZE);
>>
>>          new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
>>          if (new_addr != (addr + SIZE)) {
>>                  perror("mremap");
>>                  return 1;
>>          }
>>          munmap(new_addr, SIZE);
>>      }
>>
>> }
>>
>> v1->v2:
>>   - Expand patch descriptions, move pte declarations to a new line,
>>     reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
>>     fix loop iteration (Lorenzo)
>>   - Merge patch 2 and 3 (Anshuman, Lorenzo)
>>   - Fix maybe_contiguous_pte_pfns (Willy)
>>
>> Dev Jain (2):
>>    mm: Call pointers to ptes as ptep
>>    mm: Optimize mremap() by PTE batching
>>
>>   include/linux/pgtable.h | 29 ++++++++++++++++++++++
>>   mm/mremap.c             | 54 +++++++++++++++++++++++++++++------------
>>   2 files changed, 68 insertions(+), 15 deletions(-)
>>
>> --
>> 2.30.2
>>



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/2] Optimize mremap() for large folios
  2025-05-09  5:27   ` Dev Jain
@ 2025-05-09  8:44     ` Lorenzo Stoakes
  2025-05-09  9:26       ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-09  8:44 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Fri, May 09, 2025 at 10:57:05AM +0530, Dev Jain wrote:
>
>
> On 09/05/25 12:05 am, Lorenzo Stoakes wrote:
> > Dev - a general comment here - but let's slow things down a little please
> > :)
> >
> > The mprotect() version of this is still outstanding fixes and likely will
> > need quite a bit of checking before we can ensure it's stabilised.
> >
> > And now we have this mremap() series as well which also has had quite a few
> > quite significant issues that have needed addressing.
> >
> > So can we try to focus on one at a time, and really try to nail down the
> > series before moving on to the next?
> >
> > We also have outstanding review on the v1, which has now been split, which
> > does happen sometimes but perhaps suggests that it'd work better if you
> > waited a couple days or such to ensure things are settled before sending a
> > new version when there's quite a bit of feedback?
>
> Sure, I should have waited my bad, I usually do, this time I was in a haste
> with both series for no reason :( thanks for your detailed replies btw!

No problem, I've done this in the past myself haha (sometimes even faster
than you did :P), just a case of tweaking things to fit the mm process :)

Cheers, Lorenzo

>
> >
> > This isn't a criticism really, sorry I don't mean to sound negative or such
> > - but this is more a process thing so we reviewers can keep up with things,
> > keep things rolling, and ensure you get your changes merged asap :)
> >
> > Thanks, Lorenzo
> >
> > On Wed, May 07, 2025 at 11:32:54AM +0530, Dev Jain wrote:
> > > Currently move_ptes() iterates through ptes one by one. If the underlying
> > > folio mapped by the ptes is large, we can process those ptes in a batch
> > > using folio_pte_batch(), thus clearing and setting the PTEs in one go.
> > > For arm64 specifically, this results in a 16x reduction in the number of
> > > ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
> > > through all 16 entries to collect a/d bits), and we also elide extra TLBIs
> > > through get_and_clear_full_ptes, replacing ptep_get_and_clear.
> > >
> > > Mapping 512K of memory, memsetting it, remapping it to src + 512K, and
> > > munmapping it 10,000 times, the average execution time reduces from 1.9 to
> > > 1.2 seconds, giving a 37% performance optimization, on Apple M3 (arm64).
> > >
> > > Test program for reference:
> > >
> > > #define _GNU_SOURCE
> > > #include <stdio.h>
> > > #include <stdlib.h>
> > > #include <unistd.h>
> > > #include <sys/mman.h>
> > > #include <string.h>
> > > #include <errno.h>
> > >
> > > #define SIZE (1UL << 20) // 512 KB
> > >
> > > int main(void) {
> > >      void *new_addr, *addr;
> > >
> > >      for (int i = 0; i < 10000; ++i) {
> > >          addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
> > >                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > >          if (addr == MAP_FAILED) {
> > >                  perror("mmap");
> > >                  return 1;
> > >          }
> > >          memset(addr, 0xAA, SIZE);
> > >
> > >          new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
> > >          if (new_addr != (addr + SIZE)) {
> > >                  perror("mremap");
> > >                  return 1;
> > >          }
> > >          munmap(new_addr, SIZE);
> > >      }
> > >
> > > }
> > >
> > > v1->v2:
> > >   - Expand patch descriptions, move pte declarations to a new line,
> > >     reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
> > >     fix loop iteration (Lorenzo)
> > >   - Merge patch 2 and 3 (Anshuman, Lorenzo)
> > >   - Fix maybe_contiguous_pte_pfns (Willy)
> > >
> > > Dev Jain (2):
> > >    mm: Call pointers to ptes as ptep
> > >    mm: Optimize mremap() by PTE batching
> > >
> > >   include/linux/pgtable.h | 29 ++++++++++++++++++++++
> > >   mm/mremap.c             | 54 +++++++++++++++++++++++++++++------------
> > >   2 files changed, 68 insertions(+), 15 deletions(-)
> > >
> > > --
> > > 2.30.2
> > >
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/2] Optimize mremap() for large folios
  2025-05-09  8:44     ` Lorenzo Stoakes
@ 2025-05-09  9:26       ` David Hildenbrand
  0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2025-05-09  9:26 UTC (permalink / raw)
  To: Lorenzo Stoakes, Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang,
	baolin.wang, ziy, hughd

On 09.05.25 10:44, Lorenzo Stoakes wrote:
> On Fri, May 09, 2025 at 10:57:05AM +0530, Dev Jain wrote:
>>
>>
>> On 09/05/25 12:05 am, Lorenzo Stoakes wrote:
>>> Dev - a general comment here - but let's slow things down a little please
>>> :)
>>>
>>> The mprotect() version of this is still outstanding fixes and likely will
>>> need quite a bit of checking before we can ensure it's stabilised.
>>>
>>> And now we have this mremap() series as well which also has had quite a few
>>> quite significant issues that have needed addressing.
>>>
>>> So can we try to focus on one at a time, and really try to nail down the
>>> series before moving on to the next?
>>>
>>> We also have outstanding review on the v1, which has now been split, which
>>> does happen sometimes but perhaps suggests that it'd work better if you
>>> waited a couple days or such to ensure things are settled before sending a
>>> new version when there's quite a bit of feedback?
>>
>> Sure, I should have waited my bad, I usually do, this time I was in a haste
>> with both series for no reason :( thanks for your detailed replies btw!
> 
> No problem, I've done this in the past myself haha (sometimes even faster
> than you did :P), just a case of tweaking things to fit the mm process :)

We've all been there, desperately wanting to flush the todo list ... :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-05-20  9:21 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-07  6:02 [PATCH v2 0/2] Optimize mremap() for large folios Dev Jain
2025-05-07  6:02 ` [PATCH v2 1/2] mm: Call pointers to ptes as ptep Dev Jain
2025-05-08  1:05   ` Barry Song
2025-05-08  6:21   ` Anshuman Khandual
2025-05-08  9:21   ` Lorenzo Stoakes
2025-05-07  6:02 ` [PATCH v2 2/2] mm: Optimize mremap() by PTE batching Dev Jain
2025-05-08  1:37   ` Andrew Morton
2025-05-08  4:53     ` Lorenzo Stoakes
2025-05-08  2:00   ` Zi Yan
2025-05-08  4:01     ` Dev Jain
2025-05-08  6:31     ` Anshuman Khandual
2025-05-08  7:16   ` Anshuman Khandual
2025-05-08  8:05     ` Dev Jain
2025-05-08  9:40     ` Lorenzo Stoakes
2025-05-08 10:04   ` Lorenzo Stoakes
2025-05-08 18:01     ` David Hildenbrand
2025-05-18  8:17     ` Dev Jain
2025-05-19  9:04       ` Lorenzo Stoakes
2025-05-20  9:21         ` Dev Jain
2025-05-08 18:35 ` [PATCH v2 0/2] Optimize mremap() for large folios Lorenzo Stoakes
2025-05-09  5:27   ` Dev Jain
2025-05-09  8:44     ` Lorenzo Stoakes
2025-05-09  9:26       ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).