linux-mm.kvack.org archive mirror
* [PATCH v4 0/2] Optimize mremap() for large folios
@ 2025-06-10  3:50 Dev Jain
  2025-06-10  3:50 ` [PATCH v4 1/2] mm: Call pointers to ptes as ptep Dev Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Dev Jain @ 2025-06-10  3:50 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd, Dev Jain

Currently move_ptes() iterates through ptes one by one. If the underlying
folio mapped by the ptes is large, we can process those ptes in a batch
using folio_pte_batch(), thus clearing and setting the PTEs in one go.
For arm64 specifically, this results in a 16x reduction in the number of
ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
through all 16 entries to collect a/d bits), and we also elide extra TLBIs
by using get_and_clear_full_ptes() in place of ptep_get_and_clear().

In a benchmark that maps 1M of memory with 64K folios, memsets it,
remaps it to src + 1M, and munmaps it, repeated 10,000 times, the
average execution time drops from 1.9 to 1.2 seconds on Apple M3
(arm64), a 37% reduction. No regression is observed for small folios.

The patchset is based on mm-unstable (6ebffe676fcf).

Test program for reference:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>

#define SIZE (1UL << 20) // 1M

int main(void) {
    void *new_addr, *addr;

    for (int i = 0; i < 10000; ++i) {
        addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(addr, 0xAA, SIZE);

        new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
        if (new_addr != (addr + SIZE)) {
                perror("mremap");
                return 1;
        }
        munmap(new_addr, SIZE);
    }

    return 0;
}
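
(Note: reproducing the 64K-folio numbers assumes a kernel with mTHP
support and 64K anonymous folios enabled, e.g. via
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled.)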

v3->v4:
 - Remove comment above mremap_folio_pte_batch, improve patch description
   differentiating between folio splitting and pagetable splitting
v2->v3:
 - Refactor mremap_folio_pte_batch, drop maybe_contiguous_pte_pfns, fix
   indentation (Lorenzo), fix cover letter description (512K -> 1M)

v1->v2:
 - Expand patch descriptions, move pte declarations to a new line,
   reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
   fix loop iteration (Lorenzo)
 - Merge patch 2 and 3 (Anshuman, Lorenzo)
 - Fix maybe_contiguous_pte_pfns (Willy)

Dev Jain (2):
  mm: Call pointers to ptes as ptep
  mm: Optimize mremap() by PTE batching

 mm/mremap.c | 58 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 16 deletions(-)

-- 
2.30.2




* [PATCH v4 1/2] mm: Call pointers to ptes as ptep
  2025-06-10  3:50 [PATCH v4 0/2] Optimize mremap() for large folios Dev Jain
@ 2025-06-10  3:50 ` Dev Jain
  2025-06-11 13:23   ` David Hildenbrand
  2025-06-12 12:05   ` Pedro Falcato
  2025-06-10  3:50 ` [PATCH v4 2/2] mm: Optimize mremap() by PTE batching Dev Jain
  2025-06-10 12:11 ` [PATCH v4 0/2] Optimize mremap() for large folios Lorenzo Stoakes
  2 siblings, 2 replies; 22+ messages in thread
From: Dev Jain @ 2025-06-10  3:50 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd, Dev Jain

Avoid confusion between pte_t* and pte_t data types by suffixing pointer
type variables with p. No functional change.

Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mremap.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 60f6b8d0d5f0..180b12225368 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -176,7 +176,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	struct vm_area_struct *vma = pmc->old;
 	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *old_pte, *new_pte, pte;
+	pte_t *old_ptep, *new_ptep;
+	pte_t pte;
 	pmd_t dummy_pmdval;
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
@@ -211,8 +212,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	 * We don't have to worry about the ordering of src and dst
 	 * pte locks because exclusive mmap_lock prevents deadlock.
 	 */
-	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
-	if (!old_pte) {
+	old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
+	if (!old_ptep) {
 		err = -EAGAIN;
 		goto out;
 	}
@@ -223,10 +224,10 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	 * mmap_lock, so this new_pte page is stable, so there is no need to get
 	 * pmdval and do pmd_same() check.
 	 */
-	new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
+	new_ptep = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
 					   &new_ptl);
-	if (!new_pte) {
-		pte_unmap_unlock(old_pte, old_ptl);
+	if (!new_ptep) {
+		pte_unmap_unlock(old_ptep, old_ptl);
 		err = -EAGAIN;
 		goto out;
 	}
@@ -235,14 +236,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
-	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
-				   new_pte++, new_addr += PAGE_SIZE) {
-		VM_WARN_ON_ONCE(!pte_none(*new_pte));
+	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
+				   new_ptep++, new_addr += PAGE_SIZE) {
+		VM_WARN_ON_ONCE(!pte_none(*new_ptep));
 
-		if (pte_none(ptep_get(old_pte)))
+		if (pte_none(ptep_get(old_ptep)))
 			continue;
 
-		pte = ptep_get_and_clear(mm, old_addr, old_pte);
+		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
 		/*
 		 * If we are remapping a valid PTE, make sure
 		 * to flush TLB before we drop the PTL for the
@@ -260,7 +261,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		pte = move_soft_dirty_pte(pte);
 
 		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
-			pte_clear(mm, new_addr, new_pte);
+			pte_clear(mm, new_addr, new_ptep);
 		else {
 			if (need_clear_uffd_wp) {
 				if (pte_present(pte))
@@ -268,7 +269,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 				else if (is_swap_pte(pte))
 					pte = pte_swp_clear_uffd_wp(pte);
 			}
-			set_pte_at(mm, new_addr, new_pte, pte);
+			set_pte_at(mm, new_addr, new_ptep, pte);
 		}
 	}
 
@@ -277,8 +278,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		flush_tlb_range(vma, old_end - len, old_end);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
-	pte_unmap(new_pte - 1);
-	pte_unmap_unlock(old_pte - 1, old_ptl);
+	pte_unmap(new_ptep - 1);
+	pte_unmap_unlock(old_ptep - 1, old_ptl);
 out:
 	if (pmc->need_rmap_locks)
 		drop_rmap_locks(vma);
-- 
2.30.2




* [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
  2025-06-10  3:50 [PATCH v4 0/2] Optimize mremap() for large folios Dev Jain
  2025-06-10  3:50 ` [PATCH v4 1/2] mm: Call pointers to ptes as ptep Dev Jain
@ 2025-06-10  3:50 ` Dev Jain
  2025-06-10  7:03   ` Barry Song
                     ` (4 more replies)
  2025-06-10 12:11 ` [PATCH v4 0/2] Optimize mremap() for large folios Lorenzo Stoakes
  2 siblings, 5 replies; 22+ messages in thread
From: Dev Jain @ 2025-06-10  3:50 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd, Dev Jain

Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
are painted with the contig bit, then ptep_get() will iterate through all 16
entries to collect a/d bits. Hence this optimization will result in a 16x
reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
will eventually call contpte_try_unfold() on every contig block, thus
flushing the TLB for the complete large folio range. Instead, use
get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
do them on the starting and ending contig block.

For split folios, there will be no pte batching; nr_ptes will be 1. For
pagetable splitting, the ptes will still point to the same large folio;
for arm64, this results in the optimization described above, and for other
arches (including the general case), a minor improvement is expected due to
a reduction in the number of function calls.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mremap.c | 39 ++++++++++++++++++++++++++++++++-------
 1 file changed, 32 insertions(+), 7 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 180b12225368..18b215521ada 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
 	return pte;
 }
 
+static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
+		pte_t *ptep, pte_t pte, int max_nr)
+{
+	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+	struct folio *folio;
+
+	if (max_nr == 1)
+		return 1;
+
+	folio = vm_normal_folio(vma, addr, pte);
+	if (!folio || !folio_test_large(folio))
+		return 1;
+
+	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
+			       NULL, NULL);
+}
+
 static int move_ptes(struct pagetable_move_control *pmc,
 		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
 {
@@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_ptep, *new_ptep;
-	pte_t pte;
+	pte_t old_pte, pte;
 	pmd_t dummy_pmdval;
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
@@ -185,6 +202,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	unsigned long new_addr = pmc->new_addr;
 	unsigned long old_end = old_addr + extent;
 	unsigned long len = old_end - old_addr;
+	int max_nr_ptes;
+	int nr_ptes;
 	int err = 0;
 
 	/*
@@ -236,14 +255,16 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
-	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
-				   new_ptep++, new_addr += PAGE_SIZE) {
+	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
+		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
 		VM_WARN_ON_ONCE(!pte_none(*new_ptep));
 
-		if (pte_none(ptep_get(old_ptep)))
+		nr_ptes = 1;
+		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
+		old_pte = ptep_get(old_ptep);
+		if (pte_none(old_pte))
 			continue;
 
-		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
 		/*
 		 * If we are remapping a valid PTE, make sure
 		 * to flush TLB before we drop the PTL for the
@@ -255,8 +276,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		 * the TLB entry for the old mapping has been
 		 * flushed.
 		 */
-		if (pte_present(pte))
+		if (pte_present(old_pte)) {
+			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
+							 old_pte, max_nr_ptes);
 			force_flush = true;
+		}
+		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
 		pte = move_pte(pte, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);
 
@@ -269,7 +294,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 				else if (is_swap_pte(pte))
 					pte = pte_swp_clear_uffd_wp(pte);
 			}
-			set_pte_at(mm, new_addr, new_ptep, pte);
+			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
 		}
 	}
 
-- 
2.30.2




* Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
  2025-06-10  3:50 ` [PATCH v4 2/2] mm: Optimize mremap() by PTE batching Dev Jain
@ 2025-06-10  7:03   ` Barry Song
  2025-06-10  7:44     ` Dev Jain
  2025-06-10  8:37   ` Barry Song
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Barry Song @ 2025-06-10  7:03 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, david, peterx, ryan.roberts, mingo,
	libang.li, maobibo, zhengqi.arch, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

Hi Dev,

On Tue, Jun 10, 2025 at 3:51 PM Dev Jain <dev.jain@arm.com> wrote:
>
> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all 16
> entries to collect a/d bits. Hence this optimization will result in a 16x
> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> will eventually call contpte_try_unfold() on every contig block, thus
> flushing the TLB for the complete large folio range. Instead, use
> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> do them on the starting and ending contig block.
>
> For split folios, there will be no pte batching; nr_ptes will be 1. For
> pagetable splitting, the ptes will still point to the same large folio;
> for arm64, this results in the optimization described above, and for other
> arches (including the general case), a minor improvement is expected due to
> a reduction in the number of function calls.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mremap.c | 39 ++++++++++++++++++++++++++++++++-------
>  1 file changed, 32 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 180b12225368..18b215521ada 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>         return pte;
>  }
>
> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> +               pte_t *ptep, pte_t pte, int max_nr)
> +{
> +       const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +       struct folio *folio;
> +
> +       if (max_nr == 1)
> +               return 1;
> +
> +       folio = vm_normal_folio(vma, addr, pte);
> +       if (!folio || !folio_test_large(folio))

I'm curious about the following case:
If the addr/ptep is not the first subpage of the folio—for example, the
14th subpage—will mremap_folio_pte_batch() return 3?
If so, get_and_clear_full_ptes() would operate on 3 subpages of the folio.
In that case, can unfold still work correctly?

Similarly, if the addr/ptep points to the first subpage, but max_nr is
less than CONT_PTES, what will happen in that case?


> +               return 1;
> +
> +       return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> +                              NULL, NULL);
> +}
> +
>  static int move_ptes(struct pagetable_move_control *pmc,
>                 unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>  {
> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>         bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>         struct mm_struct *mm = vma->vm_mm;
>         pte_t *old_ptep, *new_ptep;
> -       pte_t pte;
> +       pte_t old_pte, pte;
>         pmd_t dummy_pmdval;
>         spinlock_t *old_ptl, *new_ptl;
>         bool force_flush = false;
> @@ -185,6 +202,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>         unsigned long new_addr = pmc->new_addr;
>         unsigned long old_end = old_addr + extent;
>         unsigned long len = old_end - old_addr;
> +       int max_nr_ptes;
> +       int nr_ptes;
>         int err = 0;
>
>         /*
> @@ -236,14 +255,16 @@ static int move_ptes(struct pagetable_move_control *pmc,
>         flush_tlb_batched_pending(vma->vm_mm);
>         arch_enter_lazy_mmu_mode();
>
> -       for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> -                                  new_ptep++, new_addr += PAGE_SIZE) {
> +       for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
> +               new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
>                 VM_WARN_ON_ONCE(!pte_none(*new_ptep));
>
> -               if (pte_none(ptep_get(old_ptep)))
> +               nr_ptes = 1;
> +               max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
> +               old_pte = ptep_get(old_ptep);
> +               if (pte_none(old_pte))
>                         continue;
>
> -               pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>                 /*
>                  * If we are remapping a valid PTE, make sure
>                  * to flush TLB before we drop the PTL for the
> @@ -255,8 +276,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>                  * the TLB entry for the old mapping has been
>                  * flushed.
>                  */
> -               if (pte_present(pte))
> +               if (pte_present(old_pte)) {
> +                       nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> +                                                        old_pte, max_nr_ptes);
>                         force_flush = true;
> +               }
> +               pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
>                 pte = move_pte(pte, old_addr, new_addr);
>                 pte = move_soft_dirty_pte(pte);
>
> @@ -269,7 +294,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>                                 else if (is_swap_pte(pte))
>                                         pte = pte_swp_clear_uffd_wp(pte);
>                         }
> -                       set_pte_at(mm, new_addr, new_ptep, pte);
> +                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>                 }
>         }
>
> --
> 2.30.2
>

Thanks
Barry



* Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
  2025-06-10  7:03   ` Barry Song
@ 2025-06-10  7:44     ` Dev Jain
  2025-06-10  8:11       ` Barry Song
  0 siblings, 1 reply; 22+ messages in thread
From: Dev Jain @ 2025-06-10  7:44 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, david, peterx, ryan.roberts, mingo,
	libang.li, maobibo, zhengqi.arch, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd


On 10/06/25 12:33 pm, Barry Song wrote:
> Hi Dev,
>
> On Tue, Jun 10, 2025 at 3:51 PM Dev Jain <dev.jain@arm.com> wrote:
>> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>> are painted with the contig bit, then ptep_get() will iterate through all 16
>> entries to collect a/d bits. Hence this optimization will result in a 16x
>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>> will eventually call contpte_try_unfold() on every contig block, thus
>> flushing the TLB for the complete large folio range. Instead, use
>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>> do them on the starting and ending contig block.
>>
>> For split folios, there will be no pte batching; nr_ptes will be 1. For
>> pagetable splitting, the ptes will still point to the same large folio;
>> for arm64, this results in the optimization described above, and for other
>> arches (including the general case), a minor improvement is expected due to
>> a reduction in the number of function calls.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mremap.c | 39 ++++++++++++++++++++++++++++++++-------
>>   1 file changed, 32 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 180b12225368..18b215521ada 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>          return pte;
>>   }
>>
>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>> +               pte_t *ptep, pte_t pte, int max_nr)
>> +{
>> +       const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +       struct folio *folio;
>> +
>> +       if (max_nr == 1)
>> +               return 1;
>> +
>> +       folio = vm_normal_folio(vma, addr, pte);
>> +       if (!folio || !folio_test_large(folio))
> I'm curious about the following case:
> If the addr/ptep is not the first subpage of the folio—for example, the
> 14th subpage—will mremap_folio_pte_batch() return 3?

It will return the number of PTEs, starting from the PTE pointing to the 14th
subpage, that point to consecutive pages of the same large folio, up to max_nr.
For example, if we are operating on a single large folio of order 4, then max_nr
will be 16 - 14 + 1 = 3. So in this case we will return 3, since the 14th, 15th and
16th PTEs point to consecutive pages of the same large folio.
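
As a toy userspace model of just the counting part (illustration only,
not the kernel helper; the order-4 folio and starting subpage are
assumptions from the example above):

/*
 * The batch is capped both by the number of subpages left in the folio
 * and by max_nr, the number of PTEs left in the mremap range.
 * folio_pte_batch() additionally checks that the PTEs really are
 * consecutive and map the same folio.
 */
#include <stdio.h>

static int toy_batch(int first_subpage, int folio_nr_pages, int max_nr)
{
	int left_in_folio = folio_nr_pages - first_subpage;

	return max_nr < left_in_folio ? max_nr : left_in_folio;
}

int main(void)
{
	/* order-4 folio (16 pages), range starting at the 14th subpage (index 13) */
	printf("%d\n", toy_batch(13, 16, 3));	/* prints 3 */
	return 0;
}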

> If so, get_and_clear_full_ptes() would operate on 3 subpages of the folio.
> In that case, can unfold still work correctly?

Yes. First we unfold, i.e. we do a BBM sequence: cont -> clear -> non-cont.
Then, on this now non-contig block, we clear only the PTEs that we were
asked to clear.

>
> Similarly, if the addr/ptep points to the first subpage, but max_nr is
> less than CONT_PTES, what will happen in that case?
>
>
>> +               return 1;
>> +
>> +       return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
>> +                              NULL, NULL);
>> +}
>> +
>>   static int move_ptes(struct pagetable_move_control *pmc,
>>                  unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>>   {
>> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>          bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>          struct mm_struct *mm = vma->vm_mm;
>>          pte_t *old_ptep, *new_ptep;
>> -       pte_t pte;
>> +       pte_t old_pte, pte;
>>          pmd_t dummy_pmdval;
>>          spinlock_t *old_ptl, *new_ptl;
>>          bool force_flush = false;
>> @@ -185,6 +202,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>          unsigned long new_addr = pmc->new_addr;
>>          unsigned long old_end = old_addr + extent;
>>          unsigned long len = old_end - old_addr;
>> +       int max_nr_ptes;
>> +       int nr_ptes;
>>          int err = 0;
>>
>>          /*
>> @@ -236,14 +255,16 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>          flush_tlb_batched_pending(vma->vm_mm);
>>          arch_enter_lazy_mmu_mode();
>>
>> -       for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>> -                                  new_ptep++, new_addr += PAGE_SIZE) {
>> +       for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
>> +               new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
>>                  VM_WARN_ON_ONCE(!pte_none(*new_ptep));
>>
>> -               if (pte_none(ptep_get(old_ptep)))
>> +               nr_ptes = 1;
>> +               max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>> +               old_pte = ptep_get(old_ptep);
>> +               if (pte_none(old_pte))
>>                          continue;
>>
>> -               pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>                  /*
>>                   * If we are remapping a valid PTE, make sure
>>                   * to flush TLB before we drop the PTL for the
>> @@ -255,8 +276,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>                   * the TLB entry for the old mapping has been
>>                   * flushed.
>>                   */
>> -               if (pte_present(pte))
>> +               if (pte_present(old_pte)) {
>> +                       nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>> +                                                        old_pte, max_nr_ptes);
>>                          force_flush = true;
>> +               }
>> +               pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
>>                  pte = move_pte(pte, old_addr, new_addr);
>>                  pte = move_soft_dirty_pte(pte);
>>
>> @@ -269,7 +294,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>                                  else if (is_swap_pte(pte))
>>                                          pte = pte_swp_clear_uffd_wp(pte);
>>                          }
>> -                       set_pte_at(mm, new_addr, new_ptep, pte);
>> +                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>>                  }
>>          }
>>
>> --
>> 2.30.2
>>
> Thanks
> Barry



* Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
  2025-06-10  7:44     ` Dev Jain
@ 2025-06-10  8:11       ` Barry Song
  2025-06-16 21:27         ` Ryan Roberts
  0 siblings, 1 reply; 22+ messages in thread
From: Barry Song @ 2025-06-10  8:11 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, david, peterx, ryan.roberts, mingo,
	libang.li, maobibo, zhengqi.arch, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Tue, Jun 10, 2025 at 7:45 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 10/06/25 12:33 pm, Barry Song wrote:
> > Hi Dev,
> >
> > On Tue, Jun 10, 2025 at 3:51 PM Dev Jain <dev.jain@arm.com> wrote:
> >> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> >> are painted with the contig bit, then ptep_get() will iterate through all 16
> >> entries to collect a/d bits. Hence this optimization will result in a 16x
> >> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> >> will eventually call contpte_try_unfold() on every contig block, thus
> >> flushing the TLB for the complete large folio range. Instead, use
> >> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> >> do them on the starting and ending contig block.
> >>
> >> For split folios, there will be no pte batching; nr_ptes will be 1. For
> >> pagetable splitting, the ptes will still point to the same large folio;
> >> for arm64, this results in the optimization described above, and for other
> >> arches (including the general case), a minor improvement is expected due to
> >> a reduction in the number of function calls.
> >>
> >> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >> ---
> >>   mm/mremap.c | 39 ++++++++++++++++++++++++++++++++-------
> >>   1 file changed, 32 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/mm/mremap.c b/mm/mremap.c
> >> index 180b12225368..18b215521ada 100644
> >> --- a/mm/mremap.c
> >> +++ b/mm/mremap.c
> >> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
> >>          return pte;
> >>   }
> >>
> >> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> >> +               pte_t *ptep, pte_t pte, int max_nr)
> >> +{
> >> +       const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> >> +       struct folio *folio;
> >> +
> >> +       if (max_nr == 1)
> >> +               return 1;
> >> +
> >> +       folio = vm_normal_folio(vma, addr, pte);
> >> +       if (!folio || !folio_test_large(folio))
> > I'm curious about the following case:
> > If the addr/ptep is not the first subpage of the folio—for example, the
> > 14th subpage—will mremap_folio_pte_batch() return 3?
>
> It will return the number of PTEs, starting from the PTE pointing to the 14th
> subpage, that point to consecutive pages of the same large folio, up till max_nr.
> For an example, if we are operating on a single large folio of order 4, then max_nr
> will be 16 - 14 + 1 = 3. So in this case we will return 3, since the 14th, 15th and
> 16th PTE point to consec pages of the same large folio.
>
> > If so, get_and_clear_full_ptes() would operate on 3 subpages of the folio.
> > In that case, can unfold still work correctly?
>
> Yes, first we unfold as in, we do a BBM sequence: cont -> clear -> non-cont.
> Then, on this non-contig block, we will clear only the PTEs which were asked
> for us to do.

While going through the code,

static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
                                unsigned long addr, pte_t *ptep,
                                unsigned int nr, int full)
{
        pte_t pte;
        if (likely(nr == 1)) {
                contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
                pte = __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
        } else {
                pte = contpte_get_and_clear_full_ptes(mm, addr, ptep, nr, full);
        }

        return pte;
}

Initially, I thought it only unfolded when nr == 1, but after reading
contpte_get_and_clear_full_ptes more closely, I realized we do
support partial unfolding—that's what I had missed.

pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
                                unsigned long addr, pte_t *ptep,
                                unsigned int nr, int full)
{
        contpte_try_unfold_partial(mm, addr, ptep, nr);
        return __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
}

I think you are right.

>
> >
> > Similarly, if the addr/ptep points to the first subpage, but max_nr is
> > less than CONT_PTES, what will happen in that case?
> >
> >
> >> +               return 1;
> >> +
> >> +       return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> >> +                              NULL, NULL);
> >> +}
> >> +
> >>   static int move_ptes(struct pagetable_move_control *pmc,
> >>                  unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
> >>   {
> >> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >>          bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
> >>          struct mm_struct *mm = vma->vm_mm;
> >>          pte_t *old_ptep, *new_ptep;
> >> -       pte_t pte;
> >> +       pte_t old_pte, pte;
> >>          pmd_t dummy_pmdval;
> >>          spinlock_t *old_ptl, *new_ptl;
> >>          bool force_flush = false;
> >> @@ -185,6 +202,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >>          unsigned long new_addr = pmc->new_addr;
> >>          unsigned long old_end = old_addr + extent;
> >>          unsigned long len = old_end - old_addr;
> >> +       int max_nr_ptes;
> >> +       int nr_ptes;
> >>          int err = 0;
> >>
> >>          /*
> >> @@ -236,14 +255,16 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >>          flush_tlb_batched_pending(vma->vm_mm);
> >>          arch_enter_lazy_mmu_mode();
> >>
> >> -       for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> >> -                                  new_ptep++, new_addr += PAGE_SIZE) {
> >> +       for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
> >> +               new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
> >>                  VM_WARN_ON_ONCE(!pte_none(*new_ptep));
> >>
> >> -               if (pte_none(ptep_get(old_ptep)))
> >> +               nr_ptes = 1;
> >> +               max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
> >> +               old_pte = ptep_get(old_ptep);
> >> +               if (pte_none(old_pte))
> >>                          continue;
> >>
> >> -               pte = ptep_get_and_clear(mm, old_addr, old_ptep);
> >>                  /*
> >>                   * If we are remapping a valid PTE, make sure
> >>                   * to flush TLB before we drop the PTL for the
> >> @@ -255,8 +276,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >>                   * the TLB entry for the old mapping has been
> >>                   * flushed.
> >>                   */
> >> -               if (pte_present(pte))
> >> +               if (pte_present(old_pte)) {
> >> +                       nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> >> +                                                        old_pte, max_nr_ptes);
> >>                          force_flush = true;
> >> +               }
> >> +               pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
> >>                  pte = move_pte(pte, old_addr, new_addr);
> >>                  pte = move_soft_dirty_pte(pte);
> >>
> >> @@ -269,7 +294,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >>                                  else if (is_swap_pte(pte))
> >>                                          pte = pte_swp_clear_uffd_wp(pte);
> >>                          }
> >> -                       set_pte_at(mm, new_addr, new_ptep, pte);
> >> +                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
> >>                  }
> >>          }
> >>
> >> --
> >> 2.30.2

Thanks
Barry



* Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
  2025-06-10  3:50 ` [PATCH v4 2/2] mm: Optimize mremap() by PTE batching Dev Jain
  2025-06-10  7:03   ` Barry Song
@ 2025-06-10  8:37   ` Barry Song
  2025-06-10 13:18   ` Lorenzo Stoakes
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2025-06-10  8:37 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, david, peterx, ryan.roberts, mingo,
	libang.li, maobibo, zhengqi.arch, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Tue, Jun 10, 2025 at 3:51 PM Dev Jain <dev.jain@arm.com> wrote:
>
> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all 16
> entries to collect a/d bits. Hence this optimization will result in a 16x
> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> will eventually call contpte_try_unfold() on every contig block, thus
> flushing the TLB for the complete large folio range. Instead, use
> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> do them on the starting and ending contig block.
>
> For split folios, there will be no pte batching; nr_ptes will be 1. For
> pagetable splitting, the ptes will still point to the same large folio;
> for arm64, this results in the optimization described above, and for other
> arches (including the general case), a minor improvement is expected due to
> a reduction in the number of function calls.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

The code appears correct to me:

Reviewed-by: Barry Song <baohua@kernel.org>

> ---
>  mm/mremap.c | 39 ++++++++++++++++++++++++++++++++-------
>  1 file changed, 32 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 180b12225368..18b215521ada 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>         return pte;
>  }
>
> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> +               pte_t *ptep, pte_t pte, int max_nr)
> +{
> +       const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +       struct folio *folio;
> +
> +       if (max_nr == 1)
> +               return 1;
> +
> +       folio = vm_normal_folio(vma, addr, pte);
> +       if (!folio || !folio_test_large(folio))
> +               return 1;
> +
> +       return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> +                              NULL, NULL);
> +}
> +
>  static int move_ptes(struct pagetable_move_control *pmc,
>                 unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>  {
> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>         bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>         struct mm_struct *mm = vma->vm_mm;
>         pte_t *old_ptep, *new_ptep;
> -       pte_t pte;
> +       pte_t old_pte, pte;
>         pmd_t dummy_pmdval;
>         spinlock_t *old_ptl, *new_ptl;
>         bool force_flush = false;
> @@ -185,6 +202,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>         unsigned long new_addr = pmc->new_addr;
>         unsigned long old_end = old_addr + extent;
>         unsigned long len = old_end - old_addr;
> +       int max_nr_ptes;
> +       int nr_ptes;
>         int err = 0;
>
>         /*
> @@ -236,14 +255,16 @@ static int move_ptes(struct pagetable_move_control *pmc,
>         flush_tlb_batched_pending(vma->vm_mm);
>         arch_enter_lazy_mmu_mode();
>
> -       for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> -                                  new_ptep++, new_addr += PAGE_SIZE) {
> +       for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
> +               new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
>                 VM_WARN_ON_ONCE(!pte_none(*new_ptep));
>
> -               if (pte_none(ptep_get(old_ptep)))
> +               nr_ptes = 1;
> +               max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
> +               old_pte = ptep_get(old_ptep);
> +               if (pte_none(old_pte))
>                         continue;
>
> -               pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>                 /*
>                  * If we are remapping a valid PTE, make sure
>                  * to flush TLB before we drop the PTL for the
> @@ -255,8 +276,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>                  * the TLB entry for the old mapping has been
>                  * flushed.
>                  */
> -               if (pte_present(pte))
> +               if (pte_present(old_pte)) {
> +                       nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> +                                                        old_pte, max_nr_ptes);
>                         force_flush = true;
> +               }
> +               pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
>                 pte = move_pte(pte, old_addr, new_addr);
>                 pte = move_soft_dirty_pte(pte);
>
> @@ -269,7 +294,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>                                 else if (is_swap_pte(pte))
>                                         pte = pte_swp_clear_uffd_wp(pte);
>                         }
> -                       set_pte_at(mm, new_addr, new_ptep, pte);
> +                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>                 }
>         }
>
> --
> 2.30.2
>

Thanks

Barry



* Re: [PATCH v4 0/2] Optimize mremap() for large folios
  2025-06-10  3:50 [PATCH v4 0/2] Optimize mremap() for large folios Dev Jain
  2025-06-10  3:50 ` [PATCH v4 1/2] mm: Call pointers to ptes as ptep Dev Jain
  2025-06-10  3:50 ` [PATCH v4 2/2] mm: Optimize mremap() by PTE batching Dev Jain
@ 2025-06-10 12:11 ` Lorenzo Stoakes
  2025-06-10 12:33   ` Dev Jain
  2 siblings, 1 reply; 22+ messages in thread
From: Lorenzo Stoakes @ 2025-06-10 12:11 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Tue, Jun 10, 2025 at 09:20:41AM +0530, Dev Jain wrote:
> Currently move_ptes() iterates through ptes one by one. If the underlying
> folio mapped by the ptes is large, we can process those ptes in a batch
> using folio_pte_batch(), thus clearing and setting the PTEs in one go.
> For arm64 specifically, this results in a 16x reduction in the number of
> ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
> through all 16 entries to collect a/d bits), and we also elide extra TLBIs
> through get_and_clear_full_ptes, replacing ptep_get_and_clear.

Thanks this is good!

>
> Mapping 1M of memory with 64K folios, memsetting it, remapping it to
> src + 1M, and munmapping it 10,000 times, the average execution time
> reduces from 1.9 to 1.2 seconds, giving a 37% performance optimization,
> on Apple M3 (arm64). No regression is observed for small folios.

Hmm, I thought people were struggling to get M3 to work with Asahi? :) or is
this in a mac-based vm? I've not paid attention to recent developments.

>
> The patchset is based on mm-unstable (6ebffe676fcf).
>
> Test program for reference:
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <errno.h>
>
> #define SIZE (1UL << 20) // 1M
>
> int main(void) {
>     void *new_addr, *addr;
>
>     for (int i = 0; i < 10000; ++i) {
>         addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
>                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (addr == MAP_FAILED) {
>                 perror("mmap");
>                 return 1;
>         }
>         memset(addr, 0xAA, SIZE);
>
>         new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
>         if (new_addr != (addr + SIZE)) {
>                 perror("mremap");
>                 return 1;
>         }
>         munmap(new_addr, SIZE);
>     }
>
> }
>

Thanks for including! Very useful.

> v3->v4:
>  - Remove comment above mremap_folio_pte_batch, improve patch description
>    differentiating between folio splitting and pagetable splitting
> v2->v3:
>  - Refactor mremap_folio_pte_batch, drop maybe_contiguous_pte_pfns, fix
>    indentation (Lorenzo), fix cover letter description (512K -> 1M)

It's nitty but these seem to be getting more and more abbreviated :) not a
massive big deal however ;)

>
> v1->v2:
>  - Expand patch descriptions, move pte declarations to a new line,
>    reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
>    fix loop iteration (Lorenzo)
>  - Merge patch 2 and 3 (Anshuman, Lorenzo)
>  - Fix maybe_contiguous_pte_pfns (Willy)
>
> Dev Jain (2):
>   mm: Call pointers to ptes as ptep
>   mm: Optimize mremap() by PTE batching
>
>  mm/mremap.c | 58 ++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 42 insertions(+), 16 deletions(-)
>
> --
> 2.30.2
>



* Re: [PATCH v4 0/2] Optimize mremap() for large folios
  2025-06-10 12:11 ` [PATCH v4 0/2] Optimize mremap() for large folios Lorenzo Stoakes
@ 2025-06-10 12:33   ` Dev Jain
  0 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-06-10 12:33 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd


On 10/06/25 5:41 pm, Lorenzo Stoakes wrote:
> On Tue, Jun 10, 2025 at 09:20:41AM +0530, Dev Jain wrote:
>> Currently move_ptes() iterates through ptes one by one. If the underlying
>> folio mapped by the ptes is large, we can process those ptes in a batch
>> using folio_pte_batch(), thus clearing and setting the PTEs in one go.
>> For arm64 specifically, this results in a 16x reduction in the number of
>> ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
>> through all 16 entries to collect a/d bits), and we also elide extra TLBIs
>> through get_and_clear_full_ptes, replacing ptep_get_and_clear.
> Thanks this is good!
>
>> Mapping 1M of memory with 64K folios, memsetting it, remapping it to
>> src + 1M, and munmapping it 10,000 times, the average execution time
>> reduces from 1.9 to 1.2 seconds, giving a 37% performance optimization,
>> on Apple M3 (arm64). No regression is observed for small folios.
> Hmm, I thought people were struggling to get M3 to work with Asahi? :) or is
> this in a mac-based vm? I've not paid attention to recent developments.

I meant a Linux VM on Mac.

>
>> The patchset is based on mm-unstable (6ebffe676fcf).
>>
>> Test program for reference:
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>> #include <string.h>
>> #include <errno.h>
>>
>> #define SIZE (1UL << 20) // 1M
>>
>> int main(void) {
>>      void *new_addr, *addr;
>>
>>      for (int i = 0; i < 10000; ++i) {
>>          addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
>>                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>          if (addr == MAP_FAILED) {
>>                  perror("mmap");
>>                  return 1;
>>          }
>>          memset(addr, 0xAA, SIZE);
>>
>>          new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
>>          if (new_addr != (addr + SIZE)) {
>>                  perror("mremap");
>>                  return 1;
>>          }
>>          munmap(new_addr, SIZE);
>>      }
>>
>> }
>>
> Thanks for including! Very useful.
>
>> v3->v4:
>>   - Remove comment above mremap_folio_pte_batch, improve patch description
>>     differentiating between folio splitting and pagetable splitting
>> v2->v3:
>>   - Refactor mremap_folio_pte_batch, drop maybe_contiguous_pte_pfns, fix
>>     indentation (Lorenzo), fix cover letter description (512K -> 1M)
> It's nitty but these seem to be getting more and more abbreviated :) not a
> massive big deal however ;)
>
>> v1->v2:
>>   - Expand patch descriptions, move pte declarations to a new line,
>>     reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
>>     fix loop iteration (Lorenzo)
>>   - Merge patch 2 and 3 (Anshuman, Lorenzo)
>>   - Fix maybe_contiguous_pte_pfns (Willy)
>>
>> Dev Jain (2):
>>    mm: Call pointers to ptes as ptep
>>    mm: Optimize mremap() by PTE batching
>>
>>   mm/mremap.c | 58 ++++++++++++++++++++++++++++++++++++++---------------
>>   1 file changed, 42 insertions(+), 16 deletions(-)
>>
>> --
>> 2.30.2
>>



* Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
  2025-06-10  3:50 ` [PATCH v4 2/2] mm: Optimize mremap() by PTE batching Dev Jain
  2025-06-10  7:03   ` Barry Song
  2025-06-10  8:37   ` Barry Song
@ 2025-06-10 13:18   ` Lorenzo Stoakes
  2025-06-11 14:00   ` David Hildenbrand
  2025-06-12 12:13   ` Pedro Falcato
  4 siblings, 0 replies; 22+ messages in thread
From: Lorenzo Stoakes @ 2025-06-10 13:18 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Tue, Jun 10, 2025 at 09:20:43AM +0530, Dev Jain wrote:
> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all 16
> entries to collect a/d bits. Hence this optimization will result in a 16x
> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> will eventually call contpte_try_unfold() on every contig block, thus
> flushing the TLB for the complete large folio range. Instead, use
> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> do them on the starting and ending contig block.
>
> For split folios, there will be no pte batching; nr_ptes will be 1. For
> pagetable splitting, the ptes will still point to the same large folio;
> for arm64, this results in the optimization described above, and for other
> arches (including the general case), a minor improvement is expected due to
> a reduction in the number of function calls.

This is a fantastic improvement on the commit message, thank you very much!

>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Looks good to me:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/mremap.c | 39 ++++++++++++++++++++++++++++++++-------
>  1 file changed, 32 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 180b12225368..18b215521ada 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>  	return pte;
>  }
>

Just for ease of reviewing the review :P

We dropped the comment here as agreed:

> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t pte, int max_nr)
> +{
> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +	struct folio *folio;
> +
> +	if (max_nr == 1)
> +		return 1;
> +
> +	folio = vm_normal_folio(vma, addr, pte);
> +	if (!folio || !folio_test_large(folio))
> +		return 1;
> +
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> +			       NULL, NULL);
> +}
> +
>  static int move_ptes(struct pagetable_move_control *pmc,
>  		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>  {
> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>  	struct mm_struct *mm = vma->vm_mm;
>  	pte_t *old_ptep, *new_ptep;
> -	pte_t pte;
> +	pte_t old_pte, pte;
>  	pmd_t dummy_pmdval;
>  	spinlock_t *old_ptl, *new_ptl;
>  	bool force_flush = false;
> @@ -185,6 +202,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	unsigned long new_addr = pmc->new_addr;
>  	unsigned long old_end = old_addr + extent;
>  	unsigned long len = old_end - old_addr;
> +	int max_nr_ptes;
> +	int nr_ptes;
>  	int err = 0;
>
>  	/*
> @@ -236,14 +255,16 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>
> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> -				   new_ptep++, new_addr += PAGE_SIZE) {
> +	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
> +		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
>  		VM_WARN_ON_ONCE(!pte_none(*new_ptep));

We added this, which does in fact seem sane. We absolutely should not expect to
see a new PTE here.

The uprobe logic that was recently addressed actually made it possible for
something to be there (ugh god), but that's now fixed.

>
> -		if (pte_none(ptep_get(old_ptep)))
> +		nr_ptes = 1;
> +		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
> +		old_pte = ptep_get(old_ptep);
> +		if (pte_none(old_pte))
>  			continue;
>
> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>  		/*
>  		 * If we are remapping a valid PTE, make sure
>  		 * to flush TLB before we drop the PTL for the
> @@ -255,8 +276,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		 * the TLB entry for the old mapping has been
>  		 * flushed.
>  		 */
> -		if (pte_present(pte))
> +		if (pte_present(old_pte)) {
> +			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> +							 old_pte, max_nr_ptes);
>  			force_flush = true;
> +		}
> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
>  		pte = move_pte(pte, old_addr, new_addr);
>  		pte = move_soft_dirty_pte(pte);
>
> @@ -269,7 +294,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  				else if (is_swap_pte(pte))
>  					pte = pte_swp_clear_uffd_wp(pte);
>  			}
> -			set_pte_at(mm, new_addr, new_ptep, pte);
> +			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>  		}
>  	}
>
> --
> 2.30.2
>



* Re: [PATCH v4 1/2] mm: Call pointers to ptes as ptep
  2025-06-10  3:50 ` [PATCH v4 1/2] mm: Call pointers to ptes as ptep Dev Jain
@ 2025-06-11 13:23   ` David Hildenbrand
  2025-06-11 13:25     ` Dev Jain
  2025-06-12 12:05   ` Pedro Falcato
  1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2025-06-11 13:23 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang,
	baolin.wang, ziy, hughd

On 10.06.25 05:50, Dev Jain wrote:
> Avoid confusion between pte_t* and pte_t data types by suffixing pointer
> type variables with p. No functional change.
> 
> Reviewed-by: Barry Song <baohua@kernel.org>
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>   mm/mremap.c | 31 ++++++++++++++++---------------
>   1 file changed, 16 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 60f6b8d0d5f0..180b12225368 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -176,7 +176,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>   	struct vm_area_struct *vma = pmc->old;
>   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>   	struct mm_struct *mm = vma->vm_mm;
> -	pte_t *old_pte, *new_pte, pte;
> +	pte_t *old_ptep, *new_ptep;
> +	pte_t pte;

Could have left that on the same line ...

>   	pmd_t dummy_pmdval;
>   	spinlock_t *old_ptl, *new_ptl;
>   	bool force_flush = false;
> @@ -211,8 +212,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>   	 * We don't have to worry about the ordering of src and dst
>   	 * pte locks because exclusive mmap_lock prevents deadlock.
>   	 */
> -	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> -	if (!old_pte) {
> +	old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> +	if (!old_ptep) {
>   		err = -EAGAIN;

One of those things that's completely inconsistent all over the place.

But yeah, I agree that ptep is much better for a PTE pointer.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb




* Re: [PATCH v4 1/2] mm: Call pointers to ptes as ptep
  2025-06-11 13:23   ` David Hildenbrand
@ 2025-06-11 13:25     ` Dev Jain
  2025-06-11 13:29       ` Lorenzo Stoakes
  0 siblings, 1 reply; 22+ messages in thread
From: Dev Jain @ 2025-06-11 13:25 UTC (permalink / raw)
  To: David Hildenbrand, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang,
	baolin.wang, ziy, hughd


On 11/06/25 6:53 pm, David Hildenbrand wrote:
> On 10.06.25 05:50, Dev Jain wrote:
>> Avoid confusion between pte_t* and pte_t data types by suffixing pointer
>> type variables with p. No functional change.
>>
>> Reviewed-by: Barry Song <baohua@kernel.org>
>> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mremap.c | 31 ++++++++++++++++---------------
>>   1 file changed, 16 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 60f6b8d0d5f0..180b12225368 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -176,7 +176,8 @@ static int move_ptes(struct 
>> pagetable_move_control *pmc,
>>       struct vm_area_struct *vma = pmc->old;
>>       bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>       struct mm_struct *mm = vma->vm_mm;
>> -    pte_t *old_pte, *new_pte, pte;
>> +    pte_t *old_ptep, *new_ptep;
>> +    pte_t pte;
>
> Could have left that on the same line ...

AFAIR Lorenzo had insisted on moving that to a new line.


>
>>       pmd_t dummy_pmdval;
>>       spinlock_t *old_ptl, *new_ptl;
>>       bool force_flush = false;
>> @@ -211,8 +212,8 @@ static int move_ptes(struct 
>> pagetable_move_control *pmc,
>>        * We don't have to worry about the ordering of src and dst
>>        * pte locks because exclusive mmap_lock prevents deadlock.
>>        */
>> -    old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
>> -    if (!old_pte) {
>> +    old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
>> +    if (!old_ptep) {
>>           err = -EAGAIN;
>
> One of those things that's completely inconsistent all over the place.
>
> But yeah, I agree that ptep is much better for a PTE pointer.
>
> Acked-by: David Hildenbrand <david@redhat.com>
>



* Re: [PATCH v4 1/2] mm: Call pointers to ptes as ptep
  2025-06-11 13:25     ` Dev Jain
@ 2025-06-11 13:29       ` Lorenzo Stoakes
  2025-06-11 13:31         ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Lorenzo Stoakes @ 2025-06-11 13:29 UTC (permalink / raw)
  To: Dev Jain
  Cc: David Hildenbrand, akpm, Liam.Howlett, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Wed, Jun 11, 2025 at 06:55:28PM +0530, Dev Jain wrote:
>
> On 11/06/25 6:53 pm, David Hildenbrand wrote:
> > On 10.06.25 05:50, Dev Jain wrote:
> > > Avoid confusion between pte_t* and pte_t data types by suffixing pointer
> > > type variables with p. No functional change.
> > >
> > > Reviewed-by: Barry Song <baohua@kernel.org>
> > > Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
> > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > >   mm/mremap.c | 31 ++++++++++++++++---------------
> > >   1 file changed, 16 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 60f6b8d0d5f0..180b12225368 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -176,7 +176,8 @@ static int move_ptes(struct
> > > pagetable_move_control *pmc,
> > >       struct vm_area_struct *vma = pmc->old;
> > >       bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
> > >       struct mm_struct *mm = vma->vm_mm;
> > > -    pte_t *old_pte, *new_pte, pte;
> > > +    pte_t *old_ptep, *new_ptep;
> > > +    pte_t pte;
> >
> > Could have left that on the same line ...
>
> AFAIR Lorenzo had insisted on moving that to a new line.

Yeah, not a fan of having pointer and non-pointer types declared on same line.
>
>
> >
> > >       pmd_t dummy_pmdval;
> > >       spinlock_t *old_ptl, *new_ptl;
> > >       bool force_flush = false;
> > > @@ -211,8 +212,8 @@ static int move_ptes(struct
> > > pagetable_move_control *pmc,
> > >        * We don't have to worry about the ordering of src and dst
> > >        * pte locks because exclusive mmap_lock prevents deadlock.
> > >        */
> > > -    old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> > > -    if (!old_pte) {
> > > +    old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> > > +    if (!old_ptep) {
> > >           err = -EAGAIN;
> >
> > One of those things that's completely inconsistent all over the place.
> >
> > But yeah, I agree that ptep is much better for a PTE pointer.
> >
> > Acked-by: David Hildenbrand <david@redhat.com>
> >


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 1/2] mm: Call pointers to ptes as ptep
  2025-06-11 13:29       ` Lorenzo Stoakes
@ 2025-06-11 13:31         ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2025-06-11 13:31 UTC (permalink / raw)
  To: Lorenzo Stoakes, Dev Jain
  Cc: akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang,
	baolin.wang, ziy, hughd

On 11.06.25 15:29, Lorenzo Stoakes wrote:
> On Wed, Jun 11, 2025 at 06:55:28PM +0530, Dev Jain wrote:
>>
>> On 11/06/25 6:53 pm, David Hildenbrand wrote:
>>> On 10.06.25 05:50, Dev Jain wrote:
>>>> Avoid confusion between pte_t* and pte_t data types by suffixing pointer
>>>> type variables with p. No functional change.
>>>>
>>>> Reviewed-by: Barry Song <baohua@kernel.org>
>>>> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>    mm/mremap.c | 31 ++++++++++++++++---------------
>>>>    1 file changed, 16 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/mm/mremap.c b/mm/mremap.c
>>>> index 60f6b8d0d5f0..180b12225368 100644
>>>> --- a/mm/mremap.c
>>>> +++ b/mm/mremap.c
>>>> @@ -176,7 +176,8 @@ static int move_ptes(struct
>>>> pagetable_move_control *pmc,
>>>>        struct vm_area_struct *vma = pmc->old;
>>>>        bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>>>        struct mm_struct *mm = vma->vm_mm;
>>>> -    pte_t *old_pte, *new_pte, pte;
>>>> +    pte_t *old_ptep, *new_ptep;
>>>> +    pte_t pte;
>>>
>>> Could have left that on the same line ...
>>
>> AFAIR Lorenzo had insisted on moving that to a new line.
> 
> Yeah, not a fan of having pointer and non-pointer types declared on same line.
Apparently I am for such simple things.

The joy of not having a common coding style. :)

Anyhow, I don't really care, just something I noticed ...

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
  2025-06-10  3:50 ` [PATCH v4 2/2] mm: Optimize mremap() by PTE batching Dev Jain
                     ` (2 preceding siblings ...)
  2025-06-10 13:18   ` Lorenzo Stoakes
@ 2025-06-11 14:00   ` David Hildenbrand
  2025-06-13  4:24     ` Dev Jain
  2025-06-13 12:32     ` Lorenzo Stoakes
  2025-06-12 12:13   ` Pedro Falcato
  4 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand @ 2025-06-11 14:00 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang,
	baolin.wang, ziy, hughd

On 10.06.25 05:50, Dev Jain wrote:
> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all 16
> entries to collect a/d bits. Hence this optimization will result in a 16x
> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> will eventually call contpte_try_unfold() on every contig block, thus
> flushing the TLB for the complete large folio range. Instead, use
> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> do them on the starting and ending contig block.
> 
> For split folios, there will be no pte batching; nr_ptes will be 1. For
> pagetable splitting, the ptes will still point to the same large folio;
> for arm64, this results in the optimization described above, and for other
> arches (including the general case), a minor improvement is expected due to
> a reduction in the number of function calls.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>   mm/mremap.c | 39 ++++++++++++++++++++++++++++++++-------
>   1 file changed, 32 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 180b12225368..18b215521ada 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>   	return pte;
>   }
>   
> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t pte, int max_nr)
> +{
> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +	struct folio *folio;
> +
> +	if (max_nr == 1)
> +		return 1;
> +
> +	folio = vm_normal_folio(vma, addr, pte);
> +	if (!folio || !folio_test_large(folio))
> +		return 1;
> +
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> +			       NULL, NULL);
> +}
> +
>   static int move_ptes(struct pagetable_move_control *pmc,
>   		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>   {
> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>   	struct mm_struct *mm = vma->vm_mm;
>   	pte_t *old_ptep, *new_ptep;
> -	pte_t pte;
> +	pte_t old_pte, pte;
>   	pmd_t dummy_pmdval;
>   	spinlock_t *old_ptl, *new_ptl;
>   	bool force_flush = false;
> @@ -185,6 +202,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>   	unsigned long new_addr = pmc->new_addr;
>   	unsigned long old_end = old_addr + extent;
>   	unsigned long len = old_end - old_addr;
> +	int max_nr_ptes;
> +	int nr_ptes;
>   	int err = 0;
>   
>   	/*
> @@ -236,14 +255,16 @@ static int move_ptes(struct pagetable_move_control *pmc,
>   	flush_tlb_batched_pending(vma->vm_mm);
>   	arch_enter_lazy_mmu_mode();
>   
> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> -				   new_ptep++, new_addr += PAGE_SIZE) {
> +	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
> +		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
>   		VM_WARN_ON_ONCE(!pte_none(*new_ptep));
>   
> -		if (pte_none(ptep_get(old_ptep)))
> +		nr_ptes = 1;
> +		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
> +		old_pte = ptep_get(old_ptep);
> +		if (pte_none(old_pte))
>   			continue;
>   
> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>   		/*
>   		 * If we are remapping a valid PTE, make sure
>   		 * to flush TLB before we drop the PTL for the
> @@ -255,8 +276,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>   		 * the TLB entry for the old mapping has been
>   		 * flushed.
>   		 */
> -		if (pte_present(pte))
> +		if (pte_present(old_pte)) {
> +			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> +							 old_pte, max_nr_ptes);
>   			force_flush = true;
> +		}
> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
>   		pte = move_pte(pte, old_addr, new_addr);
>   		pte = move_soft_dirty_pte(pte);
>   
> @@ -269,7 +294,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>   				else if (is_swap_pte(pte))
>   					pte = pte_swp_clear_uffd_wp(pte);
>   			}
> -			set_pte_at(mm, new_addr, new_ptep, pte);
> +			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);


What I dislike is that some paths work on a single PTE, and we implicitly have to know
that they don't apply for !pte_present.

Like
	if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))

Will not get batched yet. And that is hidden inside the pte_marker_uffd_wp check ...

Should we properly separate both paths (present vs. !present), and while at it, do
some more cleanups? I'm thinking of the following on top (only compile-tested)


diff --git a/mm/mremap.c b/mm/mremap.c
index 18b215521adae..b88abf02b34e0 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -155,21 +155,6 @@ static void drop_rmap_locks(struct vm_area_struct *vma)
                 i_mmap_unlock_write(vma->vm_file->f_mapping);
  }
  
-static pte_t move_soft_dirty_pte(pte_t pte)
-{
-       /*
-        * Set soft dirty bit so we can notice
-        * in userspace the ptes were moved.
-        */
-#ifdef CONFIG_MEM_SOFT_DIRTY
-       if (pte_present(pte))
-               pte = pte_mksoft_dirty(pte);
-       else if (is_swap_pte(pte))
-               pte = pte_swp_mksoft_dirty(pte);
-#endif
-       return pte;
-}
-
  static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
                 pte_t *ptep, pte_t pte, int max_nr)
  {
@@ -260,7 +245,6 @@ static int move_ptes(struct pagetable_move_control *pmc,
                 VM_WARN_ON_ONCE(!pte_none(*new_ptep));
  
                 nr_ptes = 1;
-               max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
                 old_pte = ptep_get(old_ptep);
                 if (pte_none(old_pte))
                         continue;
@@ -277,24 +261,34 @@ static int move_ptes(struct pagetable_move_control *pmc,
                  * flushed.
                  */
                 if (pte_present(old_pte)) {
+                       max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
                         nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
                                                          old_pte, max_nr_ptes);
                         force_flush = true;
-               }
-               pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
-               pte = move_pte(pte, old_addr, new_addr);
-               pte = move_soft_dirty_pte(pte);
-
-               if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
-                       pte_clear(mm, new_addr, new_ptep);
-               else {
-                       if (need_clear_uffd_wp) {
-                               if (pte_present(pte))
-                                       pte = pte_clear_uffd_wp(pte);
-                               else if (is_swap_pte(pte))
+
+                       pte = get_and_clear_full_ptes(mm, old_addr, old_ptep,
+                                                     nr_ptes, 0);
+                       /*
+                        * Moving present PTEs requires special care on some
+                        * archs.
+                        */
+                       pte = move_pte(pte, old_addr, new_addr);
+                       /* make userspace aware that this pte moved. */
+                       pte = pte_mksoft_dirty(pte);
+                       if (need_clear_uffd_wp)
+                               pte = pte_clear_uffd_wp(pte);
+                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
+               } else if (need_clear_uffd_wp && pte_marker_uffd_wp(pte)) {
+                       pte_clear(mm, old_addr, old_ptep);
+               } else {
+                       pte_clear(mm, old_addr, old_ptep);
+                       if (is_swap_pte(pte)) {
+                               if (need_clear_uffd_wp)
                                         pte = pte_swp_clear_uffd_wp(pte);
+                               /* make userspace aware that this pte moved. */
+                               pte = pte_swp_mksoft_dirty(pte);
                         }
-                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
+                       set_pte_at(mm, new_addr, new_ptep, pte);
                 }
         }
  


Note that I don't know why we had the existing

-               if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
-                       pte_clear(mm, new_addr, new_ptep);


I thought we would always expect that the destination pte is already pte_none() ?

-- 
Cheers,

David / dhildenb



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 1/2] mm: Call pointers to ptes as ptep
  2025-06-10  3:50 ` [PATCH v4 1/2] mm: Call pointers to ptes as ptep Dev Jain
  2025-06-11 13:23   ` David Hildenbrand
@ 2025-06-12 12:05   ` Pedro Falcato
  1 sibling, 0 replies; 22+ messages in thread
From: Pedro Falcato @ 2025-06-12 12:05 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Tue, Jun 10, 2025 at 09:20:42AM +0530, Dev Jain wrote:
> Avoid confusion between pte_t* and pte_t data types by suffixing pointer
> type variables with p. No functional change.
> 
> Reviewed-by: Barry Song <baohua@kernel.org>
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Thanks, this is less confusing!

Feel free to add:
Reviewed-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
  2025-06-10  3:50 ` [PATCH v4 2/2] mm: Optimize mremap() by PTE batching Dev Jain
                     ` (3 preceding siblings ...)
  2025-06-11 14:00   ` David Hildenbrand
@ 2025-06-12 12:13   ` Pedro Falcato
  4 siblings, 0 replies; 22+ messages in thread
From: Pedro Falcato @ 2025-06-12 12:13 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, linux-mm,
	linux-kernel, david, peterx, ryan.roberts, mingo, libang.li,
	maobibo, zhengqi.arch, baohua, anshuman.khandual, willy,
	ioworker0, yang, baolin.wang, ziy, hughd

On Tue, Jun 10, 2025 at 09:20:43AM +0530, Dev Jain wrote:
> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all 16
> entries to collect a/d bits. Hence this optimization will result in a 16x
> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> will eventually call contpte_try_unfold() on every contig block, thus
> flushing the TLB for the complete large folio range. Instead, use
> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> do them on the starting and ending contig block.
> 
> For split folios, there will be no pte batching; nr_ptes will be 1. For
> pagetable splitting, the ptes will still point to the same large folio;
> for arm64, this results in the optimization described above, and for other
> arches (including the general case), a minor improvement is expected due to
> a reduction in the number of function calls.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Reviewed-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
  2025-06-11 14:00   ` David Hildenbrand
@ 2025-06-13  4:24     ` Dev Jain
  2025-06-17  8:02       ` David Hildenbrand
  2025-06-13 12:32     ` Lorenzo Stoakes
  1 sibling, 1 reply; 22+ messages in thread
From: Dev Jain @ 2025-06-13  4:24 UTC (permalink / raw)
  To: David Hildenbrand, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang,
	baolin.wang, ziy, hughd


On 11/06/25 7:30 pm, David Hildenbrand wrote:
> On 10.06.25 05:50, Dev Jain wrote:
>> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>> are painted with the contig bit, then ptep_get() will iterate through 
>> all 16
>> entries to collect a/d bits. Hence this optimization will result in a 
>> 16x
>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>> will eventually call contpte_try_unfold() on every contig block, thus
>> flushing the TLB for the complete large folio range. Instead, use
>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, 
>> and only
>> do them on the starting and ending contig block.
>>
>> For split folios, there will be no pte batching; nr_ptes will be 1. For
>> pagetable splitting, the ptes will still point to the same large folio;
>> for arm64, this results in the optimization described above, and for 
>> other
>> arches (including the general case), a minor improvement is expected 
>> due to
>> a reduction in the number of function calls.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mremap.c | 39 ++++++++++++++++++++++++++++++++-------
>>   1 file changed, 32 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 180b12225368..18b215521ada 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>       return pte;
>>   }
>>   +static int mremap_folio_pte_batch(struct vm_area_struct *vma, 
>> unsigned long addr,
>> +        pte_t *ptep, pte_t pte, int max_nr)
>> +{
>> +    const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +    struct folio *folio;
>> +
>> +    if (max_nr == 1)
>> +        return 1;
>> +
>> +    folio = vm_normal_folio(vma, addr, pte);
>> +    if (!folio || !folio_test_large(folio))
>> +        return 1;
>> +
>> +    return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
>> +                   NULL, NULL);
>> +}
>> +
>>   static int move_ptes(struct pagetable_move_control *pmc,
>>           unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>>   {
>> @@ -177,7 +194,7 @@ static int move_ptes(struct 
>> pagetable_move_control *pmc,
>>       bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>       struct mm_struct *mm = vma->vm_mm;
>>       pte_t *old_ptep, *new_ptep;
>> -    pte_t pte;
>> +    pte_t old_pte, pte;
>>       pmd_t dummy_pmdval;
>>       spinlock_t *old_ptl, *new_ptl;
>>       bool force_flush = false;
>> @@ -185,6 +202,8 @@ static int move_ptes(struct 
>> pagetable_move_control *pmc,
>>       unsigned long new_addr = pmc->new_addr;
>>       unsigned long old_end = old_addr + extent;
>>       unsigned long len = old_end - old_addr;
>> +    int max_nr_ptes;
>> +    int nr_ptes;
>>       int err = 0;
>>         /*
>> @@ -236,14 +255,16 @@ static int move_ptes(struct 
>> pagetable_move_control *pmc,
>>       flush_tlb_batched_pending(vma->vm_mm);
>>       arch_enter_lazy_mmu_mode();
>>   -    for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>> -                   new_ptep++, new_addr += PAGE_SIZE) {
>> +    for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += 
>> nr_ptes * PAGE_SIZE,
>> +        new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
>>           VM_WARN_ON_ONCE(!pte_none(*new_ptep));
>>   -        if (pte_none(ptep_get(old_ptep)))
>> +        nr_ptes = 1;
>> +        max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>> +        old_pte = ptep_get(old_ptep);
>> +        if (pte_none(old_pte))
>>               continue;
>>   -        pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>           /*
>>            * If we are remapping a valid PTE, make sure
>>            * to flush TLB before we drop the PTL for the
>> @@ -255,8 +276,12 @@ static int move_ptes(struct 
>> pagetable_move_control *pmc,
>>            * the TLB entry for the old mapping has been
>>            * flushed.
>>            */
>> -        if (pte_present(pte))
>> +        if (pte_present(old_pte)) {
>> +            nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>> +                             old_pte, max_nr_ptes);
>>               force_flush = true;
>> +        }
>> +        pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, 
>> nr_ptes, 0);
>>           pte = move_pte(pte, old_addr, new_addr);
>>           pte = move_soft_dirty_pte(pte);
>>   @@ -269,7 +294,7 @@ static int move_ptes(struct 
>> pagetable_move_control *pmc,
>>                   else if (is_swap_pte(pte))
>>                       pte = pte_swp_clear_uffd_wp(pte);
>>               }
>> -            set_pte_at(mm, new_addr, new_ptep, pte);
>> +            set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>
>
> What I dislike is that some paths work on a single PTE, and we 
> implicitly have to know
> that they don't apply for !pte_present.
>
> Like
>     if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>
> Will not get batched yet. And that is hidden inside the 
> pte_marker_uffd_wp check ...
>
> Should we properly separate both paths (present vs. !present), and 
> while at it, do
> some more cleanups? I'm thinking of the following on top (only 
> compile-tested)

Good observation! Just one doubt, see below.


>
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 18b215521adae..b88abf02b34e0 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -155,21 +155,6 @@ static void drop_rmap_locks(struct vm_area_struct 
> *vma)
> i_mmap_unlock_write(vma->vm_file->f_mapping);
>  }
>
> -static pte_t move_soft_dirty_pte(pte_t pte)
> -{
> -       /*
> -        * Set soft dirty bit so we can notice
> -        * in userspace the ptes were moved.
> -        */
> -#ifdef CONFIG_MEM_SOFT_DIRTY
> -       if (pte_present(pte))
> -               pte = pte_mksoft_dirty(pte);
> -       else if (is_swap_pte(pte))
> -               pte = pte_swp_mksoft_dirty(pte);
> -#endif
> -       return pte;
> -}
> -
>  static int mremap_folio_pte_batch(struct vm_area_struct *vma, 
> unsigned long addr,
>                 pte_t *ptep, pte_t pte, int max_nr)
>  {
> @@ -260,7 +245,6 @@ static int move_ptes(struct pagetable_move_control 
> *pmc,
>                 VM_WARN_ON_ONCE(!pte_none(*new_ptep));
>
>                 nr_ptes = 1;
> -               max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>                 old_pte = ptep_get(old_ptep);
>                 if (pte_none(old_pte))
>                         continue;
> @@ -277,24 +261,34 @@ static int move_ptes(struct 
> pagetable_move_control *pmc,
>                  * flushed.
>                  */
>                 if (pte_present(old_pte)) {
> +                       max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>                         nr_ptes = mremap_folio_pte_batch(vma, 
> old_addr, old_ptep,
>                                                          old_pte, 
> max_nr_ptes);
>                         force_flush = true;
> -               }
> -               pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, 
> nr_ptes, 0);
> -               pte = move_pte(pte, old_addr, new_addr);
> -               pte = move_soft_dirty_pte(pte);
> -
> -               if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> -                       pte_clear(mm, new_addr, new_ptep);
> -               else {
> -                       if (need_clear_uffd_wp) {
> -                               if (pte_present(pte))
> -                                       pte = pte_clear_uffd_wp(pte);
> -                               else if (is_swap_pte(pte))
> +
> +                       pte = get_and_clear_full_ptes(mm, old_addr, 
> old_ptep,
> +                                                     nr_ptes, 0);
> +                       /*
> +                        * Moving present PTEs requires special care 
> on some
> +                        * archs.
> +                        */
> +                       pte = move_pte(pte, old_addr, new_addr);
> +                       /* make userspace aware that this pte moved. */
> +                       pte = pte_mksoft_dirty(pte);
> +                       if (need_clear_uffd_wp)
> +                               pte = pte_clear_uffd_wp(pte);
> +                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
> +               } else if (need_clear_uffd_wp && 
> pte_marker_uffd_wp(pte)) {
> +                       pte_clear(mm, old_addr, old_ptep);
> +               } else {
> +                       pte_clear(mm, old_addr, old_ptep);

Should pte_clear be included here? It is currently being done only for 
the case

need_clear_uffd_wp && pte_marker_uffd_wp().


> + if (is_swap_pte(pte)) {
> +                               if (need_clear_uffd_wp)
>                                         pte = pte_swp_clear_uffd_wp(pte);
> +                               /* make userspace aware that this pte 
> moved. */
> +                               pte = pte_swp_mksoft_dirty(pte);
>                         }
> -                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
> +                       set_pte_at(mm, new_addr, new_ptep, pte);
>                 }
>         }
>
>
>
> Note that I don't know why we had the existing
>
> -               if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> -                       pte_clear(mm, new_addr, new_ptep);
>
>
> I thought we would always expect that the destination pte is already 
> pte_none() ?
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
  2025-06-11 14:00   ` David Hildenbrand
  2025-06-13  4:24     ` Dev Jain
@ 2025-06-13 12:32     ` Lorenzo Stoakes
  2025-06-16 16:13       ` David Hildenbrand
  1 sibling, 1 reply; 22+ messages in thread
From: Lorenzo Stoakes @ 2025-06-13 12:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dev Jain, akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang,
	baolin.wang, ziy, hughd

On Wed, Jun 11, 2025 at 04:00:41PM +0200, David Hildenbrand wrote:
> On 10.06.25 05:50, Dev Jain wrote:
> > Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> > are painted with the contig bit, then ptep_get() will iterate through all 16
> > entries to collect a/d bits. Hence this optimization will result in a 16x
> > reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> > will eventually call contpte_try_unfold() on every contig block, thus
> > flushing the TLB for the complete large folio range. Instead, use
> > get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> > do them on the starting and ending contig block.
> >
> > For split folios, there will be no pte batching; nr_ptes will be 1. For
> > pagetable splitting, the ptes will still point to the same large folio;
> > for arm64, this results in the optimization described above, and for other
> > arches (including the general case), a minor improvement is expected due to
> > a reduction in the number of function calls.
> >
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > ---
> >   mm/mremap.c | 39 ++++++++++++++++++++++++++++++++-------
> >   1 file changed, 32 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 180b12225368..18b215521ada 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
> >   	return pte;
> >   }
> > +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> > +		pte_t *ptep, pte_t pte, int max_nr)
> > +{
> > +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> > +	struct folio *folio;
> > +
> > +	if (max_nr == 1)
> > +		return 1;
> > +
> > +	folio = vm_normal_folio(vma, addr, pte);
> > +	if (!folio || !folio_test_large(folio))
> > +		return 1;
> > +
> > +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> > +			       NULL, NULL);
> > +}
> > +
> >   static int move_ptes(struct pagetable_move_control *pmc,
> >   		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
> >   {
> > @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
> >   	struct mm_struct *mm = vma->vm_mm;
> >   	pte_t *old_ptep, *new_ptep;
> > -	pte_t pte;
> > +	pte_t old_pte, pte;
> >   	pmd_t dummy_pmdval;
> >   	spinlock_t *old_ptl, *new_ptl;
> >   	bool force_flush = false;
> > @@ -185,6 +202,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >   	unsigned long new_addr = pmc->new_addr;
> >   	unsigned long old_end = old_addr + extent;
> >   	unsigned long len = old_end - old_addr;
> > +	int max_nr_ptes;
> > +	int nr_ptes;
> >   	int err = 0;
> >   	/*
> > @@ -236,14 +255,16 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >   	flush_tlb_batched_pending(vma->vm_mm);
> >   	arch_enter_lazy_mmu_mode();
> > -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> > -				   new_ptep++, new_addr += PAGE_SIZE) {
> > +	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
> > +		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
> >   		VM_WARN_ON_ONCE(!pte_none(*new_ptep));
> > -		if (pte_none(ptep_get(old_ptep)))
> > +		nr_ptes = 1;
> > +		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
> > +		old_pte = ptep_get(old_ptep);
> > +		if (pte_none(old_pte))
> >   			continue;
> > -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
> >   		/*
> >   		 * If we are remapping a valid PTE, make sure
> >   		 * to flush TLB before we drop the PTL for the
> > @@ -255,8 +276,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >   		 * the TLB entry for the old mapping has been
> >   		 * flushed.
> >   		 */
> > -		if (pte_present(pte))
> > +		if (pte_present(old_pte)) {
> > +			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> > +							 old_pte, max_nr_ptes);
> >   			force_flush = true;
> > +		}
> > +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
> >   		pte = move_pte(pte, old_addr, new_addr);
> >   		pte = move_soft_dirty_pte(pte);
> > @@ -269,7 +294,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> >   				else if (is_swap_pte(pte))
> >   					pte = pte_swp_clear_uffd_wp(pte);
> >   			}
> > -			set_pte_at(mm, new_addr, new_ptep, pte);
> > +			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>
>
> What I dislike is that some paths work on a single PTE, and we implicitly have to know
> that they don't apply for !pte_present.

I hate any kind of implicit knowledge like this.

>
> Like
> 	if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))

I also despise [with words I cannot use on-list] how uffd is implemented.

It's _nothing but_ ad-hoc stuff like this spawned all around the place.

It's hateful.

>
> Will not get batched yet. And that is hidden inside the pte_marker_uffd_wp check ...
>
> Should we properly separate both paths (present vs. !present), and while at it, do
> some more cleanups? I'm thinking of the following on top (only compile-tested)

I'd like to see that, but I think maybe better as a follow up series?

On the other hand, this does improve this quite a bit. Could also be another
patch in the series.

>
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 18b215521adae..b88abf02b34e0 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -155,21 +155,6 @@ static void drop_rmap_locks(struct vm_area_struct *vma)
>                 i_mmap_unlock_write(vma->vm_file->f_mapping);
>  }
> -static pte_t move_soft_dirty_pte(pte_t pte)
> -{
> -       /*
> -        * Set soft dirty bit so we can notice
> -        * in userspace the ptes were moved.
> -        */
> -#ifdef CONFIG_MEM_SOFT_DIRTY
> -       if (pte_present(pte))
> -               pte = pte_mksoft_dirty(pte);
> -       else if (is_swap_pte(pte))
> -               pte = pte_swp_mksoft_dirty(pte);
> -#endif
> -       return pte;
> -}
> -
>  static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>                 pte_t *ptep, pte_t pte, int max_nr)
>  {
> @@ -260,7 +245,6 @@ static int move_ptes(struct pagetable_move_control *pmc,
>                 VM_WARN_ON_ONCE(!pte_none(*new_ptep));
>                 nr_ptes = 1;
> -               max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>                 old_pte = ptep_get(old_ptep);
>                 if (pte_none(old_pte))
>                         continue;
> @@ -277,24 +261,34 @@ static int move_ptes(struct pagetable_move_control *pmc,
>                  * flushed.
>                  */
>                 if (pte_present(old_pte)) {
> +                       max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>                         nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>                                                          old_pte, max_nr_ptes);
>                         force_flush = true;
> -               }
> -               pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
> -               pte = move_pte(pte, old_addr, new_addr);
> -               pte = move_soft_dirty_pte(pte);
> -
> -               if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> -                       pte_clear(mm, new_addr, new_ptep);
> -               else {
> -                       if (need_clear_uffd_wp) {
> -                               if (pte_present(pte))
> -                                       pte = pte_clear_uffd_wp(pte);
> -                               else if (is_swap_pte(pte))
> +
> +                       pte = get_and_clear_full_ptes(mm, old_addr, old_ptep,
> +                                                     nr_ptes, 0);
> +                       /*
> +                        * Moving present PTEs requires special care on some
> +                        * archs.
> +                        */
> +                       pte = move_pte(pte, old_addr, new_addr);

I guess we're good with only doing this in pte_present() case because the only
arch that implements this, sparc, does a present check anyway.

> +                       /* make userspace aware that this pte moved. */
> +                       pte = pte_mksoft_dirty(pte);
> +                       if (need_clear_uffd_wp)
> +                               pte = pte_clear_uffd_wp(pte);
> +                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
> +               } else if (need_clear_uffd_wp && pte_marker_uffd_wp(pte)) {
> +                       pte_clear(mm, old_addr, old_ptep);

Same comment as below re: pte_clear().

I see you've dropped pte_clear(mm, new_addr, new_ptep) which I guess is
purposefully?

I do think that it is pointless yes.

> +               } else {
> +                       pte_clear(mm, old_addr, old_ptep);

I guess this is intended to replace ptep_get_and_clear_full_ptes() above in the
single PTE case... no?  Is this sufficient?

In the original code we'd always do ptep_get_and_clear().

I think the key difference is page_table_check_pte_clear().

I notice, hilariously, that there is a ptep_clear() that _does_ call this. What
a mess.


> +                       if (is_swap_pte(pte)) {
> +                               if (need_clear_uffd_wp)
>                                         pte = pte_swp_clear_uffd_wp(pte);
> +                               /* make userspace aware that this pte moved. */
> +                               pte = pte_swp_mksoft_dirty(pte);
>                         }
> -                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
> +                       set_pte_at(mm, new_addr, new_ptep, pte);
>                 }
>         }
>
>
> Note that I don't know why we had the existing
>
> -               if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> -                       pte_clear(mm, new_addr, new_ptep);
>
>
> I thought we would always expect that the destination pte is already pte_none() ?

I think this is because we already did the move_pte() call in the original code
before checking this:

		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
		pte = move_pte(pte, old_addr, new_addr);
		pte = move_soft_dirty_pte(pte);

		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
			pte_clear(mm, new_addr, new_ptep);

But maybe it's because there was a presumption move_pte() would like you know,
move a PTE entry? Which it doesn't, it - only on SPARC - does a hook to flush
the dcache.

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
  2025-06-13 12:32     ` Lorenzo Stoakes
@ 2025-06-16 16:13       ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2025-06-16 16:13 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Dev Jain, akpm, Liam.Howlett, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang,
	baolin.wang, ziy, hughd

[...]

>>>    			}
>>> -			set_pte_at(mm, new_addr, new_ptep, pte);
>>> +			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>>
>>
>> What I dislike is that some paths work on a single PTE, and we implicitly have to know
>> that they don't apply for !pte_present.
> 
> I hate any kind of implicit knowledge like this.
> 
>>
>> Like
>> 	if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> 
> I also despise [with words I cannot use on-list] how uffd is implemented.
> 
> It's _nothing but_ ad-hoc stuff like this spawned all around the place.
> 
> It's hateful.
> 
>>
>> Will not get batched yet. And that is hidden inside the pte_marker_uffd_wp check ...
>>
>> Should we properly separate both paths (present vs. !present), and while at it, do
>> some more cleanups? I'm thinking of the following on top (only compile-tested)
> 
> I'd like to see that, but I think maybe better as a follow up series?

I'd do the split as a cleanup patch upfront.

Then add the batching for the pte_present() case.

> 
> On the other hand, this does improve this quite a bit. Could also be another
> patch in the series.
> 
>>
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 18b215521adae..b88abf02b34e0 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -155,21 +155,6 @@ static void drop_rmap_locks(struct vm_area_struct *vma)
>>                  i_mmap_unlock_write(vma->vm_file->f_mapping);
>>   }
>> -static pte_t move_soft_dirty_pte(pte_t pte)
>> -{
>> -       /*
>> -        * Set soft dirty bit so we can notice
>> -        * in userspace the ptes were moved.
>> -        */
>> -#ifdef CONFIG_MEM_SOFT_DIRTY
>> -       if (pte_present(pte))
>> -               pte = pte_mksoft_dirty(pte);
>> -       else if (is_swap_pte(pte))
>> -               pte = pte_swp_mksoft_dirty(pte);
>> -#endif
>> -       return pte;
>> -}
>> -
>>   static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>>                  pte_t *ptep, pte_t pte, int max_nr)
>>   {
>> @@ -260,7 +245,6 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>                  VM_WARN_ON_ONCE(!pte_none(*new_ptep));
>>                  nr_ptes = 1;
>> -               max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>>                  old_pte = ptep_get(old_ptep);
>>                  if (pte_none(old_pte))
>>                          continue;
>> @@ -277,24 +261,34 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>                   * flushed.
>>                   */
>>                  if (pte_present(old_pte)) {
>> +                       max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>>                          nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>>                                                           old_pte, max_nr_ptes);
>>                          force_flush = true;
>> -               }
>> -               pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
>> -               pte = move_pte(pte, old_addr, new_addr);
>> -               pte = move_soft_dirty_pte(pte);
>> -
>> -               if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>> -                       pte_clear(mm, new_addr, new_ptep);
>> -               else {
>> -                       if (need_clear_uffd_wp) {
>> -                               if (pte_present(pte))
>> -                                       pte = pte_clear_uffd_wp(pte);
>> -                               else if (is_swap_pte(pte))
>> +
>> +                       pte = get_and_clear_full_ptes(mm, old_addr, old_ptep,
>> +                                                     nr_ptes, 0);
>> +                       /*
>> +                        * Moving present PTEs requires special care on some
>> +                        * archs.
>> +                        */
>> +                       pte = move_pte(pte, old_addr, new_addr);
> 
> I guess we're good with only doing this in pte_present() case because the only
> arch that implements this, sparc, does a present check anyway.

Yes, we could then remove the call from pte_present(), see below.

> 
>> +                       /* make userspace aware that this pte moved. */
>> +                       pte = pte_mksoft_dirty(pte);
>> +                       if (need_clear_uffd_wp)
>> +                               pte = pte_clear_uffd_wp(pte);
>> +                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>> +               } else if (need_clear_uffd_wp && pte_marker_uffd_wp(pte)) {
>> +                       pte_clear(mm, old_addr, old_ptep);
> 
> Same comment as below re: pte_clear().
> 
> I see you've dropped pte_clear(mm, new_addr, new_ptep) which I guess is
> purposefully?
> 
> I do think that it is pointless yes.

Yeah, that's what I raised below.

> 
>> +               } else {
>> +                       pte_clear(mm, old_addr, old_ptep);
> 
> I guess this is intended to replace ptep_get_and_clear_full_ptes() above in the
> single PTE case... no?  Is this sufficient?
> 
> In the original code we'd always do ptep_get_and_clear().
> 
> I think the key difference is page_table_check_pte_clear().
> 
> I notice, hilariously, that there is a ptep_clear() that _does_ call this. What
> a mess.

ptep_get_and_clear() is only relevant when something is present and 
could change concurrently (especially A/D bits managed by HW).

We already obtained the pte, it's not present, and now just want to 
clear it.

> 
> 
>> +                       if (is_swap_pte(pte)) {
>> +                               if (need_clear_uffd_wp)
>>                                          pte = pte_swp_clear_uffd_wp(pte);
>> +                               /* make userspace aware that this pte moved. */
>> +                               pte = pte_swp_mksoft_dirty(pte);
>>                          }
>> -                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>> +                       set_pte_at(mm, new_addr, new_ptep, pte);
>>                  }
>>          }
>>
>>
>> Note that I don't know why we had the existing
>>
>> -               if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>> -                       pte_clear(mm, new_addr, new_ptep);
>>
>>
>> I thought we would always expect that the destination pte is already pte_none() ?
> 
> I think this is because we already did the move_pte() call in the original code
> before checking this:

Oh, maybe that's why.

> 
> 		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
> 		pte = move_pte(pte, old_addr, new_addr);
> 		pte = move_soft_dirty_pte(pte);
> 
> 		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> 			pte_clear(mm, new_addr, new_ptep);
> 
> But maybe it's because there was a presumption move_pte() would like you know,
> move a PTE entry? Which it doesn't, it - only on SPARC - does a hook to flush
> the dcache.

I wish we could remove move_pte() completely. Or just notify about the 
move of a present pte ... because it doesn't ever modify the pte val.

So renaming it to "arch_notify_move_pte()" or sth. like that, and making 
it a void function, might even be better.

... I think we could go one step further if we already have the folio: 
we could call it arch_notify_move_folio_pte(), and simplify the sparc 
implementation ...

Anyhow, the move_pte() cleanup can be done separately. Splitting the 
present from !present case here should probably be done as a cleanup 
upfront.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread
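
Pulling the clearing discussion above into one place, a minimal sketch
of the two cases (illustrative only, modelled on the refactor proposed
earlier in this thread rather than on any final code):

	if (pte_present(old_pte)) {
		/*
		 * Hardware may still be updating access/dirty bits, so
		 * the old entries must be read and cleared atomically,
		 * batched across the folio.
		 */
		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep,
					      nr_ptes, 0);
	} else {
		/*
		 * A swap entry or marker cannot change underneath us once
		 * it has been read, so a plain clear of the single old
		 * entry is enough.
		 */
		pte_clear(mm, old_addr, old_ptep);
	}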

* Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
  2025-06-10  8:11       ` Barry Song
@ 2025-06-16 21:27         ` Ryan Roberts
  0 siblings, 0 replies; 22+ messages in thread
From: Ryan Roberts @ 2025-06-16 21:27 UTC (permalink / raw)
  To: Barry Song, Dev Jain
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	linux-mm, linux-kernel, david, peterx, mingo, libang.li, maobibo,
	zhengqi.arch, anshuman.khandual, willy, ioworker0, yang,
	baolin.wang, ziy, hughd

On 10/06/2025 09:11, Barry Song wrote:
> On Tue, Jun 10, 2025 at 7:45 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>> On 10/06/25 12:33 pm, Barry Song wrote:
>>> Hi Dev,
>>>
>>> On Tue, Jun 10, 2025 at 3:51 PM Dev Jain <dev.jain@arm.com> wrote:
>>>> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>>>> are painted with the contig bit, then ptep_get() will iterate through all 16
>>>> entries to collect a/d bits. Hence this optimization will result in a 16x
>>>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>>>> will eventually call contpte_try_unfold() on every contig block, thus
>>>> flushing the TLB for the complete large folio range. Instead, use
>>>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>>>> do them on the starting and ending contig block.
>>>>
>>>> For split folios, there will be no pte batching; nr_ptes will be 1. For
>>>> pagetable splitting, the ptes will still point to the same large folio;
>>>> for arm64, this results in the optimization described above, and for other
>>>> arches (including the general case), a minor improvement is expected due to
>>>> a reduction in the number of function calls.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>   mm/mremap.c | 39 ++++++++++++++++++++++++++++++++-------
>>>>   1 file changed, 32 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/mm/mremap.c b/mm/mremap.c
>>>> index 180b12225368..18b215521ada 100644
>>>> --- a/mm/mremap.c
>>>> +++ b/mm/mremap.c
>>>> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>>>          return pte;
>>>>   }
>>>>
>>>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>>>> +               pte_t *ptep, pte_t pte, int max_nr)
>>>> +{
>>>> +       const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>> +       struct folio *folio;
>>>> +
>>>> +       if (max_nr == 1)
>>>> +               return 1;
>>>> +
>>>> +       folio = vm_normal_folio(vma, addr, pte);
>>>> +       if (!folio || !folio_test_large(folio))
>>> I'm curious about the following case:
>>> If the addr/ptep is not the first subpage of the folio—for example, the
>>> 14th subpage—will mremap_folio_pte_batch() return 3?
>>
>> It will return the number of PTEs, starting from the PTE pointing to the 14th
>> subpage, that point to consecutive pages of the same large folio, up till max_nr.
>> For an example, if we are operating on a single large folio of order 4, then max_nr
>> will be 16 - 14 + 1 = 3. So in this case we will return 3, since the 14th, 15th and
>> 16th PTE point to consec pages of the same large folio.
>>
>>> If so, get_and_clear_full_ptes() would operate on 3 subpages of the folio.
>>> In that case, can unfold still work correctly?
>>
>> Yes, first we unfold as in, we do a BBM sequence: cont -> clear -> non-cont.
>> Then, on this non-contig block, we will clear only the PTEs which were asked
>> for us to do.
> 
> While going through the code,
> 
> static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
>                                 unsigned long addr, pte_t *ptep,
>                                 unsigned int nr, int full)
> {
>         pte_t pte;
>         if (likely(nr == 1)) {
>                 contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>                 pte = __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
>         } else {
>                 pte = contpte_get_and_clear_full_ptes(mm, addr, ptep, nr, full);
>         }
> 
>         return pte;
> }
> 
> Initially, I thought it only unfolded when nr == 1, but after reading
> contpte_get_and_clear_full_ptes more closely, I realized we do
> support partial unfolding—that's what I had missed.
> 
> pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>                                 unsigned long addr, pte_t *ptep,
>                                 unsigned int nr, int full)
> {
>         contpte_try_unfold_partial(mm, addr, ptep, nr);
>         return __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
> }
> 
> I think you are right.

Yes, Dev is correct; this works as intended. And yes, the code looks a bit dumb,
but IIRC, this inline special-casing on nr=1 was needed to prevent fork and/or
munmap microbenchmarks from regressing for the common small folio case.

Thanks,
Ryan

> 
>>
>>>
>>> Similarly, if the addr/ptep points to the first subpage, but max_nr is
>>> less than CONT_PTES, what will happen in that case?
>>>
>>>
>>>> +               return 1;
>>>> +
>>>> +       return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
>>>> +                              NULL, NULL);
>>>> +}
>>>> +
>>>>   static int move_ptes(struct pagetable_move_control *pmc,
>>>>                  unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>>>>   {
>>>> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>          bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>>>          struct mm_struct *mm = vma->vm_mm;
>>>>          pte_t *old_ptep, *new_ptep;
>>>> -       pte_t pte;
>>>> +       pte_t old_pte, pte;
>>>>          pmd_t dummy_pmdval;
>>>>          spinlock_t *old_ptl, *new_ptl;
>>>>          bool force_flush = false;
>>>> @@ -185,6 +202,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>          unsigned long new_addr = pmc->new_addr;
>>>>          unsigned long old_end = old_addr + extent;
>>>>          unsigned long len = old_end - old_addr;
>>>> +       int max_nr_ptes;
>>>> +       int nr_ptes;
>>>>          int err = 0;
>>>>
>>>>          /*
>>>> @@ -236,14 +255,16 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>          flush_tlb_batched_pending(vma->vm_mm);
>>>>          arch_enter_lazy_mmu_mode();
>>>>
>>>> -       for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>>>> -                                  new_ptep++, new_addr += PAGE_SIZE) {
>>>> +       for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
>>>> +               new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
>>>>                  VM_WARN_ON_ONCE(!pte_none(*new_ptep));
>>>>
>>>> -               if (pte_none(ptep_get(old_ptep)))
>>>> +               nr_ptes = 1;
>>>> +               max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>>>> +               old_pte = ptep_get(old_ptep);
>>>> +               if (pte_none(old_pte))
>>>>                          continue;
>>>>
>>>> -               pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>>>                  /*
>>>>                   * If we are remapping a valid PTE, make sure
>>>>                   * to flush TLB before we drop the PTL for the
>>>> @@ -255,8 +276,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>                   * the TLB entry for the old mapping has been
>>>>                   * flushed.
>>>>                   */
>>>> -               if (pte_present(pte))
>>>> +               if (pte_present(old_pte)) {
>>>> +                       nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>>>> +                                                        old_pte, max_nr_ptes);
>>>>                          force_flush = true;
>>>> +               }
>>>> +               pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
>>>>                  pte = move_pte(pte, old_addr, new_addr);
>>>>                  pte = move_soft_dirty_pte(pte);
>>>>
>>>> @@ -269,7 +294,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>                                  else if (is_swap_pte(pte))
>>>>                                          pte = pte_swp_clear_uffd_wp(pte);
>>>>                          }
>>>> -                       set_pte_at(mm, new_addr, new_ptep, pte);
>>>> +                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>>>>                  }
>>>>          }
>>>>
>>>> --
>>>> 2.30.2
> 
> Thanks
> Barry



^ permalink raw reply	[flat|nested] 22+ messages in thread
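
To make the partial-batch case discussed above concrete, a minimal
sketch with assumed numbers (a 64K folio of order 4, i.e. 16 PTEs, with
old_ptep pointing at its 14th subpage and the remap range ending at the
folio boundary):

	/* only the 14th, 15th and 16th entries of the folio are in range */
	max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;	/* == 3 */
	nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
					 old_pte, max_nr_ptes);	/* == 3 */

	/*
	 * On arm64, contpte_try_unfold_partial() unfolds the contig block
	 * covering these entries before the three PTEs are cleared; the
	 * generic implementation clears the three entries one by one,
	 * accumulating their dirty/young bits into the returned pte.
	 */
	pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);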

* Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
  2025-06-13  4:24     ` Dev Jain
@ 2025-06-17  8:02       ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2025-06-17  8:02 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato, linux-mm,
	linux-kernel, peterx, ryan.roberts, mingo, libang.li, maobibo,
	zhengqi.arch, baohua, anshuman.khandual, willy, ioworker0, yang,
	baolin.wang, ziy, hughd


>> -               pte = get_and_clear_full_ptes(mm, old_addr, old_ptep,
>> nr_ptes, 0);
>> -               pte = move_pte(pte, old_addr, new_addr);
>> -               pte = move_soft_dirty_pte(pte);
>> -
>> -               if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>> -                       pte_clear(mm, new_addr, new_ptep);
>> -               else {
>> -                       if (need_clear_uffd_wp) {
>> -                               if (pte_present(pte))
>> -                                       pte = pte_clear_uffd_wp(pte);
>> -                               else if (is_swap_pte(pte))
>> +
>> +                       pte = get_and_clear_full_ptes(mm, old_addr,
>> old_ptep,
>> +                                                     nr_ptes, 0);
>> +                       /*
>> +                        * Moving present PTEs requires special care
>> on some
>> +                        * archs.
>> +                        */
>> +                       pte = move_pte(pte, old_addr, new_addr);
>> +                       /* make userspace aware that this pte moved. */
>> +                       pte = pte_mksoft_dirty(pte);
>> +                       if (need_clear_uffd_wp)
>> +                               pte = pte_clear_uffd_wp(pte);
>> +                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>> +               } else if (need_clear_uffd_wp &&
>> pte_marker_uffd_wp(pte)) {
>> +                       pte_clear(mm, old_addr, old_ptep);
>> +               } else {
>> +                       pte_clear(mm, old_addr, old_ptep);
> 
> Should pte_clear be included here? It is currently being done only for
> the case
> 
> need_clear_uffd_wp && pte_marker_uffd_wp().

We always cleared the old pte (using get_and_clear_*, which is not 
required if we already know that it is !present).

The case you mean was what I describe here:

>>
>>
>>
>> Note that I don't know why we had the existing
>>
>> -               if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>> -                       pte_clear(mm, new_addr, new_ptep);
>>
>>
>> I thought we would always expect that the destination pte is already
>> pte_none() ?

So clearing the destination PTE can be dropped if we didn't move in the 
first place.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread

Thread overview: 22+ messages
2025-06-10  3:50 [PATCH v4 0/2] Optimize mremap() for large folios Dev Jain
2025-06-10  3:50 ` [PATCH v4 1/2] mm: Call pointers to ptes as ptep Dev Jain
2025-06-11 13:23   ` David Hildenbrand
2025-06-11 13:25     ` Dev Jain
2025-06-11 13:29       ` Lorenzo Stoakes
2025-06-11 13:31         ` David Hildenbrand
2025-06-12 12:05   ` Pedro Falcato
2025-06-10  3:50 ` [PATCH v4 2/2] mm: Optimize mremap() by PTE batching Dev Jain
2025-06-10  7:03   ` Barry Song
2025-06-10  7:44     ` Dev Jain
2025-06-10  8:11       ` Barry Song
2025-06-16 21:27         ` Ryan Roberts
2025-06-10  8:37   ` Barry Song
2025-06-10 13:18   ` Lorenzo Stoakes
2025-06-11 14:00   ` David Hildenbrand
2025-06-13  4:24     ` Dev Jain
2025-06-17  8:02       ` David Hildenbrand
2025-06-13 12:32     ` Lorenzo Stoakes
2025-06-16 16:13       ` David Hildenbrand
2025-06-12 12:13   ` Pedro Falcato
2025-06-10 12:11 ` [PATCH v4 0/2] Optimize mremap() for large folios Lorenzo Stoakes
2025-06-10 12:33   ` Dev Jain
