* [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
@ 2025-08-10 6:29 Lokesh Gidra
2025-08-11 3:55 ` Barry Song
0 siblings, 1 reply; 9+ messages in thread
From: Lokesh Gidra @ 2025-08-10 6:29 UTC (permalink / raw)
To: akpm
Cc: aarcange, linux-mm, linux-kernel, 21cnbao, ngeoffray,
Lokesh Gidra, Suren Baghdasaryan, Kalesh Singh, Barry Song,
David Hildenbrand, Peter Xu
MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
for moving present pages. Mitigate this cost by opportunistically
batching present contiguous pages for TLB flushing.
Without batching, in our testing on an arm64 Android device with UFFD GC,
which uses MOVE ioctl for compaction, we observed that out of the total
time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
~20% in vm_normal_folio().
With batching, the proportion of vm_normal_folio() increases to over
70% of move_pages_pte() without any changes to vm_normal_folio().
Furthermore, time spent within move_pages_pte() is only ~20%, which
includes TLB-flush overhead.
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
---
Changes since v3 [1]
- Fix uninitialized 'step_size' warning, per Dan Carpenter
- Removed pmd_none() from check_ptes_for_batched_move(), per Peter Xu
- Removed flush_cache_range() in zero-page case, per Peter Xu
- Added comment to explain why folio reference for batched pages is not
required, per Peter Xu
- Use MIN() in calculation of largest extent that can be batched under
the same src and dst PTLs, per Peter Xu
- Release first folio's reference in move_present_ptes(), per Peter Xu
Changes since v2 [2]
- Addressed VM_WARN_ON failure, per Lorenzo Stoakes
- Added check to ensure all batched pages share the same anon_vma
Changes since v1 [3]
- Removed flush_tlb_batched_pending(), per Barry Song
- Unified single and multi page case, per Barry Song
[1] https://lore.kernel.org/all/20250807103902.2242717-1-lokeshgidra@google.com/
[2] https://lore.kernel.org/all/20250805121410.1658418-1-lokeshgidra@google.com/
[3] https://lore.kernel.org/all/20250731104726.103071-1-lokeshgidra@google.com/
mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
1 file changed, 127 insertions(+), 51 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index cbed91b09640..39d81d2972db 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
}
-static int move_present_pte(struct mm_struct *mm,
- struct vm_area_struct *dst_vma,
- struct vm_area_struct *src_vma,
- unsigned long dst_addr, unsigned long src_addr,
- pte_t *dst_pte, pte_t *src_pte,
- pte_t orig_dst_pte, pte_t orig_src_pte,
- pmd_t *dst_pmd, pmd_t dst_pmdval,
- spinlock_t *dst_ptl, spinlock_t *src_ptl,
- struct folio *src_folio)
+/*
+ * Checks if the two ptes and the corresponding folio are eligible for batched
+ * move. If so, then returns pointer to the locked folio. Otherwise, returns NULL.
+ *
+ * NOTE: folio's reference is not required as the whole operation is within
+ * PTL's critical section.
+ */
+static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
+ unsigned long src_addr,
+ pte_t *src_pte, pte_t *dst_pte,
+ struct anon_vma *src_anon_vma)
+{
+ pte_t orig_dst_pte, orig_src_pte;
+ struct folio *folio;
+
+ orig_dst_pte = ptep_get(dst_pte);
+ if (!pte_none(orig_dst_pte))
+ return NULL;
+
+ orig_src_pte = ptep_get(src_pte);
+ if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
+ return NULL;
+
+ folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
+ if (!folio || !folio_trylock(folio))
+ return NULL;
+ if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
+ folio_anon_vma(folio) != src_anon_vma) {
+ folio_unlock(folio);
+ return NULL;
+ }
+ return folio;
+}
+
+static long move_present_ptes(struct mm_struct *mm,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma,
+ unsigned long dst_addr, unsigned long src_addr,
+ pte_t *dst_pte, pte_t *src_pte,
+ pte_t orig_dst_pte, pte_t orig_src_pte,
+ pmd_t *dst_pmd, pmd_t dst_pmdval,
+ spinlock_t *dst_ptl, spinlock_t *src_ptl,
+ struct folio **first_src_folio, unsigned long len,
+ struct anon_vma *src_anon_vma)
{
int err = 0;
+ struct folio *src_folio = *first_src_folio;
+ unsigned long src_start = src_addr;
+ unsigned long addr_end;
+
+ if (len > PAGE_SIZE) {
+ addr_end = (dst_addr + PMD_SIZE) & PMD_MASK;
+ len = MIN(addr_end - dst_addr, len);
+ addr_end = (src_addr + PMD_SIZE) & PMD_MASK;
+ len = MIN(addr_end - src_addr, len);
+ }
+ flush_cache_range(src_vma, src_addr, src_addr + len);
double_pt_lock(dst_ptl, src_ptl);
if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1051,31 +1097,57 @@ static int move_present_pte(struct mm_struct *mm,
err = -EBUSY;
goto out;
}
+ /* It's safe to drop the reference now as the page-table is holding one. */
+ folio_put(*first_src_folio);
+ *first_src_folio = NULL;
+ arch_enter_lazy_mmu_mode();
+
+ addr_end = src_addr + len;
+ while (true) {
+ orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
+ /* Folio got pinned from under us. Put it back and fail the move. */
+ if (folio_maybe_dma_pinned(src_folio)) {
+ set_pte_at(mm, src_addr, src_pte, orig_src_pte);
+ err = -EBUSY;
+ break;
+ }
- orig_src_pte = ptep_clear_flush(src_vma, src_addr, src_pte);
- /* Folio got pinned from under us. Put it back and fail the move. */
- if (folio_maybe_dma_pinned(src_folio)) {
- set_pte_at(mm, src_addr, src_pte, orig_src_pte);
- err = -EBUSY;
- goto out;
- }
-
- folio_move_anon_rmap(src_folio, dst_vma);
- src_folio->index = linear_page_index(dst_vma, dst_addr);
+ folio_move_anon_rmap(src_folio, dst_vma);
+ src_folio->index = linear_page_index(dst_vma, dst_addr);
- orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
- /* Set soft dirty bit so userspace can notice the pte was moved */
+ orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
+ /* Set soft dirty bit so userspace can notice the pte was moved */
#ifdef CONFIG_MEM_SOFT_DIRTY
- orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
+ orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
#endif
- if (pte_dirty(orig_src_pte))
- orig_dst_pte = pte_mkdirty(orig_dst_pte);
- orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
+ if (pte_dirty(orig_src_pte))
+ orig_dst_pte = pte_mkdirty(orig_dst_pte);
+ orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
+ set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
+
+ src_addr += PAGE_SIZE;
+ if (src_addr == addr_end)
+ break;
+ dst_addr += PAGE_SIZE;
+ dst_pte++;
+ src_pte++;
+
+ folio_unlock(src_folio);
+ src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
+ dst_pte, src_anon_vma);
+ if (!src_folio)
+ break;
+ }
+
+ arch_leave_lazy_mmu_mode();
+ if (src_addr > src_start)
+ flush_tlb_range(src_vma, src_start, src_addr);
- set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
+ if (src_folio)
+ folio_unlock(src_folio);
out:
double_pt_unlock(dst_ptl, src_ptl);
- return err;
+ return src_addr > src_start ? src_addr - src_start : err;
}
static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
@@ -1140,7 +1212,7 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
double_pt_unlock(dst_ptl, src_ptl);
- return 0;
+ return PAGE_SIZE;
}
static int move_zeropage_pte(struct mm_struct *mm,
@@ -1167,20 +1239,19 @@ static int move_zeropage_pte(struct mm_struct *mm,
set_pte_at(mm, dst_addr, dst_pte, zero_pte);
double_pt_unlock(dst_ptl, src_ptl);
- return 0;
+ return PAGE_SIZE;
}
/*
- * The mmap_lock for reading is held by the caller. Just move the page
- * from src_pmd to dst_pmd if possible, and return true if succeeded
- * in moving the page.
+ * The mmap_lock for reading is held by the caller. Just move the page(s)
+ * from src_pmd to dst_pmd if possible, and return number of bytes moved.
*/
-static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
- struct vm_area_struct *dst_vma,
- struct vm_area_struct *src_vma,
- unsigned long dst_addr, unsigned long src_addr,
- __u64 mode)
+static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma,
+ unsigned long dst_addr, unsigned long src_addr,
+ unsigned long len, __u64 mode)
{
swp_entry_t entry;
struct swap_info_struct *si = NULL;
@@ -1196,9 +1267,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
struct mmu_notifier_range range;
int err = 0;
- flush_cache_range(src_vma, src_addr, src_addr + PAGE_SIZE);
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
- src_addr, src_addr + PAGE_SIZE);
+ src_addr, src_addr + len);
mmu_notifier_invalidate_range_start(&range);
retry:
/*
@@ -1257,7 +1327,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES))
err = -ENOENT;
else /* nothing to do to move a hole */
- err = 0;
+ err = PAGE_SIZE;
goto out;
}
@@ -1375,10 +1445,11 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
}
}
- err = move_present_pte(mm, dst_vma, src_vma,
- dst_addr, src_addr, dst_pte, src_pte,
- orig_dst_pte, orig_src_pte, dst_pmd,
- dst_pmdval, dst_ptl, src_ptl, src_folio);
+ err = move_present_ptes(mm, dst_vma, src_vma,
+ dst_addr, src_addr, dst_pte, src_pte,
+ orig_dst_pte, orig_src_pte, dst_pmd,
+ dst_pmdval, dst_ptl, src_ptl, &src_folio,
+ len, src_anon_vma);
} else {
struct folio *folio = NULL;
@@ -1732,7 +1803,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
{
struct mm_struct *mm = ctx->mm;
struct vm_area_struct *src_vma, *dst_vma;
- unsigned long src_addr, dst_addr;
+ unsigned long src_addr, dst_addr, src_end;
pmd_t *src_pmd, *dst_pmd;
long err = -EINVAL;
ssize_t moved = 0;
@@ -1775,8 +1846,8 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
if (err)
goto out_unlock;
- for (src_addr = src_start, dst_addr = dst_start;
- src_addr < src_start + len;) {
+ for (src_addr = src_start, dst_addr = dst_start, src_end = src_start + len;
+ src_addr < src_end;) {
spinlock_t *ptl;
pmd_t dst_pmdval;
unsigned long step_size;
@@ -1841,6 +1912,8 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
dst_addr, src_addr);
step_size = HPAGE_PMD_SIZE;
} else {
+ long ret;
+
if (pmd_none(*src_pmd)) {
if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES)) {
err = -ENOENT;
@@ -1857,10 +1930,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
break;
}
- err = move_pages_pte(mm, dst_pmd, src_pmd,
- dst_vma, src_vma,
- dst_addr, src_addr, mode);
- step_size = PAGE_SIZE;
+ ret = move_pages_ptes(mm, dst_pmd, src_pmd,
+ dst_vma, src_vma, dst_addr,
+ src_addr, src_end - src_addr, mode);
+ if (ret < 0)
+ err = ret;
+ else
+ step_size = ret;
}
cond_resched();
base-commit: 561c80369df0733ba0574882a1635287b20f9de2
--
2.50.1.703.g449372360f-goog
* Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-10 6:29 [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE Lokesh Gidra
@ 2025-08-11 3:55 ` Barry Song
2025-08-12 14:44 ` Peter Xu
2025-08-12 15:50 ` Lokesh Gidra
0 siblings, 2 replies; 9+ messages in thread
From: Barry Song @ 2025-08-11 3:55 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, aarcange, linux-mm, linux-kernel, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand,
Peter Xu
Hi Lokesh,
On Sun, Aug 10, 2025 at 2:29 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> for moving present pages. Mitigate this cost by opportunistically
> batching present contiguous pages for TLB flushing.
>
> Without batching, in our testing on an arm64 Android device with UFFD GC,
> which uses MOVE ioctl for compaction, we observed that out of the total
> time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> ~20% in vm_normal_folio().
>
> With batching, the proportion of vm_normal_folio() increases to over
> 70% of move_pages_pte() without any changes to vm_normal_folio().
> Furthermore, time spent within move_pages_pte() is only ~20%, which
> includes TLB-flush overhead.
>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Kalesh Singh <kaleshsingh@google.com>
> Cc: Barry Song <v-songbaohua@oppo.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> ---
> Changes since v3 [1]
> - Fix uninitialized 'step_size' warning, per Dan Carpenter
> - Removed pmd_none() from check_ptes_for_batched_move(), per Peter Xu
> - Removed flush_cache_range() in zero-page case, per Peter Xu
> - Added comment to explain why folio reference for batched pages is not
> required, per Peter Xu
> - Use MIN() in calculation of largest extent that can be batched under
> the same src and dst PTLs, per Peter Xu
> - Release first folio's reference in move_present_ptes(), per Peter Xu
>
> Changes since v2 [2]
> - Addressed VM_WARN_ON failure, per Lorenzo Stoakes
> - Added check to ensure all batched pages share the same anon_vma
>
> Changes since v1 [3]
> - Removed flush_tlb_batched_pending(), per Barry Song
> - Unified single and multi page case, per Barry Song
>
> [1] https://lore.kernel.org/all/20250807103902.2242717-1-lokeshgidra@google.com/
> [2] https://lore.kernel.org/all/20250805121410.1658418-1-lokeshgidra@google.com/
> [3] https://lore.kernel.org/all/20250731104726.103071-1-lokeshgidra@google.com/
>
> mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
> 1 file changed, 127 insertions(+), 51 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index cbed91b09640..39d81d2972db 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> }
>
> -static int move_present_pte(struct mm_struct *mm,
> - struct vm_area_struct *dst_vma,
> - struct vm_area_struct *src_vma,
> - unsigned long dst_addr, unsigned long src_addr,
> - pte_t *dst_pte, pte_t *src_pte,
> - pte_t orig_dst_pte, pte_t orig_src_pte,
> - pmd_t *dst_pmd, pmd_t dst_pmdval,
> - spinlock_t *dst_ptl, spinlock_t *src_ptl,
> - struct folio *src_folio)
> +/*
> + * Checks if the two ptes and the corresponding folio are eligible for batched
> + * move. If so, then returns pointer to the locked folio. Otherwise, returns NULL.
> + *
> + * NOTE: folio's reference is not required as the whole operation is within
> + * PTL's critical section.
> + */
> +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> + unsigned long src_addr,
> + pte_t *src_pte, pte_t *dst_pte,
> + struct anon_vma *src_anon_vma)
> +{
> + pte_t orig_dst_pte, orig_src_pte;
> + struct folio *folio;
> +
> + orig_dst_pte = ptep_get(dst_pte);
> + if (!pte_none(orig_dst_pte))
> + return NULL;
> +
> + orig_src_pte = ptep_get(src_pte);
> + if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> + return NULL;
> +
> + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> + if (!folio || !folio_trylock(folio))
> + return NULL;
> + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> + folio_anon_vma(folio) != src_anon_vma) {
> + folio_unlock(folio);
> + return NULL;
> + }
> + return folio;
> +}
> +
I’m still quite confused by the code. Before move_present_ptes(), we’ve
already performed all the checks—pte_same(), vm_normal_folio(),
folio_trylock(), folio_test_large(), folio_get_anon_vma(),
and anon_vma_lock_write()—at least for the first PTE. Now we’re
duplicating them again for all PTEs. Does this mean we’re doing those
operations for the first PTE twice? It feels like the old non-batch check
code should be removed?
> +static long move_present_ptes(struct mm_struct *mm,
> + struct vm_area_struct *dst_vma,
> + struct vm_area_struct *src_vma,
> + unsigned long dst_addr, unsigned long src_addr,
> + pte_t *dst_pte, pte_t *src_pte,
> + pte_t orig_dst_pte, pte_t orig_src_pte,
> + pmd_t *dst_pmd, pmd_t dst_pmdval,
> + spinlock_t *dst_ptl, spinlock_t *src_ptl,
> + struct folio **first_src_folio, unsigned long len,
> + struct anon_vma *src_anon_vma)
> {
> int err = 0;
> + struct folio *src_folio = *first_src_folio;
> + unsigned long src_start = src_addr;
> + unsigned long addr_end;
> +
> + if (len > PAGE_SIZE) {
> + addr_end = (dst_addr + PMD_SIZE) & PMD_MASK;
> + len = MIN(addr_end - dst_addr, len);
>
> + addr_end = (src_addr + PMD_SIZE) & PMD_MASK;
> + len = MIN(addr_end - src_addr, len);
> + }
We already have a pmd_addr_end() helper—can we reuse it?
[...]
> /*
> @@ -1257,7 +1327,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES))
> err = -ENOENT;
> else /* nothing to do to move a hole */
> - err = 0;
> + err = PAGE_SIZE;
To be honest, I find `err = PAGE_SIZE` quite odd :-) Could we refine the
code to make it more readable?
[...]
> @@ -1857,10 +1930,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> break;
> }
>
> - err = move_pages_pte(mm, dst_pmd, src_pmd,
> - dst_vma, src_vma,
> - dst_addr, src_addr, mode);
> - step_size = PAGE_SIZE;
> + ret = move_pages_ptes(mm, dst_pmd, src_pmd,
> + dst_vma, src_vma, dst_addr,
> + src_addr, src_end - src_addr, mode);
> + if (ret < 0)
> + err = ret;
> + else
> + step_size = ret;
also looks a bit strange :-)
> }
>
> cond_resched();
>
> base-commit: 561c80369df0733ba0574882a1635287b20f9de2
> --
> 2.50.1.703.g449372360f-goog
>
Thanks
Barry
* Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-11 3:55 ` Barry Song
@ 2025-08-12 14:44 ` Peter Xu
2025-08-12 15:44 ` Lokesh Gidra
2025-08-12 15:50 ` Lokesh Gidra
1 sibling, 1 reply; 9+ messages in thread
From: Peter Xu @ 2025-08-12 14:44 UTC (permalink / raw)
To: Barry Song
Cc: Lokesh Gidra, akpm, aarcange, linux-mm, linux-kernel, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand
On Mon, Aug 11, 2025 at 11:55:36AM +0800, Barry Song wrote:
> Hi Lokesh,
>
>
> On Sun, Aug 10, 2025 at 2:29 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> > for moving present pages. Mitigate this cost by opportunistically
> > batching present contiguous pages for TLB flushing.
> >
> > Without batching, in our testing on an arm64 Android device with UFFD GC,
> > which uses MOVE ioctl for compaction, we observed that out of the total
> > time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> > ~20% in vm_normal_folio().
> >
> > With batching, the proportion of vm_normal_folio() increases to over
> > 70% of move_pages_pte() without any changes to vm_normal_folio().
> > Furthermore, time spent within move_pages_pte() is only ~20%, which
> > includes TLB-flush overhead.
> >
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Kalesh Singh <kaleshsingh@google.com>
> > Cc: Barry Song <v-songbaohua@oppo.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > ---
> > Changes since v3 [1]
> > - Fix uninitialized 'step_size' warning, per Dan Carpenter
> > - Removed pmd_none() from check_ptes_for_batched_move(), per Peter Xu
> > - Removed flush_cache_range() in zero-page case, per Peter Xu
> > - Added comment to explain why folio reference for batched pages is not
> > required, per Peter Xu
> > - Use MIN() in calculation of largest extent that can be batched under
> > the same src and dst PTLs, per Peter Xu
> > - Release first folio's reference in move_present_ptes(), per Peter Xu
> >
> > Changes since v2 [2]
> > - Addressed VM_WARN_ON failure, per Lorenzo Stoakes
> > - Added check to ensure all batched pages share the same anon_vma
> >
> > Changes since v1 [3]
> > - Removed flush_tlb_batched_pending(), per Barry Song
> > - Unified single and multi page case, per Barry Song
> >
> > [1] https://lore.kernel.org/all/20250807103902.2242717-1-lokeshgidra@google.com/
> > [2] https://lore.kernel.org/all/20250805121410.1658418-1-lokeshgidra@google.com/
> > [3] https://lore.kernel.org/all/20250731104726.103071-1-lokeshgidra@google.com/
> >
> > mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
> > 1 file changed, 127 insertions(+), 51 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index cbed91b09640..39d81d2972db 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> > }
> >
> > -static int move_present_pte(struct mm_struct *mm,
> > - struct vm_area_struct *dst_vma,
> > - struct vm_area_struct *src_vma,
> > - unsigned long dst_addr, unsigned long src_addr,
> > - pte_t *dst_pte, pte_t *src_pte,
> > - pte_t orig_dst_pte, pte_t orig_src_pte,
> > - pmd_t *dst_pmd, pmd_t dst_pmdval,
> > - spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > - struct folio *src_folio)
> > +/*
> > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > + * move. If so, then returns pointer to the locked folio. Otherwise, returns NULL.
> > + *
> > + * NOTE: folio's reference is not required as the whole operation is within
> > + * PTL's critical section.
> > + */
> > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > + unsigned long src_addr,
> > + pte_t *src_pte, pte_t *dst_pte,
> > + struct anon_vma *src_anon_vma)
> > +{
> > + pte_t orig_dst_pte, orig_src_pte;
> > + struct folio *folio;
> > +
> > + orig_dst_pte = ptep_get(dst_pte);
> > + if (!pte_none(orig_dst_pte))
> > + return NULL;
> > +
> > + orig_src_pte = ptep_get(src_pte);
> > + if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> > + return NULL;
> > +
> > + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > + if (!folio || !folio_trylock(folio))
> > + return NULL;
> > + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > + folio_anon_vma(folio) != src_anon_vma) {
> > + folio_unlock(folio);
> > + return NULL;
> > + }
> > + return folio;
> > +}
> > +
>
> I’m still quite confused by the code. Before move_present_ptes(), we’ve
> already performed all the checks—pte_same(), vm_normal_folio(),
> folio_trylock(), folio_test_large(), folio_get_anon_vma(),
> and anon_vma_lock_write()—at least for the first PTE. Now we’re
> duplicating them again for all PTEs. Does this mean we’re doing those
> operations for the first PTE twice? It feels like the old non-batch check
> code should be removed?
This function should only start to work on the 2nd (or more) contiguous
ptes to move with the same pgtable lock held. We'll still need the
original path because that one is sleepable, this one isn't, and this is
only a best-effort fast path. E.g. if trylock() fails above, it would
fall back to the slow path.
>
> > +static long move_present_ptes(struct mm_struct *mm,
> > + struct vm_area_struct *dst_vma,
> > + struct vm_area_struct *src_vma,
> > + unsigned long dst_addr, unsigned long src_addr,
> > + pte_t *dst_pte, pte_t *src_pte,
> > + pte_t orig_dst_pte, pte_t orig_src_pte,
> > + pmd_t *dst_pmd, pmd_t dst_pmdval,
> > + spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > + struct folio **first_src_folio, unsigned long len,
> > + struct anon_vma *src_anon_vma)
> > {
> > int err = 0;
> > + struct folio *src_folio = *first_src_folio;
> > + unsigned long src_start = src_addr;
> > + unsigned long addr_end;
> > +
> > + if (len > PAGE_SIZE) {
> > + addr_end = (dst_addr + PMD_SIZE) & PMD_MASK;
> > + len = MIN(addr_end - dst_addr, len);
> >
> > + addr_end = (src_addr + PMD_SIZE) & PMD_MASK;
> > + len = MIN(addr_end - src_addr, len);
> > + }
>
> We already have a pmd_addr_end() helper—can we reuse it?
I agree with Barry; I wanted to say this version didn't use the ALIGN() that I
suggested, but pmd_addr_end() looks better.
Other than that, this version looks good to me (plus the higher-level
performance results updated in the commit message, as requested in v3).
Thanks, Lokesh.
--
Peter Xu
* Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-12 14:44 ` Peter Xu
@ 2025-08-12 15:44 ` Lokesh Gidra
2025-08-13 9:03 ` Barry Song
0 siblings, 1 reply; 9+ messages in thread
From: Lokesh Gidra @ 2025-08-12 15:44 UTC (permalink / raw)
To: Peter Xu
Cc: Barry Song, akpm, aarcange, linux-mm, linux-kernel, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand
On Tue, Aug 12, 2025 at 7:44 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Aug 11, 2025 at 11:55:36AM +0800, Barry Song wrote:
> > Hi Lokesh,
> >
> >
> > On Sun, Aug 10, 2025 at 2:29 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > >
> > > MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> > > for moving present pages. Mitigate this cost by opportunistically
> > > batching present contiguous pages for TLB flushing.
> > >
> > > Without batching, in our testing on an arm64 Android device with UFFD GC,
> > > which uses MOVE ioctl for compaction, we observed that out of the total
> > > time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> > > ~20% in vm_normal_folio().
> > >
> > > With batching, the proportion of vm_normal_folio() increases to over
> > > 70% of move_pages_pte() without any changes to vm_normal_folio().
> > > Furthermore, time spent within move_pages_pte() is only ~20%, which
> > > includes TLB-flush overhead.
> > >
> > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > Cc: Kalesh Singh <kaleshsingh@google.com>
> > > Cc: Barry Song <v-songbaohua@oppo.com>
> > > Cc: David Hildenbrand <david@redhat.com>
> > > Cc: Peter Xu <peterx@redhat.com>
> > > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > > ---
> > > Changes since v3 [1]
> > > - Fix uninitialized 'step_size' warning, per Dan Carpenter
> > > - Removed pmd_none() from check_ptes_for_batched_move(), per Peter Xu
> > > - Removed flush_cache_range() in zero-page case, per Peter Xu
> > > - Added comment to explain why folio reference for batched pages is not
> > > required, per Peter Xu
> > > - Use MIN() in calculation of largest extent that can be batched under
> > > the same src and dst PTLs, per Peter Xu
> > > - Release first folio's reference in move_present_ptes(), per Peter Xu
> > >
> > > Changes since v2 [2]
> > > - Addressed VM_WARN_ON failure, per Lorenzo Stoakes
> > > - Added check to ensure all batched pages share the same anon_vma
> > >
> > > Changes since v1 [3]
> > > - Removed flush_tlb_batched_pending(), per Barry Song
> > > - Unified single and multi page case, per Barry Song
> > >
> > > [1] https://lore.kernel.org/all/20250807103902.2242717-1-lokeshgidra@google.com/
> > > [2] https://lore.kernel.org/all/20250805121410.1658418-1-lokeshgidra@google.com/
> > > [3] https://lore.kernel.org/all/20250731104726.103071-1-lokeshgidra@google.com/
> > >
> > > mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
> > > 1 file changed, 127 insertions(+), 51 deletions(-)
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index cbed91b09640..39d81d2972db 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > > pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> > > }
> > >
> > > -static int move_present_pte(struct mm_struct *mm,
> > > - struct vm_area_struct *dst_vma,
> > > - struct vm_area_struct *src_vma,
> > > - unsigned long dst_addr, unsigned long src_addr,
> > > - pte_t *dst_pte, pte_t *src_pte,
> > > - pte_t orig_dst_pte, pte_t orig_src_pte,
> > > - pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > - spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > - struct folio *src_folio)
> > > +/*
> > > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > > + * move. If so, then returns pointer to the locked folio. Otherwise, returns NULL.
> > > + *
> > > + * NOTE: folio's reference is not required as the whole operation is within
> > > + * PTL's critical section.
> > > + */
> > > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > > + unsigned long src_addr,
> > > + pte_t *src_pte, pte_t *dst_pte,
> > > + struct anon_vma *src_anon_vma)
> > > +{
> > > + pte_t orig_dst_pte, orig_src_pte;
> > > + struct folio *folio;
> > > +
> > > + orig_dst_pte = ptep_get(dst_pte);
> > > + if (!pte_none(orig_dst_pte))
> > > + return NULL;
> > > +
> > > + orig_src_pte = ptep_get(src_pte);
> > > + if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> > > + return NULL;
> > > +
> > > + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > > + if (!folio || !folio_trylock(folio))
> > > + return NULL;
> > > + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > > + folio_anon_vma(folio) != src_anon_vma) {
> > > + folio_unlock(folio);
> > > + return NULL;
> > > + }
> > > + return folio;
> > > +}
> > > +
> >
> > I’m still quite confused by the code. Before move_present_ptes(), we’ve
> > already performed all the checks—pte_same(), vm_normal_folio(),
> > folio_trylock(), folio_test_large(), folio_get_anon_vma(),
> > and anon_vma_lock_write()—at least for the first PTE. Now we’re
> > duplicating them again for all PTEs. Does this mean we’re doing those
> > operations for the first PTE twice? It feels like the old non-batch check
> > code should be removed?
>
> This function should only start to work on the 2nd (or more) contiguous
> ptes to move with the same pgtable lock held. We'll still need the
> original path because that one is sleepable, this one isn't, and this is
> only a best-effort fast path. E.g. if trylock() fails above, it would
> fall back to the slow path.
>
Thanks Peter. I was about to give exactly the same reasoning :)
> >
> > > +static long move_present_ptes(struct mm_struct *mm,
> > > + struct vm_area_struct *dst_vma,
> > > + struct vm_area_struct *src_vma,
> > > + unsigned long dst_addr, unsigned long src_addr,
> > > + pte_t *dst_pte, pte_t *src_pte,
> > > + pte_t orig_dst_pte, pte_t orig_src_pte,
> > > + pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > + spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > + struct folio **first_src_folio, unsigned long len,
> > > + struct anon_vma *src_anon_vma)
> > > {
> > > int err = 0;
> > > + struct folio *src_folio = *first_src_folio;
> > > + unsigned long src_start = src_addr;
> > > + unsigned long addr_end;
> > > +
> > > + if (len > PAGE_SIZE) {
> > > + addr_end = (dst_addr + PMD_SIZE) & PMD_MASK;
> > > + len = MIN(addr_end - dst_addr, len);
> > >
> > > + addr_end = (src_addr + PMD_SIZE) & PMD_MASK;
> > > + len = MIN(addr_end - src_addr, len);
> > > + }
> >
> > We already have a pmd_addr_end() helper—can we reuse it?
>
> I agree with Barry; I wanted to say this version didn't use the ALIGN() that I
> suggested, but pmd_addr_end() looks better.
ALIGN() couldn't be used as we are calculating "how many bytes to the
next PMD" and not just align it. Anyways, pmd_addr_end() is definitely
better. Will do it in the next patch.
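Roughly, the clamp would then become (untested sketch; pmd_addr_end()
already caps at the end address we pass in):

        if (len > PAGE_SIZE) {
                /* don't let the batch cross a PMD boundary on either side */
                len = pmd_addr_end(dst_addr, dst_addr + len) - dst_addr;
                len = pmd_addr_end(src_addr, src_addr + len) - src_addr;
        }

which should be equivalent to the MIN() pair in v4.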
>
> Other than that, this version looks good to me (plus the higher-level
> performance results updated in the commit message, as requested in v3).
> Thanks, Lokesh.
Thanks Peter. I'll update the commit message in v5.
>
> --
> Peter Xu
>
* Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-11 3:55 ` Barry Song
2025-08-12 14:44 ` Peter Xu
@ 2025-08-12 15:50 ` Lokesh Gidra
2025-08-13 9:29 ` Barry Song
1 sibling, 1 reply; 9+ messages in thread
From: Lokesh Gidra @ 2025-08-12 15:50 UTC (permalink / raw)
To: Barry Song
Cc: akpm, aarcange, linux-mm, linux-kernel, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand,
Peter Xu
On Sun, Aug 10, 2025 at 8:55 PM Barry Song <21cnbao@gmail.com> wrote:
>
> Hi Lokesh,
>
>
> On Sun, Aug 10, 2025 at 2:29 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> > for moving present pages. Mitigate this cost by opportunistically
> > batching present contiguous pages for TLB flushing.
> >
> > Without batching, in our testing on an arm64 Android device with UFFD GC,
> > which uses MOVE ioctl for compaction, we observed that out of the total
> > time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> > ~20% in vm_normal_folio().
> >
> > With batching, the proportion of vm_normal_folio() increases to over
> > 70% of move_pages_pte() without any changes to vm_normal_folio().
> > Furthermore, time spent within move_pages_pte() is only ~20%, which
> > includes TLB-flush overhead.
> >
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Kalesh Singh <kaleshsingh@google.com>
> > Cc: Barry Song <v-songbaohua@oppo.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > ---
> > Changes since v3 [1]
> > - Fix uninitialized 'step_size' warning, per Dan Carpenter
> > - Removed pmd_none() from check_ptes_for_batched_move(), per Peter Xu
> > - Removed flush_cache_range() in zero-page case, per Peter Xu
> > - Added comment to explain why folio reference for batched pages is not
> > required, per Peter Xu
> > - Use MIN() in calculation of largest extent that can be batched under
> > the same src and dst PTLs, per Peter Xu
> > - Release first folio's reference in move_present_ptes(), per Peter Xu
> >
> > Changes since v2 [2]
> > - Addressed VM_WARN_ON failure, per Lorenzo Stoakes
> > - Added check to ensure all batched pages share the same anon_vma
> >
> > Changes since v1 [3]
> > - Removed flush_tlb_batched_pending(), per Barry Song
> > - Unified single and multi page case, per Barry Song
> >
> > [1] https://lore.kernel.org/all/20250807103902.2242717-1-lokeshgidra@google.com/
> > [2] https://lore.kernel.org/all/20250805121410.1658418-1-lokeshgidra@google.com/
> > [3] https://lore.kernel.org/all/20250731104726.103071-1-lokeshgidra@google.com/
> >
> > mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
> > 1 file changed, 127 insertions(+), 51 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index cbed91b09640..39d81d2972db 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> > }
> >
> > -static int move_present_pte(struct mm_struct *mm,
> > - struct vm_area_struct *dst_vma,
> > - struct vm_area_struct *src_vma,
> > - unsigned long dst_addr, unsigned long src_addr,
> > - pte_t *dst_pte, pte_t *src_pte,
> > - pte_t orig_dst_pte, pte_t orig_src_pte,
> > - pmd_t *dst_pmd, pmd_t dst_pmdval,
> > - spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > - struct folio *src_folio)
> > +/*
> > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > + * move. If so, then returns pointer to the locked folio. Otherwise, returns NULL.
> > + *
> > + * NOTE: folio's reference is not required as the whole operation is within
> > + * PTL's critical section.
> > + */
> > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > + unsigned long src_addr,
> > + pte_t *src_pte, pte_t *dst_pte,
> > + struct anon_vma *src_anon_vma)
> > +{
> > + pte_t orig_dst_pte, orig_src_pte;
> > + struct folio *folio;
> > +
> > + orig_dst_pte = ptep_get(dst_pte);
> > + if (!pte_none(orig_dst_pte))
> > + return NULL;
> > +
> > + orig_src_pte = ptep_get(src_pte);
> > + if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> > + return NULL;
> > +
> > + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > + if (!folio || !folio_trylock(folio))
> > + return NULL;
> > + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > + folio_anon_vma(folio) != src_anon_vma) {
> > + folio_unlock(folio);
> > + return NULL;
> > + }
> > + return folio;
> > +}
> > +
>
> I’m still quite confused by the code. Before move_present_ptes(), we’ve
> already performed all the checks—pte_same(), vm_normal_folio(),
> folio_trylock(), folio_test_large(), folio_get_anon_vma(),
> and anon_vma_lock_write()—at least for the first PTE. Now we’re
> duplicating them again for all PTEs. Does this mean we’re doing those
> operations for the first PTE twice? It feels like the old non-batch check
> code should be removed?
>
> > +static long move_present_ptes(struct mm_struct *mm,
> > + struct vm_area_struct *dst_vma,
> > + struct vm_area_struct *src_vma,
> > + unsigned long dst_addr, unsigned long src_addr,
> > + pte_t *dst_pte, pte_t *src_pte,
> > + pte_t orig_dst_pte, pte_t orig_src_pte,
> > + pmd_t *dst_pmd, pmd_t dst_pmdval,
> > + spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > + struct folio **first_src_folio, unsigned long len,
> > + struct anon_vma *src_anon_vma)
> > {
> > int err = 0;
> > + struct folio *src_folio = *first_src_folio;
> > + unsigned long src_start = src_addr;
> > + unsigned long addr_end;
> > +
> > + if (len > PAGE_SIZE) {
> > + addr_end = (dst_addr + PMD_SIZE) & PMD_MASK;
> > + len = MIN(addr_end - dst_addr, len);
> >
> > + addr_end = (src_addr + PMD_SIZE) & PMD_MASK;
> > + len = MIN(addr_end - src_addr, len);
> > + }
>
> We already have a pmd_addr_end() helper—can we reuse it?
>
> [...]
>
> > /*
> > @@ -1257,7 +1327,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES))
> > err = -ENOENT;
> > else /* nothing to do to move a hole */
> > - err = 0;
> > + err = PAGE_SIZE;
>
> To be honest, I find `err = PAGE_SIZE` quite odd :-) Could we refine the
> code to make it more readable?
>
Agreed! I'll replace 'err' with 'ret' as the function no longer returns
only an error; on success it returns the number of bytes moved.
> [...]
>
> > @@ -1857,10 +1930,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > break;
> > }
> >
> > - err = move_pages_pte(mm, dst_pmd, src_pmd,
> > - dst_vma, src_vma,
> > - dst_addr, src_addr, mode);
> > - step_size = PAGE_SIZE;
> > + ret = move_pages_ptes(mm, dst_pmd, src_pmd,
> > + dst_vma, src_vma, dst_addr,
> > + src_addr, src_end - src_addr, mode);
> > + if (ret < 0)
> > + err = ret;
> > + else
> > + step_size = ret;
>
> also looks a bit strange :-)
Any suggestions on how to improve this? 'step_size' is expected to be
different in each iteration of the loop even without this patch.
>
> > }
> >
> > cond_resched();
> >
> > base-commit: 561c80369df0733ba0574882a1635287b20f9de2
> > --
> > 2.50.1.703.g449372360f-goog
> >
>
> Thanks
> Barry
* Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-12 15:44 ` Lokesh Gidra
@ 2025-08-13 9:03 ` Barry Song
2025-08-13 14:12 ` Lokesh Gidra
0 siblings, 1 reply; 9+ messages in thread
From: Barry Song @ 2025-08-13 9:03 UTC (permalink / raw)
To: Lokesh Gidra
Cc: Peter Xu, akpm, aarcange, linux-mm, linux-kernel, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand
On Tue, Aug 12, 2025 at 11:44 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Tue, Aug 12, 2025 at 7:44 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Aug 11, 2025 at 11:55:36AM +0800, Barry Song wrote:
> > > Hi Lokesh,
[...]
> > > >
> > > > mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
> > > > 1 file changed, 127 insertions(+), 51 deletions(-)
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index cbed91b09640..39d81d2972db 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > > > pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> > > > }
> > > >
> > > > -static int move_present_pte(struct mm_struct *mm,
> > > > - struct vm_area_struct *dst_vma,
> > > > - struct vm_area_struct *src_vma,
> > > > - unsigned long dst_addr, unsigned long src_addr,
> > > > - pte_t *dst_pte, pte_t *src_pte,
> > > > - pte_t orig_dst_pte, pte_t orig_src_pte,
> > > > - pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > > - spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > - struct folio *src_folio)
> > > > +/*
> > > > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > > > + * move. If so, then returns pointer to the locked folio. Otherwise, returns NULL.
> > > > + *
> > > > + * NOTE: folio's reference is not required as the whole operation is within
> > > > + * PTL's critical section.
> > > > + */
> > > > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > > > + unsigned long src_addr,
> > > > + pte_t *src_pte, pte_t *dst_pte,
> > > > + struct anon_vma *src_anon_vma)
> > > > +{
> > > > + pte_t orig_dst_pte, orig_src_pte;
> > > > + struct folio *folio;
> > > > +
> > > > + orig_dst_pte = ptep_get(dst_pte);
> > > > + if (!pte_none(orig_dst_pte))
> > > > + return NULL;
> > > > +
> > > > + orig_src_pte = ptep_get(src_pte);
> > > > + if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> > > > + return NULL;
> > > > +
> > > > + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > > > + if (!folio || !folio_trylock(folio))
> > > > + return NULL;
> > > > + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > > > + folio_anon_vma(folio) != src_anon_vma) {
> > > > + folio_unlock(folio);
> > > > + return NULL;
> > > > + }
> > > > + return folio;
> > > > +}
> > > > +
> > >
> > > I’m still quite confused by the code. Before move_present_ptes(), we’ve
> > > already performed all the checks—pte_same(), vm_normal_folio(),
> > > folio_trylock(), folio_test_large(), folio_get_anon_vma(),
> > > and anon_vma_lock_write()—at least for the first PTE. Now we’re
> > > duplicating them again for all PTEs. Does this mean we’re doing those
> > > operations for the first PTE twice? It feels like the old non-batch check
> > > code should be removed?
> >
> > This function should only start to work on the 2nd (or more) contiguous
> > ptes to move with the same pgtable lock held. We'll still need the
> > original path because that one is sleepable, this one isn't, and this is
> > only a best-effort fast path. E.g. if trylock() fails above, it would
> > fall back to the slow path.
> >
> Thanks Peter. I was about to give exactly the same reasoning :)
Apologies, I overlooked this part:
src_addr += PAGE_SIZE;
if (src_addr == addr_end)
break;
dst_addr += PAGE_SIZE;
dst_pte++;
src_pte++;
folio_unlock(src_folio);
src_folio = check_ptes_for_batched_move(src_vma,
src_addr, src_pte,
dst_pte, src_anon_vma);
I still find this a little tricky to follow — couldn’t we just handle it
like the other batched cases:
static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
struct page_vma_mapped_walk *pvmw,
enum ttu_flags flags, pte_t pte)
We pass the first PTE and use a function to determine how many PTEs we
can batch together. That way, we don’t need a special path for the first
PTE.
I guess the challenge is that the first PTE needs to handle
split_folio(), folio_trylock() with -EAGAIN, and
anon_vma_trylock_write(), while the other PTEs don’t?
If so, could we add a clear comment explaining that move_present_ptes()
moves PTEs that share the same anon_vma as the first PTE, are not large
folios, and can successfully take folio_trylock()?
If this condition isn’t met, the batch stops.
Thanks
Barry
* Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-12 15:50 ` Lokesh Gidra
@ 2025-08-13 9:29 ` Barry Song
2025-08-13 14:15 ` Lokesh Gidra
0 siblings, 1 reply; 9+ messages in thread
From: Barry Song @ 2025-08-13 9:29 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, aarcange, linux-mm, linux-kernel, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand,
Peter Xu
On Tue, Aug 12, 2025 at 11:50 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > [...]
> >
> > > /*
> > > @@ -1257,7 +1327,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES))
> > > err = -ENOENT;
> > > else /* nothing to do to move a hole */
> > > - err = 0;
> > > + err = PAGE_SIZE;
> >
> > To be honest, I find `err = PAGE_SIZE` quite odd :-) Could we refine the
> > code to make it more readable?
> >
> Agreed! I'll replace 'err' with 'ret' as the function no longer returns
> only an error; on success it returns the number of bytes moved.
>
Looks good. Should we also include the following?
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1246,6 +1246,7 @@ static int move_zeropage_pte(struct mm_struct *mm,
/*
* The mmap_lock for reading is held by the caller. Just move the page(s)
* from src_pmd to dst_pmd if possible, and return number of bytes moved.
+ * On failure, an error code is returned instead.
*/
static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd,
pmd_t *src_pmd,
struct vm_area_struct *dst_vma,
> > [...]
> >
> > > @@ -1857,10 +1930,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > > break;
> > > }
> > >
> > > - err = move_pages_pte(mm, dst_pmd, src_pmd,
> > > - dst_vma, src_vma,
> > > - dst_addr, src_addr, mode);
> > > - step_size = PAGE_SIZE;
> > > + ret = move_pages_ptes(mm, dst_pmd, src_pmd,
> > > + dst_vma, src_vma, dst_addr,
> > > + src_addr, src_end - src_addr, mode);
> > > + if (ret < 0)
> > > + err = ret;
> > > + else
> > > + step_size = ret;
> >
> > also looks a bit strange :-)
>
> Any suggestions on how to improve this? 'step_size' is expected to be
> different in each iteration of the loop even without this patch.
Usually, we have:
if (ret < 0) {
goto or break things;
}
step_size = ret;
Given the context, it does seem quite tricky to handle. I’m not sure,
so maybe your code is fine. :-)
> >
> > > }
> > >
> > > cond_resched();
> > >
> > > base-commit: 561c80369df0733ba0574882a1635287b20f9de2
> > > --
> > > 2.50.1.703.g449372360f-goog
Thanks
Barry
* Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-13 9:03 ` Barry Song
@ 2025-08-13 14:12 ` Lokesh Gidra
0 siblings, 0 replies; 9+ messages in thread
From: Lokesh Gidra @ 2025-08-13 14:12 UTC (permalink / raw)
To: Barry Song
Cc: Peter Xu, akpm, aarcange, linux-mm, linux-kernel, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand
On Wed, Aug 13, 2025 at 2:03 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Aug 12, 2025 at 11:44 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > On Tue, Aug 12, 2025 at 7:44 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Mon, Aug 11, 2025 at 11:55:36AM +0800, Barry Song wrote:
> > > > Hi Lokesh,
> [...]
> > > > >
> > > > > mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
> > > > > 1 file changed, 127 insertions(+), 51 deletions(-)
> > > > >
> > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > index cbed91b09640..39d81d2972db 100644
> > > > > --- a/mm/userfaultfd.c
> > > > > +++ b/mm/userfaultfd.c
> > > > > @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > > > > pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> > > > > }
> > > > >
> > > > > -static int move_present_pte(struct mm_struct *mm,
> > > > > - struct vm_area_struct *dst_vma,
> > > > > - struct vm_area_struct *src_vma,
> > > > > - unsigned long dst_addr, unsigned long src_addr,
> > > > > - pte_t *dst_pte, pte_t *src_pte,
> > > > > - pte_t orig_dst_pte, pte_t orig_src_pte,
> > > > > - pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > > > - spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > > - struct folio *src_folio)
> > > > > +/*
> > > > > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > > > > + * move. If so, then returns pointer to the locked folio. Otherwise, returns NULL.
> > > > > + *
> > > > > + * NOTE: folio's reference is not required as the whole operation is within
> > > > > + * PTL's critical section.
> > > > > + */
> > > > > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > > > > + unsigned long src_addr,
> > > > > + pte_t *src_pte, pte_t *dst_pte,
> > > > > + struct anon_vma *src_anon_vma)
> > > > > +{
> > > > > + pte_t orig_dst_pte, orig_src_pte;
> > > > > + struct folio *folio;
> > > > > +
> > > > > + orig_dst_pte = ptep_get(dst_pte);
> > > > > + if (!pte_none(orig_dst_pte))
> > > > > + return NULL;
> > > > > +
> > > > > + orig_src_pte = ptep_get(src_pte);
> > > > > + if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> > > > > + return NULL;
> > > > > +
> > > > > + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > > > > + if (!folio || !folio_trylock(folio))
> > > > > + return NULL;
> > > > > + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > > > > + folio_anon_vma(folio) != src_anon_vma) {
> > > > > + folio_unlock(folio);
> > > > > + return NULL;
> > > > > + }
> > > > > + return folio;
> > > > > +}
> > > > > +
> > > >
> > > > I’m still quite confused by the code. Before move_present_ptes(), we’ve
> > > > already performed all the checks—pte_same(), vm_normal_folio(),
> > > > folio_trylock(), folio_test_large(), folio_get_anon_vma(),
> > > > and anon_vma_lock_write()—at least for the first PTE. Now we’re
> > > > duplicating them again for all PTEs. Does this mean we’re doing those
> > > > operations for the first PTE twice? It feels like the old non-batch check
> > > > code should be removed?
> > >
> > > This function should only start to work on the 2nd (or more) continuous
> > > ptes to move within the same pgtable lock held. We'll still need the
> > > original path because that was sleepable, this one isn't, and it's only
> > > best-effort fast path only. E.g. if trylock() fails above, it would
> > > fallback to the slow path.
> > >
> > Thanks Peter. I was about to give exactly the same reasoning :)
>
> Apologies, I overlooked this part:
> src_addr += PAGE_SIZE;
> if (src_addr == addr_end)
> break;
> dst_addr += PAGE_SIZE;
> dst_pte++;
> src_pte++;
> folio_unlock(src_folio);
> src_folio = check_ptes_for_batched_move(src_vma,
> src_addr, src_pte,
> dst_pte, src_anon_vma);
>
> I still find this a little tricky to follow — couldn’t we just handle it
> like the other batched cases:
>
> static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> struct page_vma_mapped_walk *pvmw,
> enum ttu_flags flags, pte_t pte)
>
> We pass the first PTE and use a function to determine how many PTEs we
> can batch together. That way, we don’t need a special path for the first
> PTE.
>
> I guess the challenge is that the first PTE needs to handle
> split_folio(), folio_trylock() with -EAGAIN, and
> anon_vma_trylock_write(), while the other PTEs don’t?
That's right. We need to keep the complicated dance in
move_pages_pte(). That's why, unfortunately, we can't unify it the way
you are hoping.
>
> If so, could we add a clear comment explaining that move_present_ptes()
> moves PTEs that share the same anon_vma as the first PTE, are not large
> folios, and can successfully take folio_trylock()?
> If this condition isn’t met, the batch stops.
Certainly. I'll add a descriptive comment for move_present_ptes() to
explain this.
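Something along these lines (the wording is only a draft):

        /*
         * Moves the pages mapped by present ptes in [src_addr, src_addr +
         * len) into the corresponding (empty) dst ptes, batching them
         * under a single pair of PTLs and a single TLB flush.  Only the
         * first folio has been validated and locked by the caller; each
         * subsequent pte joins the batch only if its dst pte is none and
         * it maps a small PageAnonExclusive folio that shares the first
         * folio's anon_vma and whose trylock succeeds.  The batch stops
         * at the first pte failing these checks, or at a PMD boundary.
         * Returns the number of bytes moved, or an error if not even the
         * first page could be moved.
         */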
>
> Thanks
> Barry
* Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-13 9:29 ` Barry Song
@ 2025-08-13 14:15 ` Lokesh Gidra
0 siblings, 0 replies; 9+ messages in thread
From: Lokesh Gidra @ 2025-08-13 14:15 UTC (permalink / raw)
To: Barry Song
Cc: akpm, aarcange, linux-mm, linux-kernel, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand,
Peter Xu
On Wed, Aug 13, 2025 at 2:29 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Aug 12, 2025 at 11:50 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> > > [...]
> > >
> > > > /*
> > > > @@ -1257,7 +1327,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES))
> > > > err = -ENOENT;
> > > > else /* nothing to do to move a hole */
> > > > - err = 0;
> > > > + err = PAGE_SIZE;
> > >
> > > To be honest, I find `err = PAGE_SIZE` quite odd :-) Could we refine the
> > > code to make it more readable?
> > >
> > Agreed! I'll replace 'err' with 'ret' as the function no longer only
> > returns error but also bytes-moved if there is no error.
> >
>
> Looks good. Should we also include the following?
>
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1246,6 +1246,7 @@ static int move_zeropage_pte(struct mm_struct *mm,
> /*
> * The mmap_lock for reading is held by the caller. Just move the page(s)
> * from src_pmd to dst_pmd if possible, and return number of bytes moved.
> + * On failure, an error code is returned instead.
> */
> static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd,
> pmd_t *src_pmd,
> struct vm_area_struct *dst_vma,
>
>
Of course. I'll add this.
> > > [...]
> > >
> > > > @@ -1857,10 +1930,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > > > break;
> > > > }
> > > >
> > > > - err = move_pages_pte(mm, dst_pmd, src_pmd,
> > > > - dst_vma, src_vma,
> > > > - dst_addr, src_addr, mode);
> > > > - step_size = PAGE_SIZE;
> > > > + ret = move_pages_ptes(mm, dst_pmd, src_pmd,
> > > > + dst_vma, src_vma, dst_addr,
> > > > + src_addr, src_end - src_addr, mode);
> > > > + if (ret < 0)
> > > > + err = ret;
> > > > + else
> > > > + step_size = ret;
> > >
> > > also looks a bit strange :-)
> >
> > Any suggestions on how to improve this? 'step_size' is expected to be
> > different in each iteration of the loop even without this patch.
>
> Usually, we have:
>
> if (ret < 0) {
> goto or break things;
> }
> step_size = ret;
>
> Given the context, it does seem quite tricky to handle. I’m not sure,
> so maybe your code is fine. :-)
Yeah, the special handling for -EAGAIN warrants the current
implementation. I'll keep it as is then.
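For the record, the surrounding loop looks roughly like this (paraphrased
from memory, not a verbatim copy):

        ret = move_pages_ptes(mm, dst_pmd, src_pmd, dst_vma, src_vma,
                              dst_addr, src_addr, src_end - src_addr, mode);
        if (ret < 0)
                err = ret;              /* may be -EAGAIN */
        else
                step_size = ret;        /* bytes moved this iteration */
        ...
        cond_resched();

        if (unlikely(err == -EAGAIN))
                continue;               /* revalidate and retry this address */
        if (err)
                break;

        dst_addr += step_size;
        src_addr += step_size;
        moved += step_size;

so a plain 'if (ret < 0) break;' would lose the -EAGAIN retry.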
Thanks
>
> > >
> > > > }
> > > >
> > > > cond_resched();
> > > >
> > > > base-commit: 561c80369df0733ba0574882a1635287b20f9de2
> > > > --
> > > > 2.50.1.703.g449372360f-goog
>
> Thanks
> Barry
End of thread.
Thread overview: 9+ messages
2025-08-10 6:29 [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE Lokesh Gidra
2025-08-11 3:55 ` Barry Song
2025-08-12 14:44 ` Peter Xu
2025-08-12 15:44 ` Lokesh Gidra
2025-08-13 9:03 ` Barry Song
2025-08-13 14:12 ` Lokesh Gidra
2025-08-12 15:50 ` Lokesh Gidra
2025-08-13 9:29 ` Barry Song
2025-08-13 14:15 ` Lokesh Gidra