* [PATCH v3] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
@ 2025-08-07 10:39 Lokesh Gidra
2025-08-07 19:16 ` Peter Xu
0 siblings, 1 reply; 8+ messages in thread
From: Lokesh Gidra @ 2025-08-07 10:39 UTC (permalink / raw)
To: akpm
Cc: aarcange, linux-mm, linux-kernel, 21cnbao, ngeoffray,
Lokesh Gidra, Suren Baghdasaryan, Kalesh Singh, Barry Song,
David Hildenbrand, Peter Xu
MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
for moving present pages. Mitigate this cost by opportunistically
batching present contiguous pages for TLB flushing.
Without batching, in our testing on an arm64 Android device with UFFD GC,
which uses MOVE ioctl for compaction, we observed that out of the total
time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
~20% in vm_normal_folio().
With batching, the proportion of vm_normal_folio() increases to over
70% of move_pages_pte() without any changes to vm_normal_folio().
Furthermore, time spent within move_pages_pte() is only ~20%, which
includes TLB-flush overhead.
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
---
Changes since v2 [1]
- Addressed VM_WARN_ON failure, per Lorenzo Stoakes
- Added check to ensure all batched pages share the same anon_vma
Changes since v1 [2]
- Removed flush_tlb_batched_pending(), per Barry Song
- Unified single and multi page case, per Barry Song
[1] https://lore.kernel.org/all/20250805121410.1658418-1-lokeshgidra@google.com/
[2] https://lore.kernel.org/all/20250731104726.103071-1-lokeshgidra@google.com/
mm/userfaultfd.c | 179 +++++++++++++++++++++++++++++++++--------------
1 file changed, 128 insertions(+), 51 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index cbed91b09640..78c732100aec 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
}
-static int move_present_pte(struct mm_struct *mm,
- struct vm_area_struct *dst_vma,
- struct vm_area_struct *src_vma,
- unsigned long dst_addr, unsigned long src_addr,
- pte_t *dst_pte, pte_t *src_pte,
- pte_t orig_dst_pte, pte_t orig_src_pte,
- pmd_t *dst_pmd, pmd_t dst_pmdval,
- spinlock_t *dst_ptl, spinlock_t *src_ptl,
- struct folio *src_folio)
+/*
+ * Checks if the two ptes and the corresponding folio are eligible for batched
+ * move. If so, then returns pointer to the folio, after locking it. Otherwise,
+ * returns NULL.
+ */
+static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
+ unsigned long src_addr,
+ pte_t *src_pte, pte_t *dst_pte,
+ struct anon_vma *src_anon_vma)
+{
+ pte_t orig_dst_pte, orig_src_pte;
+ struct folio *folio;
+
+ orig_dst_pte = ptep_get(dst_pte);
+ if (!pte_none(orig_dst_pte))
+ return NULL;
+
+ orig_src_pte = ptep_get(src_pte);
+ if (pte_none(orig_src_pte) || !pte_present(orig_src_pte) ||
+ is_zero_pfn(pte_pfn(orig_src_pte)))
+ return NULL;
+
+ folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
+ if (!folio || !folio_trylock(folio))
+ return NULL;
+ if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
+ folio_anon_vma(folio) != src_anon_vma) {
+ folio_unlock(folio);
+ return NULL;
+ }
+ return folio;
+}
+
+static long move_present_ptes(struct mm_struct *mm,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma,
+ unsigned long dst_addr, unsigned long src_addr,
+ pte_t *dst_pte, pte_t *src_pte,
+ pte_t orig_dst_pte, pte_t orig_src_pte,
+ pmd_t *dst_pmd, pmd_t dst_pmdval,
+ spinlock_t *dst_ptl, spinlock_t *src_ptl,
+ struct folio *src_folio, unsigned long len,
+ struct anon_vma *src_anon_vma)
{
int err = 0;
+ unsigned long src_start = src_addr;
+ unsigned long addr_end;
+ if (len > PAGE_SIZE) {
+ addr_end = (dst_addr + PMD_SIZE) & PMD_MASK;
+ if (dst_addr + len > addr_end)
+ len = addr_end - dst_addr;
+
+ addr_end = (src_addr + PMD_SIZE) & PMD_MASK;
+ if (src_addr + len > addr_end)
+ len = addr_end - src_addr;
+ }
+ flush_cache_range(src_vma, src_addr, src_addr + len);
double_pt_lock(dst_ptl, src_ptl);
if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1051,31 +1097,54 @@ static int move_present_pte(struct mm_struct *mm,
err = -EBUSY;
goto out;
}
+ arch_enter_lazy_mmu_mode();
+
+ addr_end = src_start + len;
+ while (true) {
+ orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
+ /* Folio got pinned from under us. Put it back and fail the move. */
+ if (folio_maybe_dma_pinned(src_folio)) {
+ set_pte_at(mm, src_addr, src_pte, orig_src_pte);
+ err = -EBUSY;
+ break;
+ }
- orig_src_pte = ptep_clear_flush(src_vma, src_addr, src_pte);
- /* Folio got pinned from under us. Put it back and fail the move. */
- if (folio_maybe_dma_pinned(src_folio)) {
- set_pte_at(mm, src_addr, src_pte, orig_src_pte);
- err = -EBUSY;
- goto out;
- }
-
- folio_move_anon_rmap(src_folio, dst_vma);
- src_folio->index = linear_page_index(dst_vma, dst_addr);
+ folio_move_anon_rmap(src_folio, dst_vma);
+ src_folio->index = linear_page_index(dst_vma, dst_addr);
- orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
- /* Set soft dirty bit so userspace can notice the pte was moved */
+ orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
+ /* Set soft dirty bit so userspace can notice the pte was moved */
#ifdef CONFIG_MEM_SOFT_DIRTY
- orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
+ orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
#endif
- if (pte_dirty(orig_src_pte))
- orig_dst_pte = pte_mkdirty(orig_dst_pte);
- orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
+ if (pte_dirty(orig_src_pte))
+ orig_dst_pte = pte_mkdirty(orig_dst_pte);
+ orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
+ set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
+
+ src_addr += PAGE_SIZE;
+ if (src_addr == addr_end)
+ break;
+ src_pte++;
+ dst_pte++;
+
+ folio_unlock(src_folio);
+ src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
+ dst_pte, src_anon_vma);
+ if (!src_folio)
+ break;
+ dst_addr += PAGE_SIZE;
+ }
+
+ arch_leave_lazy_mmu_mode();
+ if (src_addr > src_start)
+ flush_tlb_range(src_vma, src_start, src_addr);
- set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
out:
double_pt_unlock(dst_ptl, src_ptl);
- return err;
+ if (src_folio)
+ folio_unlock(src_folio);
+ return src_addr > src_start ? src_addr - src_start : err;
}
static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
@@ -1140,7 +1209,7 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
double_pt_unlock(dst_ptl, src_ptl);
- return 0;
+ return PAGE_SIZE;
}
static int move_zeropage_pte(struct mm_struct *mm,
@@ -1154,6 +1223,7 @@ static int move_zeropage_pte(struct mm_struct *mm,
{
pte_t zero_pte;
+ flush_cache_range(src_vma, src_addr, src_addr + PAGE_SIZE);
double_pt_lock(dst_ptl, src_ptl);
if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
dst_pmd, dst_pmdval)) {
@@ -1167,20 +1237,19 @@ static int move_zeropage_pte(struct mm_struct *mm,
set_pte_at(mm, dst_addr, dst_pte, zero_pte);
double_pt_unlock(dst_ptl, src_ptl);
- return 0;
+ return PAGE_SIZE;
}
/*
- * The mmap_lock for reading is held by the caller. Just move the page
- * from src_pmd to dst_pmd if possible, and return true if succeeded
- * in moving the page.
+ * The mmap_lock for reading is held by the caller. Just move the page(s)
+ * from src_pmd to dst_pmd if possible, and return number of bytes moved.
*/
-static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
- struct vm_area_struct *dst_vma,
- struct vm_area_struct *src_vma,
- unsigned long dst_addr, unsigned long src_addr,
- __u64 mode)
+static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma,
+ unsigned long dst_addr, unsigned long src_addr,
+ unsigned long len, __u64 mode)
{
swp_entry_t entry;
struct swap_info_struct *si = NULL;
@@ -1196,9 +1265,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
struct mmu_notifier_range range;
int err = 0;
- flush_cache_range(src_vma, src_addr, src_addr + PAGE_SIZE);
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
- src_addr, src_addr + PAGE_SIZE);
+ src_addr, src_addr + len);
mmu_notifier_invalidate_range_start(&range);
retry:
/*
@@ -1257,7 +1325,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES))
err = -ENOENT;
else /* nothing to do to move a hole */
- err = 0;
+ err = PAGE_SIZE;
goto out;
}
@@ -1375,10 +1443,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
}
}
- err = move_present_pte(mm, dst_vma, src_vma,
- dst_addr, src_addr, dst_pte, src_pte,
- orig_dst_pte, orig_src_pte, dst_pmd,
- dst_pmdval, dst_ptl, src_ptl, src_folio);
+ err = move_present_ptes(mm, dst_vma, src_vma,
+ dst_addr, src_addr, dst_pte, src_pte,
+ orig_dst_pte, orig_src_pte, dst_pmd,
+ dst_pmdval, dst_ptl, src_ptl, src_folio,
+ len, src_anon_vma);
+ /* folio is already unlocked by move_present_ptes() */
+ folio_put(src_folio);
+ src_folio = NULL;
} else {
struct folio *folio = NULL;
@@ -1732,7 +1804,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
{
struct mm_struct *mm = ctx->mm;
struct vm_area_struct *src_vma, *dst_vma;
- unsigned long src_addr, dst_addr;
+ unsigned long src_addr, dst_addr, src_end;
pmd_t *src_pmd, *dst_pmd;
long err = -EINVAL;
ssize_t moved = 0;
@@ -1775,8 +1847,8 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
if (err)
goto out_unlock;
- for (src_addr = src_start, dst_addr = dst_start;
- src_addr < src_start + len;) {
+ for (src_addr = src_start, dst_addr = dst_start, src_end = src_start + len;
+ src_addr < src_end;) {
spinlock_t *ptl;
pmd_t dst_pmdval;
unsigned long step_size;
@@ -1841,6 +1913,8 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
dst_addr, src_addr);
step_size = HPAGE_PMD_SIZE;
} else {
+ long ret;
+
if (pmd_none(*src_pmd)) {
if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES)) {
err = -ENOENT;
@@ -1857,10 +1931,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
break;
}
- err = move_pages_pte(mm, dst_pmd, src_pmd,
- dst_vma, src_vma,
- dst_addr, src_addr, mode);
- step_size = PAGE_SIZE;
+ ret = move_pages_ptes(mm, dst_pmd, src_pmd,
+ dst_vma, src_vma, dst_addr,
+ src_addr, src_end - src_addr, mode);
+ if (ret > 0)
+ step_size = ret;
+ else
+ err = ret;
}
cond_resched();
base-commit: 6e64f4580381e32c06ee146ca807c555b8f73e24
--
2.50.1.565.gc32cd1483b-goog
* Re: [PATCH v3] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-07 10:39 [PATCH v3] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE Lokesh Gidra
@ 2025-08-07 19:16 ` Peter Xu
2025-08-07 22:54 ` Andrew Morton
2025-08-08 16:29 ` Lokesh Gidra
0 siblings, 2 replies; 8+ messages in thread
From: Peter Xu @ 2025-08-07 19:16 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, aarcange, linux-mm, linux-kernel, 21cnbao, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand
Hi, Lokesh,
On Thu, Aug 07, 2025 at 03:39:02AM -0700, Lokesh Gidra wrote:
> MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> for moving present pages. Mitigate this cost by opportunistically
> batching present contiguous pages for TLB flushing.
>
> Without batching, in our testing on an arm64 Android device with UFFD GC,
> which uses MOVE ioctl for compaction, we observed that out of the total
> time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> ~20% in vm_normal_folio().
>
> With batching, the proportion of vm_normal_folio() increases to over
> 70% of move_pages_pte() without any changes to vm_normal_folio().
Do you know why vm_normal_folio() could be expensive? I still see quite
some other things this path needs to do.
> Furthermore, time spent within move_pages_pte() is only ~20%, which
> includes TLB-flush overhead.
Indeed this should already prove the optimization, I'm just curious whether
you've run some benchmark on the GC app to show the real world benefit.
>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Kalesh Singh <kaleshsingh@google.com>
> Cc: Barry Song <v-songbaohua@oppo.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> ---
> Changes since v2 [1]
> - Addressed VM_WARN_ON failure, per Lorenzo Stoakes
> - Added check to ensure all batched pages share the same anon_vma
>
> Changes since v1 [2]
> - Removed flush_tlb_batched_pending(), per Barry Song
> - Unified single and multi page case, per Barry Song
>
> [1] https://lore.kernel.org/all/20250805121410.1658418-1-lokeshgidra@google.com/
> [2] https://lore.kernel.org/all/20250731104726.103071-1-lokeshgidra@google.com/
>
> mm/userfaultfd.c | 179 +++++++++++++++++++++++++++++++++--------------
> 1 file changed, 128 insertions(+), 51 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index cbed91b09640..78c732100aec 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> }
>
> -static int move_present_pte(struct mm_struct *mm,
> - struct vm_area_struct *dst_vma,
> - struct vm_area_struct *src_vma,
> - unsigned long dst_addr, unsigned long src_addr,
> - pte_t *dst_pte, pte_t *src_pte,
> - pte_t orig_dst_pte, pte_t orig_src_pte,
> - pmd_t *dst_pmd, pmd_t dst_pmdval,
> - spinlock_t *dst_ptl, spinlock_t *src_ptl,
> - struct folio *src_folio)
> +/*
> + * Checks if the two ptes and the corresponding folio are eligible for batched
> + * move. If so, then returns pointer to the folio, after locking it. Otherwise,
> + * returns NULL.
> + */
> +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> + unsigned long src_addr,
> + pte_t *src_pte, pte_t *dst_pte,
> + struct anon_vma *src_anon_vma)
> +{
> + pte_t orig_dst_pte, orig_src_pte;
> + struct folio *folio;
> +
> + orig_dst_pte = ptep_get(dst_pte);
> + if (!pte_none(orig_dst_pte))
> + return NULL;
> +
> + orig_src_pte = ptep_get(src_pte);
> + if (pte_none(orig_src_pte) || !pte_present(orig_src_pte) ||
pte_none() check could be removed - the pte_present() check should make
sure it's !none.
> + is_zero_pfn(pte_pfn(orig_src_pte)))
> + return NULL;
> +
> + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> + if (!folio || !folio_trylock(folio))
> + return NULL;
So here we don't take a refcount anymore, while the 1st folio that got
passed in will still have the refcount boosted. IMHO it would still be
better to keep the behavior the same on the 1st and continuous folios..
Or if this is intentional, maybe worth some comment. More below on this..
> + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> + folio_anon_vma(folio) != src_anon_vma) {
> + folio_unlock(folio);
> + return NULL;
> + }
> + return folio;
> +}
> +
> +static long move_present_ptes(struct mm_struct *mm,
> + struct vm_area_struct *dst_vma,
> + struct vm_area_struct *src_vma,
> + unsigned long dst_addr, unsigned long src_addr,
> + pte_t *dst_pte, pte_t *src_pte,
> + pte_t orig_dst_pte, pte_t orig_src_pte,
> + pmd_t *dst_pmd, pmd_t dst_pmdval,
> + spinlock_t *dst_ptl, spinlock_t *src_ptl,
> + struct folio *src_folio, unsigned long len,
> + struct anon_vma *src_anon_vma)
(Not an immediate concern, but this function has the potential to win the
max-num-of-parameters contest among kernel functions.. :)
> {
> int err = 0;
> + unsigned long src_start = src_addr;
> + unsigned long addr_end;
>
> + if (len > PAGE_SIZE) {
> + addr_end = (dst_addr + PMD_SIZE) & PMD_MASK;
> + if (dst_addr + len > addr_end)
> + len = addr_end - dst_addr;
Use something like ALIGN() and MIN()?
> +
> + addr_end = (src_addr + PMD_SIZE) & PMD_MASK;
> + if (src_addr + len > addr_end)
> + len = addr_end - src_addr;
Same here.
> + }
> + flush_cache_range(src_vma, src_addr, src_addr + len);
> double_pt_lock(dst_ptl, src_ptl);
>
> if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1051,31 +1097,54 @@ static int move_present_pte(struct mm_struct *mm,
> err = -EBUSY;
> goto out;
> }
> + arch_enter_lazy_mmu_mode();
> +
> + addr_end = src_start + len;
> + while (true) {
> + orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> + /* Folio got pinned from under us. Put it back and fail the move. */
> + if (folio_maybe_dma_pinned(src_folio)) {
> + set_pte_at(mm, src_addr, src_pte, orig_src_pte);
> + err = -EBUSY;
> + break;
> + }
>
> - orig_src_pte = ptep_clear_flush(src_vma, src_addr, src_pte);
> - /* Folio got pinned from under us. Put it back and fail the move. */
> - if (folio_maybe_dma_pinned(src_folio)) {
> - set_pte_at(mm, src_addr, src_pte, orig_src_pte);
> - err = -EBUSY;
> - goto out;
> - }
> -
> - folio_move_anon_rmap(src_folio, dst_vma);
> - src_folio->index = linear_page_index(dst_vma, dst_addr);
> + folio_move_anon_rmap(src_folio, dst_vma);
> + src_folio->index = linear_page_index(dst_vma, dst_addr);
>
> - orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
> - /* Set soft dirty bit so userspace can notice the pte was moved */
> + orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
> + /* Set soft dirty bit so userspace can notice the pte was moved */
> #ifdef CONFIG_MEM_SOFT_DIRTY
> - orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
> + orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
> #endif
> - if (pte_dirty(orig_src_pte))
> - orig_dst_pte = pte_mkdirty(orig_dst_pte);
> - orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> + if (pte_dirty(orig_src_pte))
> + orig_dst_pte = pte_mkdirty(orig_dst_pte);
> + orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> + set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> +
> + src_addr += PAGE_SIZE;
> + if (src_addr == addr_end)
> + break;
> + src_pte++;
> + dst_pte++;
> +
> + folio_unlock(src_folio);
> + src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
> + dst_pte, src_anon_vma);
> + if (!src_folio)
> + break;
> + dst_addr += PAGE_SIZE;
> + }
> +
> + arch_leave_lazy_mmu_mode();
> + if (src_addr > src_start)
> + flush_tlb_range(src_vma, src_start, src_addr);
>
> - set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> out:
> double_pt_unlock(dst_ptl, src_ptl);
> - return err;
> + if (src_folio)
> + folio_unlock(src_folio);
> + return src_addr > src_start ? src_addr - src_start : err;
> }
>
> static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> @@ -1140,7 +1209,7 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
> double_pt_unlock(dst_ptl, src_ptl);
>
> - return 0;
> + return PAGE_SIZE;
> }
>
> static int move_zeropage_pte(struct mm_struct *mm,
> @@ -1154,6 +1223,7 @@ static int move_zeropage_pte(struct mm_struct *mm,
> {
> pte_t zero_pte;
>
> + flush_cache_range(src_vma, src_addr, src_addr + PAGE_SIZE);
If it's a zero page hence not writable, do we still need to flush cache at
all? Looks harmless, but looks like not needed either.
> double_pt_lock(dst_ptl, src_ptl);
> if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> dst_pmd, dst_pmdval)) {
> @@ -1167,20 +1237,19 @@ static int move_zeropage_pte(struct mm_struct *mm,
> set_pte_at(mm, dst_addr, dst_pte, zero_pte);
> double_pt_unlock(dst_ptl, src_ptl);
>
> - return 0;
> + return PAGE_SIZE;
> }
>
>
> /*
> - * The mmap_lock for reading is held by the caller. Just move the page
> - * from src_pmd to dst_pmd if possible, and return true if succeeded
> - * in moving the page.
> + * The mmap_lock for reading is held by the caller. Just move the page(s)
> + * from src_pmd to dst_pmd if possible, and return number of bytes moved.
> */
> -static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> - struct vm_area_struct *dst_vma,
> - struct vm_area_struct *src_vma,
> - unsigned long dst_addr, unsigned long src_addr,
> - __u64 mode)
> +static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> + struct vm_area_struct *dst_vma,
> + struct vm_area_struct *src_vma,
> + unsigned long dst_addr, unsigned long src_addr,
> + unsigned long len, __u64 mode)
> {
> swp_entry_t entry;
> struct swap_info_struct *si = NULL;
> @@ -1196,9 +1265,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> struct mmu_notifier_range range;
> int err = 0;
>
> - flush_cache_range(src_vma, src_addr, src_addr + PAGE_SIZE);
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
> - src_addr, src_addr + PAGE_SIZE);
> + src_addr, src_addr + len);
> mmu_notifier_invalidate_range_start(&range);
> retry:
> /*
> @@ -1257,7 +1325,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES))
> err = -ENOENT;
> else /* nothing to do to move a hole */
> - err = 0;
> + err = PAGE_SIZE;
> goto out;
> }
>
> @@ -1375,10 +1443,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> }
> }
>
> - err = move_present_pte(mm, dst_vma, src_vma,
> - dst_addr, src_addr, dst_pte, src_pte,
> - orig_dst_pte, orig_src_pte, dst_pmd,
> - dst_pmdval, dst_ptl, src_ptl, src_folio);
> + err = move_present_ptes(mm, dst_vma, src_vma,
> + dst_addr, src_addr, dst_pte, src_pte,
> + orig_dst_pte, orig_src_pte, dst_pmd,
> + dst_pmdval, dst_ptl, src_ptl, src_folio,
> + len, src_anon_vma);
> + /* folio is already unlocked by move_present_ptes() */
> + folio_put(src_folio);
> + src_folio = NULL;
So the function above now can move multiple folios but keep holding the
1st's refcount.. This still smells error prone, sooner or later.
Would it be slightly better if we take a folio pointer in
move_present_ptes(), and release everything there (including resetting the
pointer)?
Thanks,
> } else {
> struct folio *folio = NULL;
>
> @@ -1732,7 +1804,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> {
> struct mm_struct *mm = ctx->mm;
> struct vm_area_struct *src_vma, *dst_vma;
> - unsigned long src_addr, dst_addr;
> + unsigned long src_addr, dst_addr, src_end;
> pmd_t *src_pmd, *dst_pmd;
> long err = -EINVAL;
> ssize_t moved = 0;
> @@ -1775,8 +1847,8 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> if (err)
> goto out_unlock;
>
> - for (src_addr = src_start, dst_addr = dst_start;
> - src_addr < src_start + len;) {
> + for (src_addr = src_start, dst_addr = dst_start, src_end = src_start + len;
> + src_addr < src_end;) {
> spinlock_t *ptl;
> pmd_t dst_pmdval;
> unsigned long step_size;
> @@ -1841,6 +1913,8 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> dst_addr, src_addr);
> step_size = HPAGE_PMD_SIZE;
> } else {
> + long ret;
> +
> if (pmd_none(*src_pmd)) {
> if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES)) {
> err = -ENOENT;
> @@ -1857,10 +1931,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> break;
> }
>
> - err = move_pages_pte(mm, dst_pmd, src_pmd,
> - dst_vma, src_vma,
> - dst_addr, src_addr, mode);
> - step_size = PAGE_SIZE;
> + ret = move_pages_ptes(mm, dst_pmd, src_pmd,
> + dst_vma, src_vma, dst_addr,
> + src_addr, src_end - src_addr, mode);
> + if (ret > 0)
> + step_size = ret;
> + else
> + err = ret;
> }
>
> cond_resched();
>
> base-commit: 6e64f4580381e32c06ee146ca807c555b8f73e24
> --
> 2.50.1.565.gc32cd1483b-goog
>
--
Peter Xu
* Re: [PATCH v3] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-07 19:16 ` Peter Xu
@ 2025-08-07 22:54 ` Andrew Morton
2025-08-08 16:41 ` Lokesh Gidra
2025-08-08 16:29 ` Lokesh Gidra
1 sibling, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2025-08-07 22:54 UTC (permalink / raw)
To: Peter Xu
Cc: Lokesh Gidra, aarcange, linux-mm, linux-kernel, 21cnbao,
ngeoffray, Suren Baghdasaryan, Kalesh Singh, Barry Song,
David Hildenbrand
On Thu, 7 Aug 2025 15:16:57 -0400 Peter Xu <peterx@redhat.com> wrote:
> Hi, Lokesh,
>
> On Thu, Aug 07, 2025 at 03:39:02AM -0700, Lokesh Gidra wrote:
> > MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> > for moving present pages. Mitigate this cost by opportunistically
> > batching present contiguous pages for TLB flushing.
> >
> > Without batching, in our testing on an arm64 Android device with UFFD GC,
> > which uses MOVE ioctl for compaction, we observed that out of the total
> > time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> > ~20% in vm_normal_folio().
> >
> > With batching, the proportion of vm_normal_folio() increases to over
> > 70% of move_pages_pte() without any changes to vm_normal_folio().
>
> Do you know why vm_normal_folio() could be expensive? I still see quite
> some other things this path needs to do.
Maybe as explained here?
https://lkml.kernel.org/r/20250807185819.199865-1-lorenzo.stoakes@oracle.com
* Re: [PATCH v3] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-07 19:16 ` Peter Xu
2025-08-07 22:54 ` Andrew Morton
@ 2025-08-08 16:29 ` Lokesh Gidra
2025-08-10 6:31 ` Lokesh Gidra
2025-08-11 14:00 ` Peter Xu
1 sibling, 2 replies; 8+ messages in thread
From: Lokesh Gidra @ 2025-08-08 16:29 UTC (permalink / raw)
To: Peter Xu
Cc: akpm, aarcange, linux-mm, linux-kernel, 21cnbao, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand
On Thu, Aug 7, 2025 at 12:17 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Lokesh,
>
> On Thu, Aug 07, 2025 at 03:39:02AM -0700, Lokesh Gidra wrote:
> > MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> > for moving present pages. Mitigate this cost by opportunistically
> > batching present contiguous pages for TLB flushing.
> >
> > Without batching, in our testing on an arm64 Android device with UFFD GC,
> > which uses MOVE ioctl for compaction, we observed that out of the total
> > time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> > ~20% in vm_normal_folio().
> >
> > With batching, the proportion of vm_normal_folio() increases to over
> > 70% of move_pages_pte() without any changes to vm_normal_folio().
>
> Do you know why vm_normal_folio() could be expensive? I still see quite
> some other things this path needs to do.
>
Let's discuss this in Andrew's reply thread.
> > Furthermore, time spent within move_pages_pte() is only ~20%, which
> > includes TLB-flush overhead.
>
> Indeed this should already prove the optimization, I'm just curious whether
> you've run some benchmark on the GC app to show the real world benefit.
>
I did! With the same benchmark from which I gathered these numbers, run
on cuttlefish (a QEMU Android instance on x86_64), the completion time
went down from ~45 mins to ~20 mins. The benchmark is very GC intensive,
and the overhead of IPIs on vCPUs appears to be enormous, leading to
this drastic improvement.
In another instance, system_server, one of the most critical system
processes on android, saw over 50% reduction in GC compaction time on
an arm64 android device.
> >
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Kalesh Singh <kaleshsingh@google.com>
> > Cc: Barry Song <v-songbaohua@oppo.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > ---
> > Changes since v2 [1]
> > - Addressed VM_WARN_ON failure, per Lorenzo Stoakes
> > - Added check to ensure all batched pages share the same anon_vma
> >
> > Changes since v1 [2]
> > - Removed flush_tlb_batched_pending(), per Barry Song
> > - Unified single and multi page case, per Barry Song
> >
> > [1] https://lore.kernel.org/all/20250805121410.1658418-1-lokeshgidra@google.com/
> > [2] https://lore.kernel.org/all/20250731104726.103071-1-lokeshgidra@google.com/
> >
> > mm/userfaultfd.c | 179 +++++++++++++++++++++++++++++++++--------------
> > 1 file changed, 128 insertions(+), 51 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index cbed91b09640..78c732100aec 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> > }
> >
> > -static int move_present_pte(struct mm_struct *mm,
> > - struct vm_area_struct *dst_vma,
> > - struct vm_area_struct *src_vma,
> > - unsigned long dst_addr, unsigned long src_addr,
> > - pte_t *dst_pte, pte_t *src_pte,
> > - pte_t orig_dst_pte, pte_t orig_src_pte,
> > - pmd_t *dst_pmd, pmd_t dst_pmdval,
> > - spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > - struct folio *src_folio)
> > +/*
> > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > + * move. If so, then returns pointer to the folio, after locking it. Otherwise,
> > + * returns NULL.
> > + */
> > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > + unsigned long src_addr,
> > + pte_t *src_pte, pte_t *dst_pte,
> > + struct anon_vma *src_anon_vma)
> > +{
> > + pte_t orig_dst_pte, orig_src_pte;
> > + struct folio *folio;
> > +
> > + orig_dst_pte = ptep_get(dst_pte);
> > + if (!pte_none(orig_dst_pte))
> > + return NULL;
> > +
> > + orig_src_pte = ptep_get(src_pte);
> > + if (pte_none(orig_src_pte) || !pte_present(orig_src_pte) ||
>
> pte_none() check could be removed - the pte_present() check should make
> sure it's !none.
>
Makes sense. I'll make the change in the next version of the patch.
> > + is_zero_pfn(pte_pfn(orig_src_pte)))
> > + return NULL;
> > +
> > + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > + if (!folio || !folio_trylock(folio))
> > + return NULL;
>
> So here we don't take a refcount anymore, while the 1st folio that got
> passed in will still have the refcount boosted. IMHO it would still be
> better to keep the behavior the same on the 1st and continuous folios..
>
> Or if this is intentional, maybe worth some comment. More below on this..
>
This is indeed intentional, and I'll add a comment in the next
version. But let me explain:
The first folio needs the refcount because we have to pin the page before
releasing the src PTL, and because split_folio(), if called, expects the
caller to hold both the lock and a reference on the folio.
The subsequent folios in the batch are handled entirely within the PTL
critical section and never need to be split, so I didn't want to
unnecessarily increase the work done within the critical section.
But, please correct me if I'm misunderstanding something.
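A possible wording for that comment in check_ptes_for_batched_move() (just
a sketch, not the final text):

	/*
	 * Unlike the first folio of the batch, no extra reference is taken
	 * here: these folios are only ever touched with both PTLs held and
	 * are never split, so the folio lock is enough to keep them stable
	 * for the duration of the move.
	 */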
> > + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > + folio_anon_vma(folio) != src_anon_vma) {
> > + folio_unlock(folio);
> > + return NULL;
> > + }
> > + return folio;
> > +}
> > +
> > +static long move_present_ptes(struct mm_struct *mm,
> > + struct vm_area_struct *dst_vma,
> > + struct vm_area_struct *src_vma,
> > + unsigned long dst_addr, unsigned long src_addr,
> > + pte_t *dst_pte, pte_t *src_pte,
> > + pte_t orig_dst_pte, pte_t orig_src_pte,
> > + pmd_t *dst_pmd, pmd_t dst_pmdval,
> > + spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > + struct folio *src_folio, unsigned long len,
> > + struct anon_vma *src_anon_vma)
>
> (Not an immediate concern, but this function has the potential to win the
> max-num-of-parameters contest among kernel functions.. :)
I noticed the same when I made the change :) Maybe in a subsequent
patch, if we inline is_pte_pages_stable() and the PTL acquire/release in
move_pages_ptes(), quite a few parameters could be dropped. But then,
IMO, move_pages_ptes() itself would need refactoring as well.
>
> > {
> > int err = 0;
> > + unsigned long src_start = src_addr;
> > + unsigned long addr_end;
> >
> > + if (len > PAGE_SIZE) {
> > + addr_end = (dst_addr + PMD_SIZE) & PMD_MASK;
> > + if (dst_addr + len > addr_end)
> > + len = addr_end - dst_addr;
>
> Use something like ALIGN() and MIN()?
Will do in the next version.
>
> > +
> > + addr_end = (src_addr + PMD_SIZE) & PMD_MASK;
> > + if (src_addr + len > addr_end)
> > + len = addr_end - src_addr;
>
> Same here.
>
Will do.
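Something like this, perhaps (untested sketch; ALIGN() and min3() are the
generic helpers from include/linux/align.h and include/linux/minmax.h):

	/* Clamp the batch so it never crosses a PMD boundary on either side. */
	len = min3(len,
		   ALIGN(dst_addr + 1, PMD_SIZE) - dst_addr,
		   ALIGN(src_addr + 1, PMD_SIZE) - src_addr);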
> > + }
> > + flush_cache_range(src_vma, src_addr, src_addr + len);
> > double_pt_lock(dst_ptl, src_ptl);
> >
> > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1051,31 +1097,54 @@ static int move_present_pte(struct mm_struct *mm,
> > err = -EBUSY;
> > goto out;
> > }
> > + arch_enter_lazy_mmu_mode();
> > +
> > + addr_end = src_start + len;
> > + while (true) {
> > + orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > + /* Folio got pinned from under us. Put it back and fail the move. */
> > + if (folio_maybe_dma_pinned(src_folio)) {
> > + set_pte_at(mm, src_addr, src_pte, orig_src_pte);
> > + err = -EBUSY;
> > + break;
> > + }
> >
> > - orig_src_pte = ptep_clear_flush(src_vma, src_addr, src_pte);
> > - /* Folio got pinned from under us. Put it back and fail the move. */
> > - if (folio_maybe_dma_pinned(src_folio)) {
> > - set_pte_at(mm, src_addr, src_pte, orig_src_pte);
> > - err = -EBUSY;
> > - goto out;
> > - }
> > -
> > - folio_move_anon_rmap(src_folio, dst_vma);
> > - src_folio->index = linear_page_index(dst_vma, dst_addr);
> > + folio_move_anon_rmap(src_folio, dst_vma);
> > + src_folio->index = linear_page_index(dst_vma, dst_addr);
> >
> > - orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
> > - /* Set soft dirty bit so userspace can notice the pte was moved */
> > + orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
> > + /* Set soft dirty bit so userspace can notice the pte was moved */
> > #ifdef CONFIG_MEM_SOFT_DIRTY
> > - orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
> > + orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
> > #endif
> > - if (pte_dirty(orig_src_pte))
> > - orig_dst_pte = pte_mkdirty(orig_dst_pte);
> > - orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> > + if (pte_dirty(orig_src_pte))
> > + orig_dst_pte = pte_mkdirty(orig_dst_pte);
> > + orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> > + set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> > +
> > + src_addr += PAGE_SIZE;
> > + if (src_addr == addr_end)
> > + break;
> > + src_pte++;
> > + dst_pte++;
> > +
> > + folio_unlock(src_folio);
> > + src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
> > + dst_pte, src_anon_vma);
> > + if (!src_folio)
> > + break;
> > + dst_addr += PAGE_SIZE;
> > + }
> > +
> > + arch_leave_lazy_mmu_mode();
> > + if (src_addr > src_start)
> > + flush_tlb_range(src_vma, src_start, src_addr);
> >
> > - set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> > out:
> > double_pt_unlock(dst_ptl, src_ptl);
> > - return err;
> > + if (src_folio)
> > + folio_unlock(src_folio);
> > + return src_addr > src_start ? src_addr - src_start : err;
> > }
> >
> > static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > @@ -1140,7 +1209,7 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
> > double_pt_unlock(dst_ptl, src_ptl);
> >
> > - return 0;
> > + return PAGE_SIZE;
> > }
> >
> > static int move_zeropage_pte(struct mm_struct *mm,
> > @@ -1154,6 +1223,7 @@ static int move_zeropage_pte(struct mm_struct *mm,
> > {
> > pte_t zero_pte;
> >
> > + flush_cache_range(src_vma, src_addr, src_addr + PAGE_SIZE);
>
> If it's a zero page hence not writable, do we still need to flush cache at
> all? Looks harmless, but looks like not needed either.
>
I just realized when reading your comment that it is indeed not
required. There is no cacheline to be flushed for the zero-page :)
> > double_pt_lock(dst_ptl, src_ptl);
> > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > dst_pmd, dst_pmdval)) {
> > @@ -1167,20 +1237,19 @@ static int move_zeropage_pte(struct mm_struct *mm,
> > set_pte_at(mm, dst_addr, dst_pte, zero_pte);
> > double_pt_unlock(dst_ptl, src_ptl);
> >
> > - return 0;
> > + return PAGE_SIZE;
> > }
> >
> >
> > /*
> > - * The mmap_lock for reading is held by the caller. Just move the page
> > - * from src_pmd to dst_pmd if possible, and return true if succeeded
> > - * in moving the page.
> > + * The mmap_lock for reading is held by the caller. Just move the page(s)
> > + * from src_pmd to dst_pmd if possible, and return number of bytes moved.
> > */
> > -static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > - struct vm_area_struct *dst_vma,
> > - struct vm_area_struct *src_vma,
> > - unsigned long dst_addr, unsigned long src_addr,
> > - __u64 mode)
> > +static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > + struct vm_area_struct *dst_vma,
> > + struct vm_area_struct *src_vma,
> > + unsigned long dst_addr, unsigned long src_addr,
> > + unsigned long len, __u64 mode)
> > {
> > swp_entry_t entry;
> > struct swap_info_struct *si = NULL;
> > @@ -1196,9 +1265,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > struct mmu_notifier_range range;
> > int err = 0;
> >
> > - flush_cache_range(src_vma, src_addr, src_addr + PAGE_SIZE);
> > mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
> > - src_addr, src_addr + PAGE_SIZE);
> > + src_addr, src_addr + len);
> > mmu_notifier_invalidate_range_start(&range);
> > retry:
> > /*
> > @@ -1257,7 +1325,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES))
> > err = -ENOENT;
> > else /* nothing to do to move a hole */
> > - err = 0;
> > + err = PAGE_SIZE;
> > goto out;
> > }
> >
> > @@ -1375,10 +1443,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > }
> > }
> >
> > - err = move_present_pte(mm, dst_vma, src_vma,
> > - dst_addr, src_addr, dst_pte, src_pte,
> > - orig_dst_pte, orig_src_pte, dst_pmd,
> > - dst_pmdval, dst_ptl, src_ptl, src_folio);
> > + err = move_present_ptes(mm, dst_vma, src_vma,
> > + dst_addr, src_addr, dst_pte, src_pte,
> > + orig_dst_pte, orig_src_pte, dst_pmd,
> > + dst_pmdval, dst_ptl, src_ptl, src_folio,
> > + len, src_anon_vma);
> > + /* folio is already unlocked by move_present_ptes() */
> > + folio_put(src_folio);
> > + src_folio = NULL;
>
> So the function above now can move multiple folios but keep holding the
> 1st's refcount.. This still smells error prone, sooner or later.
>
> Would it be slightly better if we take a folio pointer in
> move_present_ptes(), and release everything there (including resetting the
> pointer)?
>
Yeah this seems cleaner. I'll do so in the next version.
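For illustration only (v4 may do it differently), the call site could then
look something like:

	err = move_present_ptes(mm, dst_vma, src_vma, dst_addr, src_addr,
				dst_pte, src_pte, orig_dst_pte, orig_src_pte,
				dst_pmd, dst_pmdval, dst_ptl, src_ptl,
				&src_folio, len, src_anon_vma);
	/*
	 * move_present_ptes() has dropped the folio lock and reference and
	 * set src_folio to NULL, so there is nothing to clean up here.
	 */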
> Thanks,
>
> > } else {
> > struct folio *folio = NULL;
> >
> > @@ -1732,7 +1804,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > {
> > struct mm_struct *mm = ctx->mm;
> > struct vm_area_struct *src_vma, *dst_vma;
> > - unsigned long src_addr, dst_addr;
> > + unsigned long src_addr, dst_addr, src_end;
> > pmd_t *src_pmd, *dst_pmd;
> > long err = -EINVAL;
> > ssize_t moved = 0;
> > @@ -1775,8 +1847,8 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > if (err)
> > goto out_unlock;
> >
> > - for (src_addr = src_start, dst_addr = dst_start;
> > - src_addr < src_start + len;) {
> > + for (src_addr = src_start, dst_addr = dst_start, src_end = src_start + len;
> > + src_addr < src_end;) {
> > spinlock_t *ptl;
> > pmd_t dst_pmdval;
> > unsigned long step_size;
> > @@ -1841,6 +1913,8 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > dst_addr, src_addr);
> > step_size = HPAGE_PMD_SIZE;
> > } else {
> > + long ret;
> > +
> > if (pmd_none(*src_pmd)) {
> > if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES)) {
> > err = -ENOENT;
> > @@ -1857,10 +1931,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > break;
> > }
> >
> > - err = move_pages_pte(mm, dst_pmd, src_pmd,
> > - dst_vma, src_vma,
> > - dst_addr, src_addr, mode);
> > - step_size = PAGE_SIZE;
> > + ret = move_pages_ptes(mm, dst_pmd, src_pmd,
> > + dst_vma, src_vma, dst_addr,
> > + src_addr, src_end - src_addr, mode);
> > + if (ret > 0)
> > + step_size = ret;
> > + else
> > + err = ret;
> > }
> >
> > cond_resched();
> >
> > base-commit: 6e64f4580381e32c06ee146ca807c555b8f73e24
> > --
> > 2.50.1.565.gc32cd1483b-goog
> >
>
> --
> Peter Xu
>
* Re: [PATCH v3] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-07 22:54 ` Andrew Morton
@ 2025-08-08 16:41 ` Lokesh Gidra
0 siblings, 0 replies; 8+ messages in thread
From: Lokesh Gidra @ 2025-08-08 16:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, aarcange, linux-mm, linux-kernel, 21cnbao, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand
On Thu, Aug 7, 2025 at 3:54 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 7 Aug 2025 15:16:57 -0400 Peter Xu <peterx@redhat.com> wrote:
>
> > Hi, Lokesh,
> >
> > On Thu, Aug 07, 2025 at 03:39:02AM -0700, Lokesh Gidra wrote:
> > > MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> > > for moving present pages. Mitigate this cost by opportunistically
> > > batching present contiguous pages for TLB flushing.
> > >
> > > Without batching, in our testing on an arm64 Android device with UFFD GC,
> > > which uses MOVE ioctl for compaction, we observed that out of the total
> > > time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> > > ~20% in vm_normal_folio().
> > >
> > > With batching, the proportion of vm_normal_folio() increases to over
> > > 70% of move_pages_pte() without any changes to vm_normal_folio().
> >
> > Do you know why vm_normal_folio() could be expensive? I still see quite
> > some other things this path needs to do.
>
> Maybe as explained here?
> https://lkml.kernel.org/r/20250807185819.199865-1-lorenzo.stoakes@oracle.com
>
Thanks for sharing this, Andrew. IMHO, this seems like the most likely
reason to me. There is nothing there other than a cold access to the
page struct.
>
* Re: [PATCH v3] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-08 16:29 ` Lokesh Gidra
@ 2025-08-10 6:31 ` Lokesh Gidra
2025-08-11 14:00 ` Peter Xu
1 sibling, 0 replies; 8+ messages in thread
From: Lokesh Gidra @ 2025-08-10 6:31 UTC (permalink / raw)
To: Peter Xu
Cc: akpm, aarcange, linux-mm, linux-kernel, 21cnbao, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand
On Fri, Aug 8, 2025 at 9:29 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Thu, Aug 7, 2025 at 12:17 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Lokesh,
> >
> > On Thu, Aug 07, 2025 at 03:39:02AM -0700, Lokesh Gidra wrote:
> > > MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> > > for moving present pages. Mitigate this cost by opportunistically
> > > batching present contiguous pages for TLB flushing.
> > >
> > > Without batching, in our testing on an arm64 Android device with UFFD GC,
> > > which uses MOVE ioctl for compaction, we observed that out of the total
> > > time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> > > ~20% in vm_normal_folio().
> > >
> > > With batching, the proportion of vm_normal_folio() increases to over
> > > 70% of move_pages_pte() without any changes to vm_normal_folio().
> >
> > Do you know why vm_normal_folio() could be expensive? I still see quite
> > some other things this path needs to do.
> >
> Let's discuss this in Andrew's reply thread.
>
> > > Furthermore, time spent within move_pages_pte() is only ~20%, which
> > > includes TLB-flush overhead.
> >
> > Indeed this should already prove the optimization, I'm just curious whether
> > you've run some benchmark on the GC app to show the real world benefit.
> >
> I did! With the same benchmark from which I gathered these numbers, run
> on cuttlefish (a QEMU Android instance on x86_64), the completion time
> went down from ~45 mins to ~20 mins. The benchmark is very GC intensive,
> and the overhead of IPIs on vCPUs appears to be enormous, leading to
> this drastic improvement.
>
> In another instance, system_server, one of the most critical system
> processes on android, saw over 50% reduction in GC compaction time on
> an arm64 android device.
>
> > >
> > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > Cc: Kalesh Singh <kaleshsingh@google.com>
> > > Cc: Barry Song <v-songbaohua@oppo.com>
> > > Cc: David Hildenbrand <david@redhat.com>
> > > Cc: Peter Xu <peterx@redhat.com>
> > > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > > ---
> > > Changes since v2 [1]
> > > - Addressed VM_WARN_ON failure, per Lorenzo Stoakes
> > > - Added check to ensure all batched pages share the same anon_vma
> > >
> > > Changes since v1 [2]
> > > - Removed flush_tlb_batched_pending(), per Barry Song
> > > - Unified single and multi page case, per Barry Song
> > >
> > > [1] https://lore.kernel.org/all/20250805121410.1658418-1-lokeshgidra@google.com/
> > > [2] https://lore.kernel.org/all/20250731104726.103071-1-lokeshgidra@google.com/
> > >
> > > mm/userfaultfd.c | 179 +++++++++++++++++++++++++++++++++--------------
> > > 1 file changed, 128 insertions(+), 51 deletions(-)
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index cbed91b09640..78c732100aec 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > > pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> > > }
> > >
> > > -static int move_present_pte(struct mm_struct *mm,
> > > - struct vm_area_struct *dst_vma,
> > > - struct vm_area_struct *src_vma,
> > > - unsigned long dst_addr, unsigned long src_addr,
> > > - pte_t *dst_pte, pte_t *src_pte,
> > > - pte_t orig_dst_pte, pte_t orig_src_pte,
> > > - pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > - spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > - struct folio *src_folio)
> > > +/*
> > > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > > + * move. If so, then returns pointer to the folio, after locking it. Otherwise,
> > > + * returns NULL.
> > > + */
> > > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > > + unsigned long src_addr,
> > > + pte_t *src_pte, pte_t *dst_pte,
> > > + struct anon_vma *src_anon_vma)
> > > +{
> > > + pte_t orig_dst_pte, orig_src_pte;
> > > + struct folio *folio;
> > > +
> > > + orig_dst_pte = ptep_get(dst_pte);
> > > + if (!pte_none(orig_dst_pte))
> > > + return NULL;
> > > +
> > > + orig_src_pte = ptep_get(src_pte);
> > > + if (pte_none(orig_src_pte) || !pte_present(orig_src_pte) ||
> >
> > pte_none() check could be removed - the pte_present() check should make
> > sure it's !none.
> >
> Makes sense. I'll make the change in the next version of the patch.
>
> > > + is_zero_pfn(pte_pfn(orig_src_pte)))
> > > + return NULL;
> > > +
> > > + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > > + if (!folio || !folio_trylock(folio))
> > > + return NULL;
> >
> > So here we don't take a refcount anymore, while the 1st folio that got
> > passed in will still have the refcount boosted. IMHO it would still be
> > better to keep the behavior the same on the 1st and continuous folios..
> >
> > Or if this is intentional, maybe worth some comment. More below on this..
> >
> This is indeed intentional, and I'll add a comment in the next
> version. But let me explain:
>
> The first folio needs the refcount because we have to pin the page before
> releasing the src PTL, and because split_folio(), if called, expects the
> caller to hold both the lock and a reference on the folio.
>
> The subsequent folios in the batch are handled entirely within the PTL
> critical section and never need to be split, so I didn't want to
> unnecessarily increase the work done within the critical section.
> But, please correct me if I'm misunderstanding something.
>
> > > + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > > + folio_anon_vma(folio) != src_anon_vma) {
> > > + folio_unlock(folio);
> > > + return NULL;
> > > + }
> > > + return folio;
> > > +}
> > > +
> > > +static long move_present_ptes(struct mm_struct *mm,
> > > + struct vm_area_struct *dst_vma,
> > > + struct vm_area_struct *src_vma,
> > > + unsigned long dst_addr, unsigned long src_addr,
> > > + pte_t *dst_pte, pte_t *src_pte,
> > > + pte_t orig_dst_pte, pte_t orig_src_pte,
> > > + pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > + spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > + struct folio *src_folio, unsigned long len,
> > > + struct anon_vma *src_anon_vma)
> >
> > (Not an immediate concern, but this function has the potential to win the
> > max-num-of-parameters contest among kernel functions.. :)
>
> I noticed the same when I made the change :) Maybe in a subsequent
> patch, if we inline is_pte_pages_stable() and the PTL acquire/release in
> move_pages_ptes(), quite a few parameters could be dropped. But then,
> IMO, move_pages_ptes() itself would need refactoring as well.
> >
> > > {
> > > int err = 0;
> > > + unsigned long src_start = src_addr;
> > > + unsigned long addr_end;
> > >
> > > + if (len > PAGE_SIZE) {
> > > + addr_end = (dst_addr + PMD_SIZE) & PMD_MASK;
> > > + if (dst_addr + len > addr_end)
> > > + len = addr_end - dst_addr;
> >
> > Use something like ALIGN() and MIN()?
>
> Will do in the next version.
> >
> > > +
> > > + addr_end = (src_addr + PMD_SIZE) & PMD_MASK;
> > > + if (src_addr + len > addr_end)
> > > + len = addr_end - src_addr;
> >
> > Same here.
> >
> Will do.
>
> > > + }
> > > + flush_cache_range(src_vma, src_addr, src_addr + len);
> > > double_pt_lock(dst_ptl, src_ptl);
> > >
> > > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > @@ -1051,31 +1097,54 @@ static int move_present_pte(struct mm_struct *mm,
> > > err = -EBUSY;
> > > goto out;
> > > }
> > > + arch_enter_lazy_mmu_mode();
> > > +
> > > + addr_end = src_start + len;
> > > + while (true) {
> > > + orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > > + /* Folio got pinned from under us. Put it back and fail the move. */
> > > + if (folio_maybe_dma_pinned(src_folio)) {
> > > + set_pte_at(mm, src_addr, src_pte, orig_src_pte);
> > > + err = -EBUSY;
> > > + break;
> > > + }
> > >
> > > - orig_src_pte = ptep_clear_flush(src_vma, src_addr, src_pte);
> > > - /* Folio got pinned from under us. Put it back and fail the move. */
> > > - if (folio_maybe_dma_pinned(src_folio)) {
> > > - set_pte_at(mm, src_addr, src_pte, orig_src_pte);
> > > - err = -EBUSY;
> > > - goto out;
> > > - }
> > > -
> > > - folio_move_anon_rmap(src_folio, dst_vma);
> > > - src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > + folio_move_anon_rmap(src_folio, dst_vma);
> > > + src_folio->index = linear_page_index(dst_vma, dst_addr);
> > >
> > > - orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
> > > - /* Set soft dirty bit so userspace can notice the pte was moved */
> > > + orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
> > > + /* Set soft dirty bit so userspace can notice the pte was moved */
> > > #ifdef CONFIG_MEM_SOFT_DIRTY
> > > - orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
> > > + orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
> > > #endif
> > > - if (pte_dirty(orig_src_pte))
> > > - orig_dst_pte = pte_mkdirty(orig_dst_pte);
> > > - orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> > > + if (pte_dirty(orig_src_pte))
> > > + orig_dst_pte = pte_mkdirty(orig_dst_pte);
> > > + orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> > > + set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> > > +
> > > + src_addr += PAGE_SIZE;
> > > + if (src_addr == addr_end)
> > > + break;
> > > + src_pte++;
> > > + dst_pte++;
> > > +
> > > + folio_unlock(src_folio);
> > > + src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
> > > + dst_pte, src_anon_vma);
> > > + if (!src_folio)
> > > + break;
> > > + dst_addr += PAGE_SIZE;
> > > + }
> > > +
> > > + arch_leave_lazy_mmu_mode();
> > > + if (src_addr > src_start)
> > > + flush_tlb_range(src_vma, src_start, src_addr);
> > >
> > > - set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> > > out:
> > > double_pt_unlock(dst_ptl, src_ptl);
> > > - return err;
> > > + if (src_folio)
> > > + folio_unlock(src_folio);
> > > + return src_addr > src_start ? src_addr - src_start : err;
> > > }
> > >
> > > static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > @@ -1140,7 +1209,7 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
> > > double_pt_unlock(dst_ptl, src_ptl);
> > >
> > > - return 0;
> > > + return PAGE_SIZE;
> > > }
> > >
> > > static int move_zeropage_pte(struct mm_struct *mm,
> > > @@ -1154,6 +1223,7 @@ static int move_zeropage_pte(struct mm_struct *mm,
> > > {
> > > pte_t zero_pte;
> > >
> > > + flush_cache_range(src_vma, src_addr, src_addr + PAGE_SIZE);
> >
> > If it's a zero page hence not writable, do we still need to flush cache at
> > all? Looks harmless, but looks like not needed either.
> >
> I just realized when reading your comment that it is indeed not
> required. There is no cacheline to be flushed for the zero-page :)
>
> > > double_pt_lock(dst_ptl, src_ptl);
> > > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > dst_pmd, dst_pmdval)) {
> > > @@ -1167,20 +1237,19 @@ static int move_zeropage_pte(struct mm_struct *mm,
> > > set_pte_at(mm, dst_addr, dst_pte, zero_pte);
> > > double_pt_unlock(dst_ptl, src_ptl);
> > >
> > > - return 0;
> > > + return PAGE_SIZE;
> > > }
> > >
> > >
> > > /*
> > > - * The mmap_lock for reading is held by the caller. Just move the page
> > > - * from src_pmd to dst_pmd if possible, and return true if succeeded
> > > - * in moving the page.
> > > + * The mmap_lock for reading is held by the caller. Just move the page(s)
> > > + * from src_pmd to dst_pmd if possible, and return number of bytes moved.
> > > */
> > > -static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > - struct vm_area_struct *dst_vma,
> > > - struct vm_area_struct *src_vma,
> > > - unsigned long dst_addr, unsigned long src_addr,
> > > - __u64 mode)
> > > +static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > + struct vm_area_struct *dst_vma,
> > > + struct vm_area_struct *src_vma,
> > > + unsigned long dst_addr, unsigned long src_addr,
> > > + unsigned long len, __u64 mode)
> > > {
> > > swp_entry_t entry;
> > > struct swap_info_struct *si = NULL;
> > > @@ -1196,9 +1265,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > struct mmu_notifier_range range;
> > > int err = 0;
> > >
> > > - flush_cache_range(src_vma, src_addr, src_addr + PAGE_SIZE);
> > > mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
> > > - src_addr, src_addr + PAGE_SIZE);
> > > + src_addr, src_addr + len);
> > > mmu_notifier_invalidate_range_start(&range);
> > > retry:
> > > /*
> > > @@ -1257,7 +1325,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES))
> > > err = -ENOENT;
> > > else /* nothing to do to move a hole */
> > > - err = 0;
> > > + err = PAGE_SIZE;
> > > goto out;
> > > }
> > >
> > > @@ -1375,10 +1443,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > }
> > > }
> > >
> > > - err = move_present_pte(mm, dst_vma, src_vma,
> > > - dst_addr, src_addr, dst_pte, src_pte,
> > > - orig_dst_pte, orig_src_pte, dst_pmd,
> > > - dst_pmdval, dst_ptl, src_ptl, src_folio);
> > > + err = move_present_ptes(mm, dst_vma, src_vma,
> > > + dst_addr, src_addr, dst_pte, src_pte,
> > > + orig_dst_pte, orig_src_pte, dst_pmd,
> > > + dst_pmdval, dst_ptl, src_ptl, src_folio,
> > > + len, src_anon_vma);
> > > + /* folio is already unlocked by move_present_ptes() */
> > > + folio_put(src_folio);
> > > + src_folio = NULL;
> >
> > So the function above now can move multiple folios but keep holding the
> > 1st's refcount.. This still smells error prone, sooner or later.
> >
> > Would it be slightly better if we take a folio pointer in
> > move_present_ptes(), and release everything there (including resetting the
> > pointer)?
> >
> Yeah this seems cleaner. I'll do so in the next version.
>
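As an aside, the ownership-transfer shape being suggested here can be reduced
to a minimal, self-contained C mock-up (purely illustrative: the types,
helpers and signature below are made-up stand-ins, not the kernel's folio API
or the actual v4 code):

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for a refcounted folio; not the kernel's struct folio. */
struct mock_folio {
	int refcount;
};

static void mock_folio_put(struct mock_folio *f)
{
	if (--f->refcount == 0)
		free(f);
}

/*
 * The callee consumes the caller's reference and clears the caller's
 * pointer, so the caller cannot reuse it by mistake after the batch
 * move, however many further pages the callee handled internally.
 */
static long mock_move_present_ptes(struct mock_folio **first_folio,
				   long bytes_to_move)
{
	struct mock_folio *f = *first_folio;

	/* ... batched PTE moves would happen here ... */

	mock_folio_put(f);
	*first_folio = NULL;
	return bytes_to_move;
}

int main(void)
{
	struct mock_folio *folio = malloc(sizeof(*folio));
	long moved;

	if (!folio)
		return 1;
	folio->refcount = 1;
	moved = mock_move_present_ptes(&folio, 4096);
	printf("moved %ld bytes; caller's folio pointer is now %p\n",
	       moved, (void *)folio);
	return 0;
}

The only point of the shape is that once the call returns, the caller no
longer owns the reference and its pointer is already NULL, so there is no
window in which a stale pointer can be used.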
Uploaded v4 addressing all the comments here:
https://lore.kernel.org/all/20250810062912.1096815-1-lokeshgidra@google.com/
Thanks.
> > Thanks,
> >
> > > } else {
> > > struct folio *folio = NULL;
> > >
> > > @@ -1732,7 +1804,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > > {
> > > struct mm_struct *mm = ctx->mm;
> > > struct vm_area_struct *src_vma, *dst_vma;
> > > - unsigned long src_addr, dst_addr;
> > > + unsigned long src_addr, dst_addr, src_end;
> > > pmd_t *src_pmd, *dst_pmd;
> > > long err = -EINVAL;
> > > ssize_t moved = 0;
> > > @@ -1775,8 +1847,8 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > > if (err)
> > > goto out_unlock;
> > >
> > > - for (src_addr = src_start, dst_addr = dst_start;
> > > - src_addr < src_start + len;) {
> > > + for (src_addr = src_start, dst_addr = dst_start, src_end = src_start + len;
> > > + src_addr < src_end;) {
> > > spinlock_t *ptl;
> > > pmd_t dst_pmdval;
> > > unsigned long step_size;
> > > @@ -1841,6 +1913,8 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > > dst_addr, src_addr);
> > > step_size = HPAGE_PMD_SIZE;
> > > } else {
> > > + long ret;
> > > +
> > > if (pmd_none(*src_pmd)) {
> > > if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES)) {
> > > err = -ENOENT;
> > > @@ -1857,10 +1931,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > > break;
> > > }
> > >
> > > - err = move_pages_pte(mm, dst_pmd, src_pmd,
> > > - dst_vma, src_vma,
> > > - dst_addr, src_addr, mode);
> > > - step_size = PAGE_SIZE;
> > > + ret = move_pages_ptes(mm, dst_pmd, src_pmd,
> > > + dst_vma, src_vma, dst_addr,
> > > + src_addr, src_end - src_addr, mode);
> > > + if (ret > 0)
> > > + step_size = ret;
> > > + else
> > > + err = ret;
> > > }
> > >
> > > cond_resched();
> > >
> > > base-commit: 6e64f4580381e32c06ee146ca807c555b8f73e24
> > > --
> > > 2.50.1.565.gc32cd1483b-goog
> > >
> >
> > --
> > Peter Xu
> >
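As a side note, the calling convention used in the hunks above (a positive
byte count on success, a negative errno-style value on failure, with the
caller advancing by whatever was actually moved) can be sketched as a small,
self-contained C loop; the names below are made up for illustration and are
not the kernel functions:

#include <stdio.h>
#include <errno.h>

#define MOCK_PAGE_SIZE 4096L

/*
 * Stand-in for move_pages_ptes(): pretends that one call can batch at
 * most four pages and returns the number of bytes it covered (> 0), or
 * a negative errno-style value on failure.
 */
static long mock_move_pages_ptes(long len)
{
	long batch = 4 * MOCK_PAGE_SIZE;

	if (len <= 0)
		return -EINVAL;
	return len < batch ? len : batch;
}

int main(void)
{
	long start = 0, end = 10 * MOCK_PAGE_SIZE;
	long addr, moved = 0;

	for (addr = start; addr < end; ) {
		long ret = mock_move_pages_ptes(end - addr);

		if (ret <= 0) {
			fprintf(stderr, "move failed: %ld\n", ret);
			break;
		}
		/* Advance by however much this batch actually covered. */
		addr += ret;
		moved += ret;
	}
	printf("moved %ld of %ld bytes\n", moved, end - start);
	return 0;
}

This is only meant to show why step_size in the caller can now vary from a
single page up to however much one batch actually covered.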
* Re: [PATCH v3] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-08 16:29 ` Lokesh Gidra
2025-08-10 6:31 ` Lokesh Gidra
@ 2025-08-11 14:00 ` Peter Xu
2025-08-12 14:01 ` Lokesh Gidra
1 sibling, 1 reply; 8+ messages in thread
From: Peter Xu @ 2025-08-11 14:00 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, aarcange, linux-mm, linux-kernel, 21cnbao, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand
On Fri, Aug 08, 2025 at 09:29:58AM -0700, Lokesh Gidra wrote:
> On Thu, Aug 7, 2025 at 12:17 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Lokesh,
> >
> > On Thu, Aug 07, 2025 at 03:39:02AM -0700, Lokesh Gidra wrote:
> > > MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> > > for moving present pages. Mitigate this cost by opportunistically
> > > batching present contiguous pages for TLB flushing.
> > >
> > > Without batching, in our testing on an arm64 Android device with UFFD GC,
> > > which uses MOVE ioctl for compaction, we observed that out of the total
> > > time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> > > ~20% in vm_normal_folio().
> > >
> > > With batching, the proportion of vm_normal_folio() increases to over
> > > 70% of move_pages_pte() without any changes to vm_normal_folio().
> >
> > Do you know why vm_normal_folio() could be expensive? I still see quite
> > some other things this path needs to do.
> >
> Let's discuss this in Andrew's reply thread.
Sorry to get back to this late. Thanks for the link, Andrew!
>
> > > Furthermore, time spent within move_pages_pte() is only ~20%, which
> > > includes TLB-flush overhead.
> >
> > Indeed, this should already prove the optimization; I'm just curious whether
> > you've run some benchmark on the GC app to show the real-world benefit.
> >
> I did! When the same benchmark I used to gather these numbers was run on
> cuttlefish (a QEMU Android instance on x86_64), its completion time went
> down from ~45 mins to ~20 mins. The benchmark is very GC-intensive, and
> the overhead of IPIs on the vCPUs seems to be enormous, leading to this
> drastic improvement.
>
> In another instance, system_server, one of the most critical system
> processes on Android, saw an over 50% reduction in GC compaction time on
> an arm64 Android device.
Would you mind adding some of these numbers to the commit message when you
repost?
Thanks,
--
Peter Xu
* Re: [PATCH v3] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
2025-08-11 14:00 ` Peter Xu
@ 2025-08-12 14:01 ` Lokesh Gidra
0 siblings, 0 replies; 8+ messages in thread
From: Lokesh Gidra @ 2025-08-12 14:01 UTC (permalink / raw)
To: Peter Xu
Cc: akpm, aarcange, linux-mm, linux-kernel, 21cnbao, ngeoffray,
Suren Baghdasaryan, Kalesh Singh, Barry Song, David Hildenbrand
On Mon, Aug 11, 2025 at 7:00 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Aug 08, 2025 at 09:29:58AM -0700, Lokesh Gidra wrote:
> > On Thu, Aug 7, 2025 at 12:17 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, Lokesh,
> > >
> > > On Thu, Aug 07, 2025 at 03:39:02AM -0700, Lokesh Gidra wrote:
> > > > MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> > > > for moving present pages. Mitigate this cost by opportunistically
> > > > batching present contiguous pages for TLB flushing.
> > > >
> > > > Without batching, in our testing on an arm64 Android device with UFFD GC,
> > > > which uses MOVE ioctl for compaction, we observed that out of the total
> > > > time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> > > > ~20% in vm_normal_folio().
> > > >
> > > > With batching, the proportion of vm_normal_folio() increases to over
> > > > 70% of move_pages_pte() without any changes to vm_normal_folio().
> > >
> > > Do you know why vm_normal_folio() could be expensive? I still see quite
> > > some other things this path needs to do.
> > >
> > Let's discuss this in Andrew's reply thread.
>
> Sorry to get back to this late. Thanks for the link, Andrew!
>
> >
> > > > Furthermore, time spent within move_pages_pte() is only ~20%, which
> > > > includes TLB-flush overhead.
> > >
> > > Indeed, this should already prove the optimization; I'm just curious whether
> > > you've run some benchmark on the GC app to show the real-world benefit.
> > >
> > I did! When the same benchmark I used to gather these numbers was run on
> > cuttlefish (a QEMU Android instance on x86_64), its completion time went
> > down from ~45 mins to ~20 mins. The benchmark is very GC-intensive, and
> > the overhead of IPIs on the vCPUs seems to be enormous, leading to this
> > drastic improvement.
> >
> > In another instance, system_server, one of the most critical system
> > processes on Android, saw an over 50% reduction in GC compaction time on
> > an arm64 Android device.
>
> Would you mind adding some of these numbers to the commit message when you
> repost?
I can certainly do that in v5. I already sent v4 addressing all your other
feedback, so I'll fold these numbers, along with any additional changes
suggested on v4, into v5.
>
> Thanks,
>
> --
> Peter Xu
>