* [PATCH v4] mm: userfaultfd: fix race of userfaultfd_move and swap cache
@ 2025-06-04 15:10 Kairui Song
2025-06-04 15:34 ` Peter Xu
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Kairui Song @ 2025-06-04 15:10 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Barry Song, Peter Xu, Suren Baghdasaryan,
Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
On seeing a swap entry PTE, userfaultfd_move does a lockless swap
cache lookup, and tries to move the found folio to the faulting vma.
Currently, it relies on checking the PTE value to ensure that the moved
folio still belongs to the src swap entry and that no new folio has
been added to the swap cache, which turns out to be unreliable.
While working on and reviewing the swap table series with Barry, the
following existing races were observed and reproduced [1]:
In the example below, move_pages_pte is moving src_pte to dst_pte,
where src_pte is a swap entry PTE holding swap entry S1, and S1
is not in the swap cache:
CPU1                               CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1
    ... < interrupted> ...
                                   <swapin src_pte, alloc and use folio A>
                                   // folio A is a newly allocated folio
                                   // and gets installed into src_pte
                                   <frees swap entry S1>
                                   // src_pte now points to folio A, S1
                                   // has swap count == 0, it can be freed
                                   // by folio_free_swap or the swap
                                   // allocator's reclaim.
                                   <try to swap out another folio B>
                                   // folio B is a folio in another VMA.
                                   <put folio B into swap cache using S1>
                                   // S1 is freed, folio B can use it
                                   // for swap out with no problem.
    ...
    folio = filemap_get_folio(S1)
    // Got folio B here !!!
    ... < interrupted again> ...
                                   <swapin folio B and free S1>
                                   // Now S1 is free to be used again.
                                   <swapout src_pte & folio A using S1>
                                   // Now src_pte is a swap entry PTE
                                   // holding S1 again.
    folio_trylock(folio)
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // Moved invalid folio B here !!!
The race window is very short and requires multiple rare events to
collide, so it's very unlikely to happen, but with a deliberately
constructed reproducer and an increased time window, it can be
reproduced easily.
This can be fixed by checking, after acquiring the folio lock, whether
the folio returned by the filemap lookup is still the valid swap cache
folio for the target entry.
Another similar race is possible: filemap_get_folio may return NULL, but
folio (A) could be swapped in and then swapped out again using the same
swap entry after the lookup. In such a case, folio (A) may remain in the
swap cache, so it must be moved too:
CPU1                               CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1, and S1 is not in swap cache
    folio = filemap_get_folio(S1)
    // Got NULL
    ... < interrupted again> ...
                                   <swapin folio A and free S1>
                                   <swapout folio A re-using S1>
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // folio A is ignored !!!
Fix this by checking the swap cache again after acquiring the src_pte
lock. To avoid the filemap overhead, we check swap_map directly [2].
The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
far we don't need to worry about that, since folios can only be exposed
to the swap cache in the swap out path, and this is covered in this
patch by checking the swap cache again after acquiring the src_pte lock.
Testing with a simple C program that allocates and moves several GB of
memory did not show any observable performance change.
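For reference, a rough sketch of such a test could look like the program
below. This is not the exact program used for the measurement; it only
assumes the documented UFFDIO_MOVE uAPI (UFFD_FEATURE_MOVE,
UFFDIO_REGISTER_MODE_MISSING), trims error handling, and may need root
or vm.unprivileged_userfaultfd=1 to run:

/* Sketch only: allocate a buffer, populate it, and move it with UFFDIO_MOVE. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SIZE	(1UL << 30)	/* 1 GB per round, adjust as needed */

int main(void)
{
	/* Create the userfaultfd and negotiate the MOVE feature. */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_MOVE };

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		return 1;

	char *src = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *dst = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* The destination range must be registered in MISSING mode. */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)dst, .len = SIZE },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	if (src == MAP_FAILED || dst == MAP_FAILED ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg))
		return 1;

	memset(src, 0x5a, SIZE);	/* fault in the source pages */

	/* Move the whole range: src PTEs are cleared, dst receives the folios. */
	struct uffdio_move mv = {
		.dst = (unsigned long)dst,
		.src = (unsigned long)src,
		.len = SIZE,
		.mode = 0,
	};
	if (ioctl(uffd, UFFDIO_MOVE, &mv) || mv.move != SIZE)
		perror("UFFDIO_MOVE");

	close(uffd);
	return 0;
}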
Cc: <stable@vger.kernel.org>
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8eT=A@mail.gmail.com/ [2]
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Lokesh Gidra <lokeshgidra@google.com>
---
V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
Changes:
- Check swap_map instead of doing a filemap lookup after acquiring the
PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
Changes:
- Move the folio and swap check inside move_swap_pte to avoid skipping
the check and potential overhead [ Lokesh Gidra ]
- Add a READ_ONCE for the swap_map read to ensure it reads an up-to-date
  value.
V3: https://lore.kernel.org/all/20250602181419.20478-1-ryncsn@gmail.com/
Changes:
- Add more comments and more context in commit message.
mm/userfaultfd.c | 33 +++++++++++++++++++++++++++++++--
1 file changed, 31 insertions(+), 2 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..8253978ee0fb 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
                          pte_t orig_dst_pte, pte_t orig_src_pte,
                          pmd_t *dst_pmd, pmd_t dst_pmdval,
                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
-                         struct folio *src_folio)
+                         struct folio *src_folio,
+                         struct swap_info_struct *si, swp_entry_t entry)
 {
+       /*
+        * Check if the folio still belongs to the target swap entry after
+        * acquiring the lock. Folio can be freed in the swap cache while
+        * not locked.
+        */
+       if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
+                                 entry.val != src_folio->swap.val))
+               return -EAGAIN;
+
        double_pt_lock(dst_ptl, src_ptl);

        if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1102,6 +1112,25 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
        if (src_folio) {
                folio_move_anon_rmap(src_folio, dst_vma);
                src_folio->index = linear_page_index(dst_vma, dst_addr);
+       } else {
+               /*
+                * Check if the swap entry is cached after acquiring the src_pte
+                * lock. Otherwise, we might miss a newly loaded swap cache folio.
+                *
+                * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
+                * We are trying to catch newly added swap cache, the only possible case is
+                * when a folio is swapped in and out again staying in swap cache, using the
+                * same entry before the PTE check above. The PTL is acquired and released
+                * twice, each time after updating the swap_map's flag. So holding
+                * the PTL here ensures we see the updated value. False positive is possible,
+                * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the
+                * cache, or during the tiny synchronization window between swap cache and
+                * swap_map, but it will be gone very quickly, worst result is retry jitters.
+                */
+               if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
+                       double_pt_unlock(dst_ptl, src_ptl);
+                       return -EAGAIN;
+               }
        }

        orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1412,7 +1441,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
                }
                err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
                                orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
-                               dst_ptl, src_ptl, src_folio);
+                               dst_ptl, src_ptl, src_folio, si, entry);
        }

 out:
--
2.49.0
* Re: [PATCH v4] mm: userfaultfd: fix race of userfaultfd_move and swap cache
2025-06-04 15:10 [PATCH v4] mm: userfaultfd: fix race of userfaultfd_move and swap cache Kairui Song
@ 2025-06-04 15:34 ` Peter Xu
2025-06-04 15:39 ` Suren Baghdasaryan
2025-06-05 8:02 ` Barry Song
2025-06-06 0:14 ` Chris Li
2 siblings, 1 reply; 7+ messages in thread
From: Peter Xu @ 2025-06-04 15:34 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Barry Song, Suren Baghdasaryan,
Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
linux-kernel
On Wed, Jun 04, 2025 at 11:10:38PM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> On seeing a swap entry PTE, userfaultfd_move does a lockless swap
> cache lookup, and tries to move the found folio to the faulting vma.
> Currently, it relies on checking the PTE value to ensure that the moved
> folio still belongs to the src swap entry and that no new folio has
> been added to the swap cache, which turns out to be unreliable.
>
> While working and reviewing the swap table series with Barry, following
> existing races are observed and reproduced [1]:
>
> In the example below, move_pages_pte is moving src_pte to dst_pte,
> where src_pte is a swap entry PTE holding swap entry S1, and S1
> is not in the swap cache:
>
> CPU1 CPU2
> userfaultfd_move
> move_pages_pte()
> entry = pte_to_swp_entry(orig_src_pte);
> // Here it got entry = S1
> ... < interrupted> ...
> <swapin src_pte, alloc and use folio A>
> // folio A is a new allocated folio
> // and get installed into src_pte
> <frees swap entry S1>
> // src_pte now points to folio A, S1
> // has swap count == 0, it can be freed
> // by folio_swap_swap or swap
> // allocator's reclaim.
> <try to swap out another folio B>
> // folio B is a folio in another VMA.
> <put folio B to swap cache using S1 >
> // S1 is freed, folio B can use it
> // for swap out with no problem.
> ...
> folio = filemap_get_folio(S1)
> // Got folio B here !!!
> ... < interrupted again> ...
> <swapin folio B and free S1>
> // Now S1 is free to be used again.
> <swapout src_pte & folio A using S1>
> // Now src_pte is a swap entry PTE
> // holding S1 again.
> folio_trylock(folio)
> move_swap_pte
> double_pt_lock
> is_pte_pages_stable
> // Check passed because src_pte == S1
> folio_move_anon_rmap(...)
> // Moved invalid folio B here !!!
>
> The race window is very short and requires multiple collisions of
> multiple rare events, so it's very unlikely to happen, but with a
> deliberately constructed reproducer and increased time window, it
> can be reproduced easily.
>
> This can be fixed by checking if the folio returned by filemap is the
> valid swap cache folio after acquiring the folio lock.
>
> Another similar race is possible: filemap_get_folio may return NULL, but
> folio (A) could be swapped in and then swapped out again using the same
> swap entry after the lookup. In such a case, folio (A) may remain in the
> swap cache, so it must be moved too:
>
> CPU1 CPU2
> userfaultfd_move
> move_pages_pte()
> entry = pte_to_swp_entry(orig_src_pte);
> // Here it got entry = S1, and S1 is not in swap cache
> folio = filemap_get_folio(S1)
> // Got NULL
> ... < interrupted again> ...
> <swapin folio A and free S1>
> <swapout folio A re-using S1>
> move_swap_pte
> double_pt_lock
> is_pte_pages_stable
> // Check passed because src_pte == S1
> folio_move_anon_rmap(...)
> // folio A is ignored !!!
>
> Fix this by checking the swap cache again after acquiring the src_pte
> lock. And to avoid the filemap overhead, we check swap_map directly [2].
>
> The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> far we don't need to worry about that, since folios can only be exposed
> to the swap cache in the swap out path, and this is covered in this
> patch by checking the swap cache again after acquiring the src_pte lock.
>
> Testing with a simple C program that allocates and moves several GB of
> memory did not show any observable performance change.
>
> Cc: <stable@vger.kernel.org>
> Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8eT=A@mail.gmail.com/ [2]
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Reviewed-by: Lokesh Gidra <lokeshgidra@google.com>
>
> ---
>
> V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> Changes:
> - Check swap_map instead of doing a filemap lookup after acquiring the
> PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
>
> V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> Changes:
> - Move the folio and swap check inside move_swap_pte to avoid skipping
> the check and potential overhead [ Lokesh Gidra ]
> - Add a READ_ONCE for the swap_map read to ensure it reads a up to dated
> value.
>
> V3: https://lore.kernel.org/all/20250602181419.20478-1-ryncsn@gmail.com/
> Changes:
> - Add more comments and more context in commit message.
>
> mm/userfaultfd.c | 33 +++++++++++++++++++++++++++++++--
> 1 file changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..8253978ee0fb 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> pte_t orig_dst_pte, pte_t orig_src_pte,
> pmd_t *dst_pmd, pmd_t dst_pmdval,
> spinlock_t *dst_ptl, spinlock_t *src_ptl,
> - struct folio *src_folio)
> + struct folio *src_folio,
> + struct swap_info_struct *si, swp_entry_t entry)
> {
> + /*
> + * Check if the folio still belongs to the target swap entry after
> + * acquiring the lock. Folio can be freed in the swap cache while
> + * not locked.
> + */
> + if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> + entry.val != src_folio->swap.val))
> + return -EAGAIN;
> +
> double_pt_lock(dst_ptl, src_ptl);
>
> if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1102,6 +1112,25 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> if (src_folio) {
> folio_move_anon_rmap(src_folio, dst_vma);
> src_folio->index = linear_page_index(dst_vma, dst_addr);
> + } else {
> + /*
> + * Check if the swap entry is cached after acquiring the src_pte
> + * lock. Otherwise, we might miss a newly loaded swap cache folio.
> + *
> + * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
> + * We are trying to catch newly added swap cache, the only possible case is
> + * when a folio is swapped in and out again staying in swap cache, using the
> + * same entry before the PTE check above. The PTL is acquired and released
> + * twice, each time after updating the swap_map's flag. So holding
> + * the PTL here ensures we see the updated value. False positive is possible,
> + * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the
> + * cache, or during the tiny synchronization window between swap cache and
> + * swap_map, but it will be gone very quickly, worst result is retry jitters.
> + */
The comment above may not be the best I can think of, but I think I'm
already being too harsh. :) It's good enough for me. It's also great to
mention the 2nd race in the commit log, as Barry suggested.
Thank you!
Acked-by: Peter Xu <peterx@redhat.com>
> + if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
> + double_pt_unlock(dst_ptl, src_ptl);
> + return -EAGAIN;
> + }
> }
>
> orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1412,7 +1441,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> }
> err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> - dst_ptl, src_ptl, src_folio);
> + dst_ptl, src_ptl, src_folio, si, entry);
> }
>
> out:
> --
> 2.49.0
>
--
Peter Xu
* Re: [PATCH v4] mm: userfaultfd: fix race of userfaultfd_move and swap cache
2025-06-04 15:34 ` Peter Xu
@ 2025-06-04 15:39 ` Suren Baghdasaryan
0 siblings, 0 replies; 7+ messages in thread
From: Suren Baghdasaryan @ 2025-06-04 15:39 UTC (permalink / raw)
To: Peter Xu
Cc: Kairui Song, linux-mm, Andrew Morton, Barry Song,
Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
linux-kernel
On Wed, Jun 4, 2025 at 8:34 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jun 04, 2025 at 11:10:38PM +0800, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > On seeing a swap entry PTE, userfaultfd_move does a lockless swap
> > cache lookup, and tries to move the found folio to the faulting vma.
> > Currently, it relies on checking the PTE value to ensure that the moved
> > folio still belongs to the src swap entry and that no new folio has
> > been added to the swap cache, which turns out to be unreliable.
> >
> > While working and reviewing the swap table series with Barry, following
> > existing races are observed and reproduced [1]:
> >
> > In the example below, move_pages_pte is moving src_pte to dst_pte,
> > where src_pte is a swap entry PTE holding swap entry S1, and S1
> > is not in the swap cache:
> >
> > CPU1 CPU2
> > userfaultfd_move
> > move_pages_pte()
> > entry = pte_to_swp_entry(orig_src_pte);
> > // Here it got entry = S1
> > ... < interrupted> ...
> > <swapin src_pte, alloc and use folio A>
> > // folio A is a new allocated folio
> > // and get installed into src_pte
> > <frees swap entry S1>
> > // src_pte now points to folio A, S1
> > // has swap count == 0, it can be freed
> > // by folio_swap_swap or swap
> > // allocator's reclaim.
> > <try to swap out another folio B>
> > // folio B is a folio in another VMA.
> > <put folio B to swap cache using S1 >
> > // S1 is freed, folio B can use it
> > // for swap out with no problem.
> > ...
> > folio = filemap_get_folio(S1)
> > // Got folio B here !!!
> > ... < interrupted again> ...
> > <swapin folio B and free S1>
> > // Now S1 is free to be used again.
> > <swapout src_pte & folio A using S1>
> > // Now src_pte is a swap entry PTE
> > // holding S1 again.
> > folio_trylock(folio)
> > move_swap_pte
> > double_pt_lock
> > is_pte_pages_stable
> > // Check passed because src_pte == S1
> > folio_move_anon_rmap(...)
> > // Moved invalid folio B here !!!
> >
> > The race window is very short and requires multiple collisions of
> > multiple rare events, so it's very unlikely to happen, but with a
> > deliberately constructed reproducer and increased time window, it
> > can be reproduced easily.
> >
> > This can be fixed by checking if the folio returned by filemap is the
> > valid swap cache folio after acquiring the folio lock.
> >
> > Another similar race is possible: filemap_get_folio may return NULL, but
> > folio (A) could be swapped in and then swapped out again using the same
> > swap entry after the lookup. In such a case, folio (A) may remain in the
> > swap cache, so it must be moved too:
> >
> > CPU1 CPU2
> > userfaultfd_move
> > move_pages_pte()
> > entry = pte_to_swp_entry(orig_src_pte);
> > // Here it got entry = S1, and S1 is not in swap cache
> > folio = filemap_get_folio(S1)
> > // Got NULL
> > ... < interrupted again> ...
> > <swapin folio A and free S1>
> > <swapout folio A re-using S1>
> > move_swap_pte
> > double_pt_lock
> > is_pte_pages_stable
> > // Check passed because src_pte == S1
> > folio_move_anon_rmap(...)
> > // folio A is ignored !!!
> >
> > Fix this by checking the swap cache again after acquiring the src_pte
> > lock. And to avoid the filemap overhead, we check swap_map directly [2].
> >
> > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> > far we don't need to worry about that, since folios can only be exposed
> > to the swap cache in the swap out path, and this is covered in this
> > patch by checking the swap cache again after acquiring the src_pte lock.
> >
> > Testing with a simple C program that allocates and moves several GB of
> > memory did not show any observable performance change.
> >
> > Cc: <stable@vger.kernel.org>
> > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8eT=A@mail.gmail.com/ [2]
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > Reviewed-by: Lokesh Gidra <lokeshgidra@google.com>
Very interesting races. Thanks for the fix!
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> >
> > ---
> >
> > V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> > Changes:
> > - Check swap_map instead of doing a filemap lookup after acquiring the
> > PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
> >
> > V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> > Changes:
> > - Move the folio and swap check inside move_swap_pte to avoid skipping
> > the check and potential overhead [ Lokesh Gidra ]
> > - Add a READ_ONCE for the swap_map read to ensure it reads a up to dated
> > value.
> >
> > V3: https://lore.kernel.org/all/20250602181419.20478-1-ryncsn@gmail.com/
> > Changes:
> > - Add more comments and more context in commit message.
> >
> > mm/userfaultfd.c | 33 +++++++++++++++++++++++++++++++--
> > 1 file changed, 31 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..8253978ee0fb 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > pte_t orig_dst_pte, pte_t orig_src_pte,
> > pmd_t *dst_pmd, pmd_t dst_pmdval,
> > spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > - struct folio *src_folio)
> > + struct folio *src_folio,
> > + struct swap_info_struct *si, swp_entry_t entry)
> > {
> > + /*
> > + * Check if the folio still belongs to the target swap entry after
> > + * acquiring the lock. Folio can be freed in the swap cache while
> > + * not locked.
> > + */
> > + if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> > + entry.val != src_folio->swap.val))
> > + return -EAGAIN;
> > +
> > double_pt_lock(dst_ptl, src_ptl);
> >
> > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1102,6 +1112,25 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > if (src_folio) {
> > folio_move_anon_rmap(src_folio, dst_vma);
> > src_folio->index = linear_page_index(dst_vma, dst_addr);
> > + } else {
> > + /*
> > + * Check if the swap entry is cached after acquiring the src_pte
> > + * lock. Otherwise, we might miss a newly loaded swap cache folio.
> > + *
> > + * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
> > + * We are trying to catch newly added swap cache, the only possible case is
> > + * when a folio is swapped in and out again staying in swap cache, using the
> > + * same entry before the PTE check above. The PTL is acquired and released
> > + * twice, each time after updating the swap_map's flag. So holding
> > + * the PTL here ensures we see the updated value. False positive is possible,
> > + * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the
> > + * cache, or during the tiny synchronization window between swap cache and
> > + * swap_map, but it will be gone very quickly, worst result is retry jitters.
> > + */
>
> The comment above may not be the best I can think of, but I think I'm
> already too harsh. :) That's good enough to me. It's also great to
> mention the 2nd race too as Barry suggested in the commit log.
>
> Thank you!
>
> Acked-by: Peter Xu <peterx@redhat.com>
>
> > + if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
> > + double_pt_unlock(dst_ptl, src_ptl);
> > + return -EAGAIN;
> > + }
> > }
> >
> > orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > @@ -1412,7 +1441,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > }
> > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > - dst_ptl, src_ptl, src_folio);
> > + dst_ptl, src_ptl, src_folio, si, entry);
> > }
> >
> > out:
> > --
> > 2.49.0
> >
>
> --
> Peter Xu
>
* Re: [PATCH v4] mm: userfaultfd: fix race of userfaultfd_move and swap cache
2025-06-04 15:10 [PATCH v4] mm: userfaultfd: fix race of userfaultfd_move and swap cache Kairui Song
2025-06-04 15:34 ` Peter Xu
@ 2025-06-05 8:02 ` Barry Song
2025-06-06 0:14 ` Chris Li
2 siblings, 0 replies; 7+ messages in thread
From: Barry Song @ 2025-06-05 8:02 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Peter Xu, Suren Baghdasaryan,
Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
linux-kernel
On Thu, Jun 5, 2025 at 3:10 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> On seeing a swap entry PTE, userfaultfd_move does a lockless swap
> cache lookup, and tries to move the found folio to the faulting vma.
> Currently, it relies on checking the PTE value to ensure that the moved
> folio still belongs to the src swap entry and that no new folio has
> been added to the swap cache, which turns out to be unreliable.
>
> While working and reviewing the swap table series with Barry, following
> existing races are observed and reproduced [1]:
>
> In the example below, move_pages_pte is moving src_pte to dst_pte,
> where src_pte is a swap entry PTE holding swap entry S1, and S1
> is not in the swap cache:
>
> CPU1 CPU2
> userfaultfd_move
> move_pages_pte()
> entry = pte_to_swp_entry(orig_src_pte);
> // Here it got entry = S1
> ... < interrupted> ...
> <swapin src_pte, alloc and use folio A>
> // folio A is a new allocated folio
> // and get installed into src_pte
> <frees swap entry S1>
> // src_pte now points to folio A, S1
> // has swap count == 0, it can be freed
> // by folio_swap_swap or swap
> // allocator's reclaim.
> <try to swap out another folio B>
> // folio B is a folio in another VMA.
> <put folio B to swap cache using S1 >
> // S1 is freed, folio B can use it
> // for swap out with no problem.
> ...
> folio = filemap_get_folio(S1)
> // Got folio B here !!!
> ... < interrupted again> ...
> <swapin folio B and free S1>
> // Now S1 is free to be used again.
> <swapout src_pte & folio A using S1>
> // Now src_pte is a swap entry PTE
> // holding S1 again.
> folio_trylock(folio)
> move_swap_pte
> double_pt_lock
> is_pte_pages_stable
> // Check passed because src_pte == S1
> folio_move_anon_rmap(...)
> // Moved invalid folio B here !!!
>
> The race window is very short and requires multiple collisions of
> multiple rare events, so it's very unlikely to happen, but with a
> deliberately constructed reproducer and increased time window, it
> can be reproduced easily.
>
> This can be fixed by checking if the folio returned by filemap is the
> valid swap cache folio after acquiring the folio lock.
>
> Another similar race is possible: filemap_get_folio may return NULL, but
> folio (A) could be swapped in and then swapped out again using the same
> swap entry after the lookup. In such a case, folio (A) may remain in the
> swap cache, so it must be moved too:
>
> CPU1 CPU2
> userfaultfd_move
> move_pages_pte()
> entry = pte_to_swp_entry(orig_src_pte);
> // Here it got entry = S1, and S1 is not in swap cache
> folio = filemap_get_folio(S1)
> // Got NULL
> ... < interrupted again> ...
> <swapin folio A and free S1>
> <swapout folio A re-using S1>
> move_swap_pte
> double_pt_lock
> is_pte_pages_stable
> // Check passed because src_pte == S1
> folio_move_anon_rmap(...)
> // folio A is ignored !!!
>
> Fix this by checking the swap cache again after acquiring the src_pte
> lock. And to avoid the filemap overhead, we check swap_map directly [2].
>
> The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> far we don't need to worry about that, since folios can only be exposed
> to the swap cache in the swap out path, and this is covered in this
> patch by checking the swap cache again after acquiring the src_pte lock.
>
> Testing with a simple C program that allocates and moves several GB of
> memory did not show any observable performance change.
>
> Cc: <stable@vger.kernel.org>
> Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8eT=A@mail.gmail.com/ [2]
> Signed-off-by: Kairui Song <kasong@tencent.com>
Many thanks!
Reviewed-by: Barry Song <baohua@kernel.org>
> Reviewed-by: Lokesh Gidra <lokeshgidra@google.com>
>
> ---
>
> V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> Changes:
> - Check swap_map instead of doing a filemap lookup after acquiring the
> PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
>
> V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> Changes:
> - Move the folio and swap check inside move_swap_pte to avoid skipping
> the check and potential overhead [ Lokesh Gidra ]
> - Add a READ_ONCE for the swap_map read to ensure it reads a up to dated
> value.
>
> V3: https://lore.kernel.org/all/20250602181419.20478-1-ryncsn@gmail.com/
> Changes:
> - Add more comments and more context in commit message.
>
> mm/userfaultfd.c | 33 +++++++++++++++++++++++++++++++--
> 1 file changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..8253978ee0fb 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> pte_t orig_dst_pte, pte_t orig_src_pte,
> pmd_t *dst_pmd, pmd_t dst_pmdval,
> spinlock_t *dst_ptl, spinlock_t *src_ptl,
> - struct folio *src_folio)
> + struct folio *src_folio,
> + struct swap_info_struct *si, swp_entry_t entry)
> {
> + /*
> + * Check if the folio still belongs to the target swap entry after
> + * acquiring the lock. Folio can be freed in the swap cache while
> + * not locked.
> + */
> + if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> + entry.val != src_folio->swap.val))
> + return -EAGAIN;
> +
> double_pt_lock(dst_ptl, src_ptl);
>
> if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1102,6 +1112,25 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> if (src_folio) {
> folio_move_anon_rmap(src_folio, dst_vma);
> src_folio->index = linear_page_index(dst_vma, dst_addr);
> + } else {
> + /*
> + * Check if the swap entry is cached after acquiring the src_pte
> + * lock. Otherwise, we might miss a newly loaded swap cache folio.
> + *
> + * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
> + * We are trying to catch newly added swap cache, the only possible case is
> + * when a folio is swapped in and out again staying in swap cache, using the
> + * same entry before the PTE check above. The PTL is acquired and released
> + * twice, each time after updating the swap_map's flag. So holding
> + * the PTL here ensures we see the updated value. False positive is possible,
> + * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the
> + * cache, or during the tiny synchronization window between swap cache and
> + * swap_map, but it will be gone very quickly, worst result is retry jitters.
> + */
> + if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
> + double_pt_unlock(dst_ptl, src_ptl);
> + return -EAGAIN;
> + }
> }
>
> orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1412,7 +1441,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> }
> err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> - dst_ptl, src_ptl, src_folio);
> + dst_ptl, src_ptl, src_folio, si, entry);
> }
>
> out:
> --
> 2.49.0
>
* Re: [PATCH v4] mm: userfaultfd: fix race of userfaultfd_move and swap cache
2025-06-04 15:10 [PATCH v4] mm: userfaultfd: fix race of userfaultfd_move and swap cache Kairui Song
2025-06-04 15:34 ` Peter Xu
2025-06-05 8:02 ` Barry Song
@ 2025-06-06 0:14 ` Chris Li
2025-06-11 5:16 ` Kairui Song
2 siblings, 1 reply; 7+ messages in thread
From: Chris Li @ 2025-06-06 0:14 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Barry Song, Peter Xu, Suren Baghdasaryan,
Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
linux-kernel
On Wed, Jun 4, 2025 at 8:10 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> On seeing a swap entry PTE, userfaultfd_move does a lockless swap
> cache lookup, and tries to move the found folio to the faulting vma.
> Currently, it relies on checking the PTE value to ensure that the moved
> folio still belongs to the src swap entry and that no new folio has
> been added to the swap cache, which turns out to be unreliable.
>
> While working and reviewing the swap table series with Barry, following
> existing races are observed and reproduced [1]:
>
> In the example below, move_pages_pte is moving src_pte to dst_pte,
> where src_pte is a swap entry PTE holding swap entry S1, and S1
> is not in the swap cache:
>
> CPU1 CPU2
> userfaultfd_move
> move_pages_pte()
> entry = pte_to_swp_entry(orig_src_pte);
> // Here it got entry = S1
> ... < interrupted> ...
> <swapin src_pte, alloc and use folio A>
> // folio A is a new allocated folio
> // and get installed into src_pte
> <frees swap entry S1>
> // src_pte now points to folio A, S1
> // has swap count == 0, it can be freed
> // by folio_swap_swap or swap
> // allocator's reclaim.
> <try to swap out another folio B>
> // folio B is a folio in another VMA.
> <put folio B to swap cache using S1 >
> // S1 is freed, folio B can use it
> // for swap out with no problem.
> ...
> folio = filemap_get_folio(S1)
> // Got folio B here !!!
> ... < interrupted again> ...
> <swapin folio B and free S1>
> // Now S1 is free to be used again.
> <swapout src_pte & folio A using S1>
> // Now src_pte is a swap entry PTE
> // holding S1 again.
> folio_trylock(folio)
> move_swap_pte
> double_pt_lock
> is_pte_pages_stable
> // Check passed because src_pte == S1
> folio_move_anon_rmap(...)
> // Moved invalid folio B here !!!
>
> The race window is very short and requires multiple collisions of
> multiple rare events, so it's very unlikely to happen, but with a
> deliberately constructed reproducer and increased time window, it
> can be reproduced easily.
Thanks for the fix.
Please spell out clearly what the consequence of the race is if it is
triggered. I assume possible data loss? That should be mentioned in the
first few sentences of the commit message as the user-visible impact.
>
> This can be fixed by checking if the folio returned by filemap is the
> valid swap cache folio after acquiring the folio lock.
>
> Another similar race is possible: filemap_get_folio may return NULL, but
> folio (A) could be swapped in and then swapped out again using the same
> swap entry after the lookup. In such a case, folio (A) may remain in the
> swap cache, so it must be moved too:
>
> CPU1 CPU2
> userfaultfd_move
> move_pages_pte()
> entry = pte_to_swp_entry(orig_src_pte);
> // Here it got entry = S1, and S1 is not in swap cache
> folio = filemap_get_folio(S1)
> // Got NULL
> ... < interrupted again> ...
> <swapin folio A and free S1>
> <swapout folio A re-using S1>
> move_swap_pte
> double_pt_lock
> is_pte_pages_stable
> // Check passed because src_pte == S1
> folio_move_anon_rmap(...)
> // folio A is ignored !!!
>
> Fix this by checking the swap cache again after acquiring the src_pte
> lock. And to avoid the filemap overhead, we check swap_map directly [2].
>
> The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> far we don't need to worry about that, since folios can only be exposed
> to the swap cache in the swap out path, and this is covered in this
> patch by checking the swap cache again after acquiring the src_pte lock.
>
> Testing with a simple C program that allocates and moves several GB of
> memory did not show any observable performance change.
>
> Cc: <stable@vger.kernel.org>
> Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8eT=A@mail.gmail.com/ [2]
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Reviewed-by: Lokesh Gidra <lokeshgidra@google.com>
>
> ---
>
> V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> Changes:
> - Check swap_map instead of doing a filemap lookup after acquiring the
> PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
>
> V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> Changes:
> - Move the folio and swap check inside move_swap_pte to avoid skipping
> the check and potential overhead [ Lokesh Gidra ]
> - Add a READ_ONCE for the swap_map read to ensure it reads a up to dated
> value.
>
> V3: https://lore.kernel.org/all/20250602181419.20478-1-ryncsn@gmail.com/
> Changes:
> - Add more comments and more context in commit message.
>
> mm/userfaultfd.c | 33 +++++++++++++++++++++++++++++++--
> 1 file changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..8253978ee0fb 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> pte_t orig_dst_pte, pte_t orig_src_pte,
> pmd_t *dst_pmd, pmd_t dst_pmdval,
> spinlock_t *dst_ptl, spinlock_t *src_ptl,
> - struct folio *src_folio)
> + struct folio *src_folio,
> + struct swap_info_struct *si, swp_entry_t entry)
> {
> + /*
> + * Check if the folio still belongs to the target swap entry after
> + * acquiring the lock. Folio can be freed in the swap cache while
> + * not locked.
> + */
> + if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> + entry.val != src_folio->swap.val))
> + return -EAGAIN;
> +
> double_pt_lock(dst_ptl, src_ptl);
>
> if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1102,6 +1112,25 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> if (src_folio) {
> folio_move_anon_rmap(src_folio, dst_vma);
> src_folio->index = linear_page_index(dst_vma, dst_addr);
> + } else {
> + /*
> + * Check if the swap entry is cached after acquiring the src_pte
> + * lock. Otherwise, we might miss a newly loaded swap cache folio.
> + *
> + * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
> + * We are trying to catch newly added swap cache, the only possible case is
> + * when a folio is swapped in and out again staying in swap cache, using the
> + * same entry before the PTE check above. The PTL is acquired and released
> + * twice, each time after updating the swap_map's flag. So holding
> + * the PTL here ensures we see the updated value. False positive is possible,
> + * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the
> + * cache, or during the tiny synchronization window between swap cache and
> + * swap_map, but it will be gone very quickly, worst result is retry jitters.
> + */
> + if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
Nit: You can use "} else if {" to save one level of indentation.
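For illustration, that restructuring would look roughly like the sketch
below (this is not the hunk that was merged, and the long comment from
the patch is elided):

        if (src_folio) {
                folio_move_anon_rmap(src_folio, dst_vma);
                src_folio->index = linear_page_index(dst_vma, dst_addr);
        } else if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
                /* A newly added swap cache folio may have been missed, retry. */
                double_pt_unlock(dst_ptl, src_ptl);
                return -EAGAIN;
        }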
Reviewed-by: Chris Li <chrisl@kernel.org>
Chris
* Re: [PATCH v4] mm: userfaultfd: fix race of userfaultfd_move and swap cache
2025-06-06 0:14 ` Chris Li
@ 2025-06-11 5:16 ` Kairui Song
2025-06-12 0:19 ` Andrew Morton
0 siblings, 1 reply; 7+ messages in thread
From: Kairui Song @ 2025-06-11 5:16 UTC (permalink / raw)
To: Chris Li, Andrew Morton
Cc: linux-mm, Barry Song, Peter Xu, Suren Baghdasaryan,
Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
linux-kernel
On Fri, Jun 6, 2025 at 8:15 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Wed, Jun 4, 2025 at 8:10 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > On seeing a swap entry PTE, userfaultfd_move does a lockless swap
> > cache lookup, and tries to move the found folio to the faulting vma.
> > Currently, it relies on checking the PTE value to ensure that the moved
> > folio still belongs to the src swap entry and that no new folio has
> > been added to the swap cache, which turns out to be unreliable.
> >
> > While working and reviewing the swap table series with Barry, following
> > existing races are observed and reproduced [1]:
> >
> > In the example below, move_pages_pte is moving src_pte to dst_pte,
> > where src_pte is a swap entry PTE holding swap entry S1, and S1
> > is not in the swap cache:
> >
> > CPU1 CPU2
> > userfaultfd_move
> > move_pages_pte()
> > entry = pte_to_swp_entry(orig_src_pte);
> > // Here it got entry = S1
> > ... < interrupted> ...
> > <swapin src_pte, alloc and use folio A>
> > // folio A is a new allocated folio
> > // and get installed into src_pte
> > <frees swap entry S1>
> > // src_pte now points to folio A, S1
> > // has swap count == 0, it can be freed
> > // by folio_swap_swap or swap
> > // allocator's reclaim.
> > <try to swap out another folio B>
> > // folio B is a folio in another VMA.
> > <put folio B to swap cache using S1 >
> > // S1 is freed, folio B can use it
> > // for swap out with no problem.
> > ...
> > folio = filemap_get_folio(S1)
> > // Got folio B here !!!
> > ... < interrupted again> ...
> > <swapin folio B and free S1>
> > // Now S1 is free to be used again.
> > <swapout src_pte & folio A using S1>
> > // Now src_pte is a swap entry PTE
> > // holding S1 again.
> > folio_trylock(folio)
> > move_swap_pte
> > double_pt_lock
> > is_pte_pages_stable
> > // Check passed because src_pte == S1
> > folio_move_anon_rmap(...)
> > // Moved invalid folio B here !!!
> >
> > The race window is very short and requires multiple collisions of
> > multiple rare events, so it's very unlikely to happen, but with a
> > deliberately constructed reproducer and increased time window, it
> > can be reproduced easily.
>
> Thanks for the fix.
>
> Please spell out clearly what is the consequence of the race if
> triggered. I assume possible data lost? That should be mentioned in
> the first few sentences of the commit message as the user's visible
> impact.
Hi Chris,

This commit fixes two kinds of races, and they may have different results:

Barry reported a BUG_ON in commit c50f8e6053b0; we may hit the same
BUG_ON if the filemap lookup returns NULL and a folio is added to the
swap cache after that.

If the other kind of race is triggered (the folio changed after the
lookup), we may see the RSS counters get corrupted:

[  406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_ANONPAGES val:-1
[  406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_SHMEMPAGES val:1

This is because the folio is being accounted to the wrong VMA.

I'm not sure there will be any data corruption, though; it seems not.
The issues above are critical already.
>
> >
> > This can be fixed by checking if the folio returned by filemap is the
> > valid swap cache folio after acquiring the folio lock.
> >
> > Another similar race is possible: filemap_get_folio may return NULL, but
> > folio (A) could be swapped in and then swapped out again using the same
> > swap entry after the lookup. In such a case, folio (A) may remain in the
> > swap cache, so it must be moved too:
> >
> > CPU1 CPU2
> > userfaultfd_move
> > move_pages_pte()
> > entry = pte_to_swp_entry(orig_src_pte);
> > // Here it got entry = S1, and S1 is not in swap cache
> > folio = filemap_get_folio(S1)
> > // Got NULL
> > ... < interrupted again> ...
> > <swapin folio A and free S1>
> > <swapout folio A re-using S1>
> > move_swap_pte
> > double_pt_lock
> > is_pte_pages_stable
> > // Check passed because src_pte == S1
> > folio_move_anon_rmap(...)
> > // folio A is ignored !!!
> >
> > Fix this by checking the swap cache again after acquiring the src_pte
> > lock. And to avoid the filemap overhead, we check swap_map directly [2].
> >
> > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> > far we don't need to worry about that, since folios can only be exposed
> > to the swap cache in the swap out path, and this is covered in this
> > patch by checking the swap cache again after acquiring the src_pte lock.
> >
> > Testing with a simple C program that allocates and moves several GB of
> > memory did not show any observable performance change.
> >
> > Cc: <stable@vger.kernel.org>
> > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8eT=A@mail.gmail.com/ [2]
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > Reviewed-by: Lokesh Gidra <lokeshgidra@google.com>
> >
> > ---
> >
> > V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> > Changes:
> > - Check swap_map instead of doing a filemap lookup after acquiring the
> > PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
> >
> > V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> > Changes:
> > - Move the folio and swap check inside move_swap_pte to avoid skipping
> > the check and potential overhead [ Lokesh Gidra ]
> > - Add a READ_ONCE for the swap_map read to ensure it reads a up to dated
> > value.
> >
> > V3: https://lore.kernel.org/all/20250602181419.20478-1-ryncsn@gmail.com/
> > Changes:
> > - Add more comments and more context in commit message.
> >
> > mm/userfaultfd.c | 33 +++++++++++++++++++++++++++++++--
> > 1 file changed, 31 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..8253978ee0fb 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > pte_t orig_dst_pte, pte_t orig_src_pte,
> > pmd_t *dst_pmd, pmd_t dst_pmdval,
> > spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > - struct folio *src_folio)
> > + struct folio *src_folio,
> > + struct swap_info_struct *si, swp_entry_t entry)
> > {
> > + /*
> > + * Check if the folio still belongs to the target swap entry after
> > + * acquiring the lock. Folio can be freed in the swap cache while
> > + * not locked.
> > + */
> > + if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> > + entry.val != src_folio->swap.val))
> > + return -EAGAIN;
> > +
> > double_pt_lock(dst_ptl, src_ptl);
> >
> > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1102,6 +1112,25 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > if (src_folio) {
> > folio_move_anon_rmap(src_folio, dst_vma);
> > src_folio->index = linear_page_index(dst_vma, dst_addr);
> > + } else {
> > + /*
> > + * Check if the swap entry is cached after acquiring the src_pte
> > + * lock. Otherwise, we might miss a newly loaded swap cache folio.
> > + *
> > + * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
> > + * We are trying to catch newly added swap cache, the only possible case is
> > + * when a folio is swapped in and out again staying in swap cache, using the
> > + * same entry before the PTE check above. The PTL is acquired and released
> > + * twice, each time after updating the swap_map's flag. So holding
> > + * the PTL here ensures we see the updated value. False positive is possible,
> > + * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the
> > + * cache, or during the tiny synchronization window between swap cache and
> > + * swap_map, but it will be gone very quickly, worst result is retry jitters.
> > + */
> > + if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
>
> Nit: You can use "} else if {" to save one level of indentation.
>
> Reviewed-by: Chris Li <chrisl@kernel.org>
Thanks!
* Re: [PATCH v4] mm: userfaultfd: fix race of userfaultfd_move and swap cache
2025-06-11 5:16 ` Kairui Song
@ 2025-06-12 0:19 ` Andrew Morton
0 siblings, 0 replies; 7+ messages in thread
From: Andrew Morton @ 2025-06-12 0:19 UTC (permalink / raw)
To: Kairui Song
Cc: Chris Li, linux-mm, Barry Song, Peter Xu, Suren Baghdasaryan,
Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
linux-kernel
On Wed, 11 Jun 2025 13:16:25 +0800 Kairui Song <ryncsn@gmail.com> wrote:
> This commit fixes two kinds of races, they may have different results:
>
> Barry reported a BUG_ON in commit c50f8e6053b0, we may see the same
> BUG_ON if the filemap lookup returned NULL and folio is added to swap
> cache after that.
>
> If another kind of race is triggered (folio changed after lookup) we
> may see RSS counter is corrupted:
>
> [ 406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
> type:MM_ANONPAGES val:-1
> [ 406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
> type:MM_SHMEMPAGES val:1
>
> Because the folio is being accounted to the wrong VMA.
>
> I'm not sure if there will be any data corruption though, seems no.
> The issues above are critical already.
Thanks, I pasted this into the patch's changelog.