linux-mm.kvack.org archive mirror
* [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache
@ 2025-06-02 18:14 Kairui Song
  2025-06-02 20:17 ` Lokesh Gidra
  2025-06-02 20:34 ` Peter Xu
  0 siblings, 2 replies; 7+ messages in thread
From: Kairui Song @ 2025-06-02 18:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Barry Song, Peter Xu, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
lookup, and tries to move the found folio to the faulting vma. Currently,
it relies on checking the PTE value to ensure the moved folio still
belongs to the src swap entry, which turns out not to be reliable.

While working on and reviewing the swap table series with Barry, the
following existing race was observed and reproduced [1]:

( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
 swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)

CPU1                               CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1
    ... < Somehow interrupted> ...
                                   <swapin src_pte, alloc and use folio A>
                                   // folio A is a newly allocated folio
                                   // and gets installed into src_pte
                                   <frees swap entry S1>
                                   // src_pte now points to folio A, S1
                                   // has swap count == 0, it can be freed
                                   // by folio_free_swap or swap
                                   // allocator's reclaim.
                                   <try to swap out another folio B>
                                   // folio B is a folio in another VMA.
                                   <put folio B to swap cache using S1 >
                                   // S1 is freed, folio B could use it
                                   // for swap out with no problem.
                                   ...
    folio = filemap_get_folio(S1)
    // Got folio B here !!!
    ... < Somehow interrupted again> ...
                                   <swapin folio B and free S1>
                                   // Now S1 is free to be used again.
                                   <swapout src_pte & folio A using S1>
                                   // Now src_pte is a swap entry pte
                                   // holding S1 again.
    folio_trylock(folio)
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // Moved invalid folio B here !!!

The race window is very short and requires multiple rare events to
collide, so it's very unlikely to happen, but with a deliberately
constructed reproducer and an increased time window, it can be
reproduced [1].

It's also possible that folio (A) is swapped in, and swapped out again
after the filemap_get_folio lookup; in such a case folio (A) may stay in
the swap cache, so it needs to be moved too. In this case we should also
retry so the kernel won't miss a folio move.

Fix this by checking that the folio is still the valid swap cache folio
after acquiring the folio lock, and by checking the swap cache again after
acquiring the src_pte lock.

The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so far
we don't need to worry about it, since folios can only get exposed to the
swap cache in the swap-out path, and that case is covered by this patch
too, by checking the swap cache again after acquiring the src_pte lock.

Testing with a simple C program to allocate and move several GB of memory
did not show any observable performance change.

Cc: <stable@vger.kernel.org>
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
Signed-off-by: Kairui Song <kasong@tencent.com>

---

V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
Changes:
- Check swap_map instead of doing a filemap lookup after acquiring the
  PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]

V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
Changes:
- Move the folio and swap check inside move_swap_pte to avoid skipping
  the check and potential overhead [ Lokesh Gidra ]
- Add a READ_ONCE for the swap_map read to ensure it reads an up-to-date
  value.
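
For reference, the "simple C program" mentioned above was roughly of the
following shape (a rough sketch only, not the exact program used; sizes,
flags and error handling are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

#define TEST_SIZE	(1UL << 30)	/* 1GB per run, repeat/scale as needed */

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_MOVE };
	struct uffdio_register reg;
	struct uffdio_move mv;
	char *src, *dst;
	int uffd;

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		return 1;

	src = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	dst = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED || dst == MAP_FAILED)
		return 1;

	/* Fault in and dirty the source pages so there is something to move */
	memset(src, 0x5a, TEST_SIZE);

	/* UFFDIO_MOVE requires the destination range to be registered */
	reg.range.start = (unsigned long)dst;
	reg.range.len = TEST_SIZE;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		return 1;

	/* Move the whole source range onto the destination */
	mv.dst = (unsigned long)dst;
	mv.src = (unsigned long)src;
	mv.len = TEST_SIZE;
	mv.mode = 0;
	if (ioctl(uffd, UFFDIO_MOVE, &mv))
		return 1;

	printf("moved %lld bytes\n", (long long)mv.move);
	return 0;
}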

 mm/userfaultfd.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..5dc05346e360 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 			 pte_t orig_dst_pte, pte_t orig_src_pte,
 			 pmd_t *dst_pmd, pmd_t dst_pmdval,
 			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
-			 struct folio *src_folio)
+			 struct folio *src_folio,
+			 struct swap_info_struct *si, swp_entry_t entry)
 {
+	/*
+	 * Check if the folio still belongs to the target swap entry after
+	 * acquiring the lock. Folio can be freed in the swap cache while
+	 * not locked.
+	 */
+	if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
+				  entry.val != src_folio->swap.val))
+		return -EAGAIN;
+
 	double_pt_lock(dst_ptl, src_ptl);
 
 	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 	if (src_folio) {
 		folio_move_anon_rmap(src_folio, dst_vma);
 		src_folio->index = linear_page_index(dst_vma, dst_addr);
+	} else {
+		/*
+		 * Check if the swap entry is cached after acquiring the src_pte
+		 * lock. Or we might miss a new loaded swap cache folio.
+		 */
+		if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
 	}
 
 	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1412,7 +1431,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		}
 		err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
 				orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
-				dst_ptl, src_ptl, src_folio);
+				dst_ptl, src_ptl, src_folio, si, entry);
 	}
 
 out:
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-06-02 18:14 [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache Kairui Song
@ 2025-06-02 20:17 ` Lokesh Gidra
  2025-06-02 20:34 ` Peter Xu
  1 sibling, 0 replies; 7+ messages in thread
From: Lokesh Gidra @ 2025-06-02 20:17 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Barry Song, Peter Xu, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, stable, linux-kernel,
	Kalesh Singh

On Mon, Jun 2, 2025 at 11:14 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> lookup, and try to move the found folio to the faulting vma when.
> Currently, it relies on the PTE value check to ensure the moved folio
> still belongs to the src swap entry, which turns out is not reliable.
>
> While working and reviewing the swap table series with Barry, following
> existing race is observed and reproduced [1]:
>
> ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
>  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
>
> CPU1                               CPU2
> userfaultfd_move
>   move_pages_pte()
>     entry = pte_to_swp_entry(orig_src_pte);
>     // Here it got entry = S1
>     ... < Somehow interrupted> ...
>                                    <swapin src_pte, alloc and use folio A>
>                                    // folio A is just a new allocated folio
>                                    // and get installed into src_pte
>                                    <frees swap entry S1>
>                                    // src_pte now points to folio A, S1
>                                    // has swap count == 0, it can be freed
>                                    // by folio_swap_swap or swap
>                                    // allocator's reclaim.
>                                    <try to swap out another folio B>
>                                    // folio B is a folio in another VMA.
>                                    <put folio B to swap cache using S1 >
>                                    // S1 is freed, folio B could use it
>                                    // for swap out with no problem.
>                                    ...
>     folio = filemap_get_folio(S1)
>     // Got folio B here !!!
>     ... < Somehow interrupted again> ...
>                                    <swapin folio B and free S1>
>                                    // Now S1 is free to be used again.
>                                    <swapout src_pte & folio A using S1>
>                                    // Now src_pte is a swap entry pte
>                                    // holding S1 again.
>     folio_trylock(folio)
>     move_swap_pte
>       double_pt_lock
>       is_pte_pages_stable
>       // Check passed because src_pte == S1
>       folio_move_anon_rmap(...)
>       // Moved invalid folio B here !!!
>
> The race window is very short and requires multiple collisions of
> multiple rare events, so it's very unlikely to happen, but with a
> deliberately constructed reproducer and increased time window, it can be
> reproduced [1].
>
> It's also possible that folio (A) is swapped in, and swapped out again
> after the filemap_get_folio lookup, in such case folio (A) may stay in
> swap cache so it needs to be moved too. In this case we should also try
> again so kernel won't miss a folio move.
>
> Fix this by checking if the folio is the valid swap cache folio after
> acquiring the folio lock, and checking the swap cache again after
> acquiring the src_pte lock.
>
> SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> we don't need to worry about that since folios only might get exposed to
> swap cache in the swap out path, and it's covered in this patch too by
> checking the swap cache again after acquiring src_pte lock.
>
> Testing with a simple C program to allocate and move several GB of memory
> did not show any observable performance change.
>
> Cc: <stable@vger.kernel.org>
> Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Lokesh Gidra <lokeshgidra@google.com>
>
> ---
>
> V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> Changes:
> - Check swap_map instead of doing a filemap lookup after acquiring the
>   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
>
> V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> Changes:
> - Move the folio and swap check inside move_swap_pte to avoid skipping
>   the check and potential overhead [ Lokesh Gidra ]
> - Add a READ_ONCE for the swap_map read to ensure it reads a up to dated
>   value.
>
>  mm/userfaultfd.c | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..5dc05346e360 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>                          pte_t orig_dst_pte, pte_t orig_src_pte,
>                          pmd_t *dst_pmd, pmd_t dst_pmdval,
>                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> -                        struct folio *src_folio)
> +                        struct folio *src_folio,
> +                        struct swap_info_struct *si, swp_entry_t entry)
>  {
> +       /*
> +        * Check if the folio still belongs to the target swap entry after
> +        * acquiring the lock. Folio can be freed in the swap cache while
> +        * not locked.
> +        */
> +       if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> +                                 entry.val != src_folio->swap.val))
> +               return -EAGAIN;
> +
>         double_pt_lock(dst_ptl, src_ptl);
>
>         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>         if (src_folio) {
>                 folio_move_anon_rmap(src_folio, dst_vma);
>                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> +       } else {
> +               /*
> +                * Check if the swap entry is cached after acquiring the src_pte
> +                * lock. Or we might miss a new loaded swap cache folio.
> +                */
> +               if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       return -EAGAIN;
> +               }
>         }
>
>         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1412,7 +1431,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>                 }
>                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
>                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> -                               dst_ptl, src_ptl, src_folio);
> +                               dst_ptl, src_ptl, src_folio, si, entry);
>         }
>
>  out:
> --
> 2.49.0
>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-06-02 18:14 [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache Kairui Song
  2025-06-02 20:17 ` Lokesh Gidra
@ 2025-06-02 20:34 ` Peter Xu
  2025-06-02 22:08   ` Barry Song
  2025-06-03  7:03   ` Kairui Song
  1 sibling, 2 replies; 7+ messages in thread
From: Peter Xu @ 2025-06-02 20:34 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Barry Song, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
	linux-kernel

On Tue, Jun 03, 2025 at 02:14:19AM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> lookup, and try to move the found folio to the faulting vma when.
> Currently, it relies on the PTE value check to ensure the moved folio
> still belongs to the src swap entry, which turns out is not reliable.
> 
> While working and reviewing the swap table series with Barry, following
> existing race is observed and reproduced [1]:
> 
> ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
>  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> 
> CPU1                               CPU2
> userfaultfd_move
>   move_pages_pte()
>     entry = pte_to_swp_entry(orig_src_pte);
>     // Here it got entry = S1
>     ... < Somehow interrupted> ...
>                                    <swapin src_pte, alloc and use folio A>
>                                    // folio A is just a new allocated folio
>                                    // and get installed into src_pte
>                                    <frees swap entry S1>
>                                    // src_pte now points to folio A, S1
>                                    // has swap count == 0, it can be freed
>                                    // by folio_swap_swap or swap
>                                    // allocator's reclaim.
>                                    <try to swap out another folio B>
>                                    // folio B is a folio in another VMA.
>                                    <put folio B to swap cache using S1 >
>                                    // S1 is freed, folio B could use it
>                                    // for swap out with no problem.
>                                    ...
>     folio = filemap_get_folio(S1)
>     // Got folio B here !!!
>     ... < Somehow interrupted again> ...
>                                    <swapin folio B and free S1>
>                                    // Now S1 is free to be used again.
>                                    <swapout src_pte & folio A using S1>
>                                    // Now src_pte is a swap entry pte
>                                    // holding S1 again.
>     folio_trylock(folio)
>     move_swap_pte
>       double_pt_lock
>       is_pte_pages_stable
>       // Check passed because src_pte == S1
>       folio_move_anon_rmap(...)
>       // Moved invalid folio B here !!!
> 
> The race window is very short and requires multiple collisions of
> multiple rare events, so it's very unlikely to happen, but with a
> deliberately constructed reproducer and increased time window, it can be
> reproduced [1].
> 
> It's also possible that folio (A) is swapped in, and swapped out again
> after the filemap_get_folio lookup, in such case folio (A) may stay in
> swap cache so it needs to be moved too. In this case we should also try
> again so kernel won't miss a folio move.
> 
> Fix this by checking if the folio is the valid swap cache folio after
> acquiring the folio lock, and checking the swap cache again after
> acquiring the src_pte lock.
> 
> SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> we don't need to worry about that since folios only might get exposed to
> swap cache in the swap out path, and it's covered in this patch too by
> checking the swap cache again after acquiring src_pte lock.

[1]

> 
> Testing with a simple C program to allocate and move several GB of memory
> did not show any observable performance change.
> 
> Cc: <stable@vger.kernel.org>
> Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> Signed-off-by: Kairui Song <kasong@tencent.com>
> 
> ---
> 
> V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> Changes:
> - Check swap_map instead of doing a filemap lookup after acquiring the
>   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
> 
> V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> Changes:
> - Move the folio and swap check inside move_swap_pte to avoid skipping
>   the check and potential overhead [ Lokesh Gidra ]
> - Add a READ_ONCE for the swap_map read to ensure it reads a up to dated
>   value.
> 
>  mm/userfaultfd.c | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..5dc05346e360 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>  			 pte_t orig_dst_pte, pte_t orig_src_pte,
>  			 pmd_t *dst_pmd, pmd_t dst_pmdval,
>  			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
> -			 struct folio *src_folio)
> +			 struct folio *src_folio,
> +			 struct swap_info_struct *si, swp_entry_t entry)
>  {
> +	/*
> +	 * Check if the folio still belongs to the target swap entry after
> +	 * acquiring the lock. Folio can be freed in the swap cache while
> +	 * not locked.
> +	 */
> +	if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> +				  entry.val != src_folio->swap.val))
> +		return -EAGAIN;
> +
>  	double_pt_lock(dst_ptl, src_ptl);
>  
>  	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>  	if (src_folio) {
>  		folio_move_anon_rmap(src_folio, dst_vma);
>  		src_folio->index = linear_page_index(dst_vma, dst_addr);
> +	} else {
> +		/*
> +		 * Check if the swap entry is cached after acquiring the src_pte
> +		 * lock. Or we might miss a new loaded swap cache folio.
> +		 */
> +		if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {

Do we need data_race() for this, if this is an intentionally lockless read?

Another pure swap question: the comment seems to imply this whole thing is
protected by src_pte lock, but is it?

I'm not familiar enough with swap code, but it looks to me like the folio
can be added into the swap cache and swap_map[] set with SWAP_HAS_CACHE as
long as the folio is locked.  It doesn't seem to be directly protected by
the pgtable lock.

Perhaps you meant this: since the src_pte lock is held, it'll serialize
with another thread B concurrently swapping in the swap entry, but only
_later_, when thread B's do_swap_page() checks pte_same() again, will it
see the src pte gone (after thread A's uffdio_move happened, releasing the
src_pte lock), hence thread B will release the newly allocated swap cache
folio?

There's another trivial detail: IIUC pte_same() must fail, because
before/after the uffdio_move the swap entry will be occupied, so there is
no way to have it reused; hence src_pte, even if re-populated again after
uffdio_move succeeded, cannot become the orig_pte (pointing to the swap
entry in question) that thread B read, hence the pte_same() check must
fail.

I'm not sure my understanding is correct, though.  Maybe some richer
comment would always help.

Thanks,

> +			double_pt_unlock(dst_ptl, src_ptl);
> +			return -EAGAIN;
> +		}
>  	}
>  
>  	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1412,7 +1431,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>  		}
>  		err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
>  				orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> -				dst_ptl, src_ptl, src_folio);
> +				dst_ptl, src_ptl, src_folio, si, entry);
>  	}
>  
>  out:
> -- 
> 2.49.0
> 
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-06-02 20:34 ` Peter Xu
@ 2025-06-02 22:08   ` Barry Song
  2025-06-03 11:48     ` Kairui Song
  2025-06-03  7:03   ` Kairui Song
  1 sibling, 1 reply; 7+ messages in thread
From: Barry Song @ 2025-06-02 22:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Kairui Song, linux-mm, Andrew Morton, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
	linux-kernel

On Tue, Jun 3, 2025 at 8:34 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Jun 03, 2025 at 02:14:19AM +0800, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > lookup, and try to move the found folio to the faulting vma when.
> > Currently, it relies on the PTE value check to ensure the moved folio
> > still belongs to the src swap entry, which turns out is not reliable.
> >
> > While working and reviewing the swap table series with Barry, following
> > existing race is observed and reproduced [1]:
> >
> > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> >
> > CPU1                               CPU2
> > userfaultfd_move
> >   move_pages_pte()
> >     entry = pte_to_swp_entry(orig_src_pte);
> >     // Here it got entry = S1
> >     ... < Somehow interrupted> ...
> >                                    <swapin src_pte, alloc and use folio A>
> >                                    // folio A is just a new allocated folio
> >                                    // and get installed into src_pte
> >                                    <frees swap entry S1>
> >                                    // src_pte now points to folio A, S1
> >                                    // has swap count == 0, it can be freed
> >                                    // by folio_swap_swap or swap
> >                                    // allocator's reclaim.
> >                                    <try to swap out another folio B>
> >                                    // folio B is a folio in another VMA.
> >                                    <put folio B to swap cache using S1 >
> >                                    // S1 is freed, folio B could use it
> >                                    // for swap out with no problem.
> >                                    ...
> >     folio = filemap_get_folio(S1)
> >     // Got folio B here !!!
> >     ... < Somehow interrupted again> ...
> >                                    <swapin folio B and free S1>
> >                                    // Now S1 is free to be used again.
> >                                    <swapout src_pte & folio A using S1>
> >                                    // Now src_pte is a swap entry pte
> >                                    // holding S1 again.
> >     folio_trylock(folio)
> >     move_swap_pte
> >       double_pt_lock
> >       is_pte_pages_stable
> >       // Check passed because src_pte == S1
> >       folio_move_anon_rmap(...)
> >       // Moved invalid folio B here !!!
> >
> > The race window is very short and requires multiple collisions of
> > multiple rare events, so it's very unlikely to happen, but with a
> > deliberately constructed reproducer and increased time window, it can be
> > reproduced [1].
> >
> > It's also possible that folio (A) is swapped in, and swapped out again
> > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > swap cache so it needs to be moved too. In this case we should also try
> > again so kernel won't miss a folio move.
> >
> > Fix this by checking if the folio is the valid swap cache folio after
> > acquiring the folio lock, and checking the swap cache again after
> > acquiring the src_pte lock.
> >
> > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > we don't need to worry about that since folios only might get exposed to
> > swap cache in the swap out path, and it's covered in this patch too by
> > checking the swap cache again after acquiring src_pte lock.
>
> [1]
>
> >
> > Testing with a simple C program to allocate and move several GB of memory
> > did not show any observable performance change.
> >
> > Cc: <stable@vger.kernel.org>
> > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> >
> > ---
> >
> > V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> > Changes:
> > - Check swap_map instead of doing a filemap lookup after acquiring the
> >   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
> >
> > V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> > Changes:
> > - Move the folio and swap check inside move_swap_pte to avoid skipping
> >   the check and potential overhead [ Lokesh Gidra ]
> > - Add a READ_ONCE for the swap_map read to ensure it reads a up to dated
> >   value.
> >
> >  mm/userfaultfd.c | 23 +++++++++++++++++++++--
> >  1 file changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..5dc05346e360 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >                        pte_t orig_dst_pte, pte_t orig_src_pte,
> >                        pmd_t *dst_pmd, pmd_t dst_pmdval,
> >                        spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > -                      struct folio *src_folio)
> > +                      struct folio *src_folio,
> > +                      struct swap_info_struct *si, swp_entry_t entry)
> >  {
> > +     /*
> > +      * Check if the folio still belongs to the target swap entry after
> > +      * acquiring the lock. Folio can be freed in the swap cache while
> > +      * not locked.
> > +      */
> > +     if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> > +                               entry.val != src_folio->swap.val))
> > +             return -EAGAIN;
> > +
> >       double_pt_lock(dst_ptl, src_ptl);
> >
> >       if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >       if (src_folio) {
> >               folio_move_anon_rmap(src_folio, dst_vma);
> >               src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +     } else {
> > +             /*
> > +              * Check if the swap entry is cached after acquiring the src_pte
> > +              * lock. Or we might miss a new loaded swap cache folio.
> > +              */
> > +             if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
>
> Do we need data_race() for this, if this is an intentionally lockless read?

Not entirely sure. But I recommend this pattern, borrowed from
zap_nonpresent_ptes() -> free_swap_and_cache_nr(),
where the PTL is also held and READ_ONCE is used.

                if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
                       ..
                        nr = __try_to_reclaim_swap(si, offset,
                                                   TTRS_UNMAPPED | TTRS_FULL);

                        if (nr == 0)
                                nr = 1;
                        else if (nr < 0)
                                nr = -nr;
                        nr = ALIGN(offset + 1, nr) - offset;
                }

I think we could use this to further optimize the existing
filemap_get_folio(), since in the vast majority of cases we don't
have a swapcache, yet we still always call filemap_get_folio().

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..c527ec73c3b4 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c

@@ -1388,7 +1388,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
                 * folios in the swapcache. This issue needs to be resolved
                 * separately to allow proper handling.
                 */

-               if (!src_folio)
+               if (!src_folio && (swap_map[offset] & SWAP_HAS_CACHE))
                        folio = filemap_get_folio(swap_address_space(entry),
                                        swap_cache_index(entry));
                if (!IS_ERR_OR_NULL(folio)) {

To be future-proof, we may want to keep the READ_ONCE to ensure
the compiler doesn't skip the second read inside move_swap_pte().
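
i.e. something along these lines (untested; spelled out with the si and
entry variables that move_pages_pte() already has, keeping the READ_ONCE
check inside move_swap_pte() as in the patch):

		/* Only do the swap cache lookup when SWAP_HAS_CACHE is set */
		if (!src_folio &&
		    (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE))
			folio = filemap_get_folio(swap_address_space(entry),
					swap_cache_index(entry));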

>
> Another pure swap question: the comment seems to imply this whole thing is
> protected by src_pte lock, but is it?
>
> I'm not familiar enough with swap code, but it looks to me the folio can be
> added into swap cache and set swap_map[] with SWAP_HAS_CACHE as long as the
> folio is locked.  It doesn't seem to be directly protected by pgtable lock.
>
> Perhaps you meant this: since src_pte lock is held, then it'll serialize
> with another thread B concurrently swap-in the swap entry, but only _later_
> when thread B's do_swap_page() will check again on pte_same(), then it'll
> see the src pte gone (after thread A uffdio_move happened releasing src_pte
> lock), hence thread B will release the newly allocated swap cache folio?
>
> There's another trivial detail that IIUC pte_same() must fail because
> before/after the uffdio_move the swap entry will be occupied so no way to
> have it reused, hence src_pte, even if re-populated again after uffdio_move
> succeeded, cannot become the orig_pte (points to the swap entry in
> question) that thread B read, hence pte_same() must check fail.

in v1 of this patch, we had some similar discussions [1][2]:

[1] https://lore.kernel.org/linux-mm/CAGsJ_4wBMxQSeoTwpKoWwEGRAr=iohbYf64aYyJ55t0Z11FkwA@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/CAGsJ_4wM8Tph0Mbc-1Y9xNjgMPL7gqEjp=ArBuv3cJijHVXe6w@mail.gmail.com/

At the very least, [2] is possible, although the probability is extremely low.

"It seems also possible for the sync zRAM device.

 step 1: src pte points to a swap entry S without swapcache
 step 2: we call move_swap_pte()
 step 3: someone swap-in src_pte by sync path, no swapcache; swap slot
S is freed.
             -- for zRAM;
 step 4: someone swap-out src_pte, get the exactly same swap slot S as step 1,
             adds folio to swapcache due to swapout;
 step 5: move_swap_pte() gets ptl and finds page tables are stable
since swap-out
             happens to have the same swap slot as step1;
 step 6: we clear src_pte, move src_pte to dst_pte; but miss to move the folio.

Yep, we really need to re-check pte for swapcache after holding PTL.
"

Personally, I agree that improving the changelog or the comments
would be more helpful. In fact, there are two bugs here, and Kairui’s
changelog clearly describes the first one.

>
> I'm not sure my understanding is correct, though.  Maybe some richer
> comment would always help.
>
> Thanks,
>
> > +                     double_pt_unlock(dst_ptl, src_ptl);
> > +                     return -EAGAIN;
> > +             }
> >       }
> >
> >       orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > @@ -1412,7 +1431,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >               }
> >               err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> >                               orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > -                             dst_ptl, src_ptl, src_folio);
> > +                             dst_ptl, src_ptl, src_folio, si, entry);
> >       }
> >
> >  out:
> > --
> > 2.49.0
> >
> >
>
> --
> Peter Xu
>

Thanks
barry


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-06-02 20:34 ` Peter Xu
  2025-06-02 22:08   ` Barry Song
@ 2025-06-03  7:03   ` Kairui Song
  1 sibling, 0 replies; 7+ messages in thread
From: Kairui Song @ 2025-06-03  7:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, Andrew Morton, Barry Song, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
	linux-kernel

On Tue, Jun 3, 2025 at 4:34 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Jun 03, 2025 at 02:14:19AM +0800, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > lookup, and try to move the found folio to the faulting vma when.
> > Currently, it relies on the PTE value check to ensure the moved folio
> > still belongs to the src swap entry, which turns out is not reliable.
> >
> > While working and reviewing the swap table series with Barry, following
> > existing race is observed and reproduced [1]:
> >
> > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> >
> > CPU1                               CPU2
> > userfaultfd_move
> >   move_pages_pte()
> >     entry = pte_to_swp_entry(orig_src_pte);
> >     // Here it got entry = S1
> >     ... < Somehow interrupted> ...
> >                                    <swapin src_pte, alloc and use folio A>
> >                                    // folio A is just a new allocated folio
> >                                    // and get installed into src_pte
> >                                    <frees swap entry S1>
> >                                    // src_pte now points to folio A, S1
> >                                    // has swap count == 0, it can be freed
> >                                    // by folio_swap_swap or swap
> >                                    // allocator's reclaim.
> >                                    <try to swap out another folio B>
> >                                    // folio B is a folio in another VMA.
> >                                    <put folio B to swap cache using S1 >
> >                                    // S1 is freed, folio B could use it
> >                                    // for swap out with no problem.
> >                                    ...
> >     folio = filemap_get_folio(S1)
> >     // Got folio B here !!!
> >     ... < Somehow interrupted again> ...
> >                                    <swapin folio B and free S1>
> >                                    // Now S1 is free to be used again.
> >                                    <swapout src_pte & folio A using S1>
> >                                    // Now src_pte is a swap entry pte
> >                                    // holding S1 again.
> >     folio_trylock(folio)
> >     move_swap_pte
> >       double_pt_lock
> >       is_pte_pages_stable
> >       // Check passed because src_pte == S1
> >       folio_move_anon_rmap(...)
> >       // Moved invalid folio B here !!!
> >
> > The race window is very short and requires multiple collisions of
> > multiple rare events, so it's very unlikely to happen, but with a
> > deliberately constructed reproducer and increased time window, it can be
> > reproduced [1].
> >
> > It's also possible that folio (A) is swapped in, and swapped out again
> > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > swap cache so it needs to be moved too. In this case we should also try
> > again so kernel won't miss a folio move.
> >
> > Fix this by checking if the folio is the valid swap cache folio after
> > acquiring the folio lock, and checking the swap cache again after
> > acquiring the src_pte lock.
> >
> > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > we don't need to worry about that since folios only might get exposed to
> > swap cache in the swap out path, and it's covered in this patch too by
> > checking the swap cache again after acquiring src_pte lock.
>
> [1]
>
> >
> > Testing with a simple C program to allocate and move several GB of memory
> > did not show any observable performance change.
> >
> > Cc: <stable@vger.kernel.org>
> > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> >
> > ---
> >
> > V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> > Changes:
> > - Check swap_map instead of doing a filemap lookup after acquiring the
> >   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
> >
> > V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> > Changes:
> > - Move the folio and swap check inside move_swap_pte to avoid skipping
> >   the check and potential overhead [ Lokesh Gidra ]
> > - Add a READ_ONCE for the swap_map read to ensure it reads a up to dated
> >   value.
> >
> >  mm/userfaultfd.c | 23 +++++++++++++++++++++--
> >  1 file changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..5dc05346e360 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >                        pte_t orig_dst_pte, pte_t orig_src_pte,
> >                        pmd_t *dst_pmd, pmd_t dst_pmdval,
> >                        spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > -                      struct folio *src_folio)
> > +                      struct folio *src_folio,
> > +                      struct swap_info_struct *si, swp_entry_t entry)
> >  {
> > +     /*
> > +      * Check if the folio still belongs to the target swap entry after
> > +      * acquiring the lock. Folio can be freed in the swap cache while
> > +      * not locked.
> > +      */
> > +     if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> > +                               entry.val != src_folio->swap.val))
> > +             return -EAGAIN;
> > +
> >       double_pt_lock(dst_ptl, src_ptl);
> >
> >       if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >       if (src_folio) {
> >               folio_move_anon_rmap(src_folio, dst_vma);
> >               src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +     } else {
> > +             /*
> > +              * Check if the swap entry is cached after acquiring the src_pte
> > +              * lock. Or we might miss a new loaded swap cache folio.
> > +              */
> > +             if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {

Hi Peter, Thanks for the review!

>
> Do we need data_race() for this, if this is an intentionally lockless read?
>
> Another pure swap question: the comment seems to imply this whole thing is
> protected by src_pte lock, but is it?

It's tricky here, in theory we do need to hold the swap cluster lock
to read the swap_map, or it will look kind of hackish (which is the
reason I tried to avoid using that SWAP_HAS_CACHE bit in V1). But
I think it's OK to be lockless here because the only case where it
could go wrong is:
A concurrent swapin allocated a folio for the entry and updated
the PTE releasing the swap entry, then another swapout left the same
folio in swap cache and changed the PTE back reusing the entry. PTE
lock is acquired then released twice here, so holding the PTE lock
here will surely see the updated swap_map value. In other cases, the
is_pte_pages_stable check will fail and there is no race issue.

> I'm not familiar enough with swap code, but it looks to me the folio can be
> added into swap cache and set swap_map[] with SWAP_HAS_CACHE as long as the
> folio is locked.  It doesn't seem to be directly protected by pgtable lock.

You mean "as long as the folio is unlocked" right?

> Perhaps you meant this: since src_pte lock is held, then it'll serialize
> with another thread B concurrently swap-in the swap entry, but only _later_
> when thread B's do_swap_page() will check again on pte_same(), then it'll
> see the src pte gone (after thread A uffdio_move happened releasing src_pte
> lock), hence thread B will release the newly allocated swap cache folio?

The case you described is valid, and there is no bug with that; it's
not the case I'm fixing here.

> There's another trivial detail that IIUC pte_same() must fail because
> before/after the uffdio_move the swap entry will be occupied so no way to
> have it reused, hence src_pte, even if re-populated again after uffdio_move
> succeeded, cannot become the orig_pte (points to the swap entry in
> question) that thread B read, hence pte_same() must check fail.

A swap entry can be freed before uffdio_move takes the PTE lock: a swapin
can finish and release it. Then another swapout can reuse it.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-06-02 22:08   ` Barry Song
@ 2025-06-03 11:48     ` Kairui Song
  2025-06-03 16:42       ` Peter Xu
  0 siblings, 1 reply; 7+ messages in thread
From: Kairui Song @ 2025-06-03 11:48 UTC (permalink / raw)
  To: Barry Song
  Cc: Peter Xu, linux-mm, Andrew Morton, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
	linux-kernel

On Tue, Jun 3, 2025 at 6:08 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Jun 3, 2025 at 8:34 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Jun 03, 2025 at 02:14:19AM +0800, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > lookup, and try to move the found folio to the faulting vma when.
> > > Currently, it relies on the PTE value check to ensure the moved folio
> > > still belongs to the src swap entry, which turns out is not reliable.
> > >
> > > While working and reviewing the swap table series with Barry, following
> > > existing race is observed and reproduced [1]:
> > >
> > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > >
> > > CPU1                               CPU2
> > > userfaultfd_move
> > >   move_pages_pte()
> > >     entry = pte_to_swp_entry(orig_src_pte);
> > >     // Here it got entry = S1
> > >     ... < Somehow interrupted> ...
> > >                                    <swapin src_pte, alloc and use folio A>
> > >                                    // folio A is just a new allocated folio
> > >                                    // and get installed into src_pte
> > >                                    <frees swap entry S1>
> > >                                    // src_pte now points to folio A, S1
> > >                                    // has swap count == 0, it can be freed
> > >                                    // by folio_swap_swap or swap
> > >                                    // allocator's reclaim.
> > >                                    <try to swap out another folio B>
> > >                                    // folio B is a folio in another VMA.
> > >                                    <put folio B to swap cache using S1 >
> > >                                    // S1 is freed, folio B could use it
> > >                                    // for swap out with no problem.
> > >                                    ...
> > >     folio = filemap_get_folio(S1)
> > >     // Got folio B here !!!
> > >     ... < Somehow interrupted again> ...
> > >                                    <swapin folio B and free S1>
> > >                                    // Now S1 is free to be used again.
> > >                                    <swapout src_pte & folio A using S1>
> > >                                    // Now src_pte is a swap entry pte
> > >                                    // holding S1 again.
> > >     folio_trylock(folio)
> > >     move_swap_pte
> > >       double_pt_lock
> > >       is_pte_pages_stable
> > >       // Check passed because src_pte == S1
> > >       folio_move_anon_rmap(...)
> > >       // Moved invalid folio B here !!!
> > >
> > > The race window is very short and requires multiple collisions of
> > > multiple rare events, so it's very unlikely to happen, but with a
> > > deliberately constructed reproducer and increased time window, it can be
> > > reproduced [1].
> > >
> > > It's also possible that folio (A) is swapped in, and swapped out again
> > > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > > swap cache so it needs to be moved too. In this case we should also try
> > > again so kernel won't miss a folio move.
> > >
> > > Fix this by checking if the folio is the valid swap cache folio after
> > > acquiring the folio lock, and checking the swap cache again after
> > > acquiring the src_pte lock.
> > >
> > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > > we don't need to worry about that since folios only might get exposed to
> > > swap cache in the swap out path, and it's covered in this patch too by
> > > checking the swap cache again after acquiring src_pte lock.
> >
> > [1]
> >
> > >
> > > Testing with a simple C program to allocate and move several GB of memory
> > > did not show any observable performance change.
> > >
> > > Cc: <stable@vger.kernel.org>
> > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > >
> > > ---
> > >
> > > V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> > > Changes:
> > > - Check swap_map instead of doing a filemap lookup after acquiring the
> > >   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
> > >
> > > V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> > > Changes:
> > > - Move the folio and swap check inside move_swap_pte to avoid skipping
> > >   the check and potential overhead [ Lokesh Gidra ]
> > > - Add a READ_ONCE for the swap_map read to ensure it reads a up to dated
> > >   value.
> > >
> > >  mm/userfaultfd.c | 23 +++++++++++++++++++++--
> > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index bc473ad21202..5dc05346e360 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >                        pte_t orig_dst_pte, pte_t orig_src_pte,
> > >                        pmd_t *dst_pmd, pmd_t dst_pmdval,
> > >                        spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > -                      struct folio *src_folio)
> > > +                      struct folio *src_folio,
> > > +                      struct swap_info_struct *si, swp_entry_t entry)
> > >  {
> > > +     /*
> > > +      * Check if the folio still belongs to the target swap entry after
> > > +      * acquiring the lock. Folio can be freed in the swap cache while
> > > +      * not locked.
> > > +      */
> > > +     if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> > > +                               entry.val != src_folio->swap.val))
> > > +             return -EAGAIN;
> > > +
> > >       double_pt_lock(dst_ptl, src_ptl);
> > >
> > >       if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > @@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >       if (src_folio) {
> > >               folio_move_anon_rmap(src_folio, dst_vma);
> > >               src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > +     } else {
> > > +             /*
> > > +              * Check if the swap entry is cached after acquiring the src_pte
> > > +              * lock. Or we might miss a new loaded swap cache folio.
> > > +              */
> > > +             if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
> >
> > Do we need data_race() for this, if this is an intentionally lockless read?
>
> Not entirely sure. But I recommend this pattern, borrowed from
> zap_nonpresent_ptes() -> free_swap_and_cache_nr(),
> where the PTL is also held and READ_ONCE is used.
>
>                 if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
>                        ..
>                         nr = __try_to_reclaim_swap(si, offset,
>                                                    TTRS_UNMAPPED | TTRS_FULL);
>
>                         if (nr == 0)
>                                 nr = 1;
>                         else if (nr < 0)
>                                 nr = -nr;
>                         nr = ALIGN(offset + 1, nr) - offset;
>                 }

Thanks for the explanation, I also agree that holding the PTL is good
enough here.

>
> I think we could use this to further optimize the existing
> filemap_get_folio(), since in the vast majority of cases we don't
> have a swapcache, yet we still always call filemap_get_folio().
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..c527ec73c3b4 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
>
> @@ -1388,7 +1388,7 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
>                  * folios in the swapcache. This issue needs to be resolved
>                  * separately to allow proper handling.
>                  */
>
> -               if (!src_folio)
> +               if (!src_folio & (swap_map[offset] & SWAP_HAS_CACHE))
>                         folio = filemap_get_folio(swap_address_space(entry),
>                                         swap_cache_index(entry));
>                 if (!IS_ERR_OR_NULL(folio)) {
>
> To be future-proof, we may want to keep the READ_ONCE to ensure
> the compiler doesn't skip the second read inside move_swap_pte().

Maybe we can do this optimization in another patch.

>
> >
> > Another pure swap question: the comment seems to imply this whole thing is
> > protected by src_pte lock, but is it?
> >
> > I'm not familiar enough with swap code, but it looks to me the folio can be
> > added into swap cache and set swap_map[] with SWAP_HAS_CACHE as long as the
> > folio is locked.  It doesn't seem to be directly protected by pgtable lock.
> >
> > Perhaps you meant this: since src_pte lock is held, then it'll serialize
> > with another thread B concurrently swap-in the swap entry, but only _later_
> > when thread B's do_swap_page() will check again on pte_same(), then it'll
> > see the src pte gone (after thread A uffdio_move happened releasing src_pte
> > lock), hence thread B will release the newly allocated swap cache folio?
> >
> > There's another trivial detail that IIUC pte_same() must fail because
> > before/after the uffdio_move the swap entry will be occupied so no way to
> > have it reused, hence src_pte, even if re-populated again after uffdio_move
> > succeeded, cannot become the orig_pte (points to the swap entry in
> > question) that thread B read, hence pte_same() must check fail.
>
> in v1 of this patch, we had some similar discussions [1][2]:
>
> [1] https://lore.kernel.org/linux-mm/CAGsJ_4wBMxQSeoTwpKoWwEGRAr=iohbYf64aYyJ55t0Z11FkwA@mail.gmail.com/
> [2] https://lore.kernel.org/linux-mm/CAGsJ_4wM8Tph0Mbc-1Y9xNjgMPL7gqEjp=ArBuv3cJijHVXe6w@mail.gmail.com/
>
> At the very least, [2] is possible, although the probability is extremely low.
>
> "It seems also possible for the sync zRAM device.
>
>  step 1: src pte points to a swap entry S without swapcache
>  step 2: we call move_swap_pte()
>  step 3: someone swap-in src_pte by sync path, no swapcache; swap slot
> S is freed.
>              -- for zRAM;
>  step 4: someone swap-out src_pte, get the exactly same swap slot S as step 1,
>              adds folio to swapcache due to swapout;
>  step 5: move_swap_pte() gets ptl and finds page tables are stable
> since swap-out
>              happens to have the same swap slot as step1;
>  step 6: we clear src_pte, move src_pte to dst_pte; but miss to move the folio.
>
> Yep, we really need to re-check pte for swapcache after holding PTL.
> "
>
> Personally, I agree that improving the changelog or the comments
> would be more helpful. In fact, there are two bugs here, and Kairui’s
> changelog clearly describes the first one.

Yeah, the first one is quite a long and complex race already, so I
made the description on the second issue shorter. I thought it
wouldn't be too difficult to understand given the first example. I can
add some more details.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-06-03 11:48     ` Kairui Song
@ 2025-06-03 16:42       ` Peter Xu
  0 siblings, 0 replies; 7+ messages in thread
From: Peter Xu @ 2025-06-03 16:42 UTC (permalink / raw)
  To: Kairui Song
  Cc: Barry Song, linux-mm, Andrew Morton, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
	linux-kernel

On Tue, Jun 03, 2025 at 07:48:49PM +0800, Kairui Song wrote:
> On Tue, Jun 3, 2025 at 6:08 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Tue, Jun 3, 2025 at 8:34 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Tue, Jun 03, 2025 at 02:14:19AM +0800, Kairui Song wrote:
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > > lookup, and try to move the found folio to the faulting vma when.
> > > > Currently, it relies on the PTE value check to ensure the moved folio
> > > > still belongs to the src swap entry, which turns out is not reliable.
> > > >
> > > > While working and reviewing the swap table series with Barry, following
> > > > existing race is observed and reproduced [1]:
> > > >
> > > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > > >
> > > > CPU1                               CPU2
> > > > userfaultfd_move
> > > >   move_pages_pte()
> > > >     entry = pte_to_swp_entry(orig_src_pte);
> > > >     // Here it got entry = S1
> > > >     ... < Somehow interrupted> ...
> > > >                                    <swapin src_pte, alloc and use folio A>
> > > >                                    // folio A is just a new allocated folio
> > > >                                    // and get installed into src_pte
> > > >                                    <frees swap entry S1>
> > > >                                    // src_pte now points to folio A, S1
> > > >                                    // has swap count == 0, it can be freed
> > > >                                    // by folio_free_swap or the swap
> > > >                                    // allocator's reclaim.
> > > >                                    <try to swap out another folio B>
> > > >                                    // folio B is a folio in another VMA.
> > > >                                    <put folio B to swap cache using S1 >
> > > >                                    // S1 is freed, folio B could use it
> > > >                                    // for swap out with no problem.
> > > >                                    ...
> > > >     folio = filemap_get_folio(S1)
> > > >     // Got folio B here !!!
> > > >     ... < Somehow interrupted again> ...
> > > >                                    <swapin folio B and free S1>
> > > >                                    // Now S1 is free to be used again.
> > > >                                    <swapout src_pte & folio A using S1>
> > > >                                    // Now src_pte is a swap entry pte
> > > >                                    // holding S1 again.
> > > >     folio_trylock(folio)
> > > >     move_swap_pte
> > > >       double_pt_lock
> > > >       is_pte_pages_stable
> > > >       // Check passed because src_pte == S1
> > > >       folio_move_anon_rmap(...)
> > > >       // Moved invalid folio B here !!!
> > > >
> > > > The race window is very short and requires multiple collisions of
> > > > multiple rare events, so it's very unlikely to happen, but with a
> > > > deliberately constructed reproducer and increased time window, it can be
> > > > reproduced [1].
> > > >
> > > > It's also possible that folio (A) is swapped in, and swapped out again
> > > > after the filemap_get_folio lookup, in which case folio (A) may stay in
> > > > the swap cache, so it needs to be moved too. In this case we should also
> > > > try again so the kernel won't miss a folio move.
> > > >
> > > > Fix this by checking if the folio is the valid swap cache folio after
> > > > acquiring the folio lock, and checking the swap cache again after
> > > > acquiring the src_pte lock.
> > > >
> > > > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> > > > far we don't need to worry about that, since folios might only get
> > > > exposed to the swap cache in the swap-out path, and that is covered by
> > > > this patch too by checking the swap cache again after acquiring the
> > > > src_pte lock.
> > >
> > > [1]
> > >
> > > >
> > > > Testing with a simple C program to allocate and move several GB of memory
> > > > did not show any observable performance change.
> > > >
> > > > Cc: <stable@vger.kernel.org>
> > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > >
> > > > ---
> > > >
> > > > V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> > > > Changes:
> > > > - Check swap_map instead of doing a filemap lookup after acquiring the
> > > >   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
> > > >
> > > > V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> > > > Changes:
> > > > - Move the folio and swap check inside move_swap_pte to avoid skipping
> > > >   the check and potential overhead [ Lokesh Gidra ]
> > > > - Add a READ_ONCE for the swap_map read to ensure it reads an
> > > >   up-to-date value.
> > > >
> > > >  mm/userfaultfd.c | 23 +++++++++++++++++++++--
> > > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index bc473ad21202..5dc05346e360 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > >                        pte_t orig_dst_pte, pte_t orig_src_pte,
> > > >                        pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > >                        spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > -                      struct folio *src_folio)
> > > > +                      struct folio *src_folio,
> > > > +                      struct swap_info_struct *si, swp_entry_t entry)
> > > >  {
> > > > +     /*
> > > > +      * Check if the folio still belongs to the target swap entry after
> > > > +      * acquiring the lock. Folio can be freed in the swap cache while
> > > > +      * not locked.
> > > > +      */
> > > > +     if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> > > > +                               entry.val != src_folio->swap.val))
> > > > +             return -EAGAIN;
> > > > +
> > > >       double_pt_lock(dst_ptl, src_ptl);
> > > >
> > > >       if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > @@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > >       if (src_folio) {
> > > >               folio_move_anon_rmap(src_folio, dst_vma);
> > > >               src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > +     } else {
> > > > +             /*
> > > > +              * Check if the swap entry is cached after acquiring the src_pte
> > > > +              * lock. Otherwise we might miss a newly loaded swap cache folio.
> > > > +              */
> > > > +             if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
> > >
> > > Do we need data_race() for this, if this is an intentionally lockless read?
> >
> > Not entirely sure. But I recommend this pattern, borrowed from

AFAIU data_race() is only for KCSAN.  READ_ONCE/WRITE_ONCE will also be
needed.  The doc actually explicitly mentioned this case:

/*
 * ...
 * be atomic *and* KCSAN should ignore the access, use both data_race()
 * and READ_ONCE(), for example, data_race(READ_ONCE(x)).
 */
#define data_race(expr)							\

I'm OK if there're existing references of swap_map[] like this; even if it would
be good to have data_race(), we can add it later at all the spots if someone
complains.
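
Just to illustrate the pattern being discussed, a minimal sketch of what the
combined annotation could look like on this read (not something the patch
currently does; whether the extra data_race() is worth it is exactly the open
question above):

	unsigned char val;

	/*
	 * Hypothetical variant of the lockless read in move_swap_pte():
	 * READ_ONCE() keeps the load whole and stops the compiler from
	 * refetching it, while data_race() additionally tells KCSAN that
	 * the racy access is intentional.  Unlocking on the error path is
	 * elided from this sketch.
	 */
	val = data_race(READ_ONCE(si->swap_map[swp_offset(entry)]));
	if (val & SWAP_HAS_CACHE)
		return -EAGAIN;	/* a swap cache folio appeared, retry */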

> > zap_nonpresent_ptes() -> free_swap_and_cache_nr(),
> > where the PTL is also held and READ_ONCE is used.
> >
> >                 if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
> >                        ..
> >                         nr = __try_to_reclaim_swap(si, offset,
> >                                                    TTRS_UNMAPPED | TTRS_FULL);
> >
> >                         if (nr == 0)
> >                                 nr = 1;
> >                         else if (nr < 0)
> >                                 nr = -nr;
> >                         nr = ALIGN(offset + 1, nr) - offset;
> >                 }
> 
> Thanks for the explanation, I also agree that holding the PTL is good
> enough here.
> 
> >
> > I think we could use this to further optimize the existing
> > filemap_get_folio(), since in the vast majority of cases we don't
> > have a swapcache, yet we still always call filemap_get_folio().
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..c527ec73c3b4 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> >
> > @@ -1388,7 +1388,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                  * folios in the swapcache. This issue needs to be resolved
> >                  * separately to allow proper handling.
> >                  */
> >
> > -               if (!src_folio)
> > +               if (!src_folio && (swap_map[offset] & SWAP_HAS_CACHE))
> >                         folio = filemap_get_folio(swap_address_space(entry),
> >                                         swap_cache_index(entry));
> >                 if (!IS_ERR_OR_NULL(folio)) {
> >
> > To be future-proof, we may want to keep the READ_ONCE to ensure
> > the compiler doesn't skip the second read inside move_swap_pte().
> 
> Maybe we can do this optimization in another patch I think.
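
For reference, a rough sketch of what such a follow-up could look like, reusing
the si/entry variables that move_pages_pte() already has in the patched code;
this is only an illustration of the idea, not a tested change:

	/*
	 * Hypothetical follow-up: skip the swap cache lookup entirely when
	 * the lockless swap_map read shows no cached folio.  A false
	 * negative here is still caught by the re-check done under the PTL
	 * in move_swap_pte().
	 */
	if (!src_folio &&
	    (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE))
		folio = filemap_get_folio(swap_address_space(entry),
					  swap_cache_index(entry));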
> 
> >
> > >
> > > Another pure swap question: the comment seems to imply this whole thing is
> > > protected by src_pte lock, but is it?
> > >
> > > I'm not familiar enough with swap code, but it looks to me the folio can be
> > > added into swap cache and set swap_map[] with SWAP_HAS_CACHE as long as the
> > > folio is locked.  It doesn't seem to be directly protected by pgtable lock.
> > >
> > > Perhaps you meant this: since src_pte lock is held, then it'll serialize
> > > with another thread B concurrently swapping in the swap entry, but only
> > > _later_, when thread B's do_swap_page() re-checks with pte_same(), will it
> > > see the src pte gone (after thread A's uffdio_move has released the src_pte
> > > lock), hence thread B will release the newly allocated swap cache folio?
> > >
> > > There's another trivial detail: IIUC pte_same() must fail because
> > > before/after the uffdio_move the swap entry will be occupied, so there is no
> > > way for it to be reused; hence src_pte, even if re-populated again after
> > > uffdio_move succeeded, cannot become the orig_pte (pointing to the swap
> > > entry in question) that thread B read, hence the pte_same() check must fail.
> >
> > in v1 of this patch, we had some similar discussions [1][2]:
> >
> > [1] https://lore.kernel.org/linux-mm/CAGsJ_4wBMxQSeoTwpKoWwEGRAr=iohbYf64aYyJ55t0Z11FkwA@mail.gmail.com/
> > [2] https://lore.kernel.org/linux-mm/CAGsJ_4wM8Tph0Mbc-1Y9xNjgMPL7gqEjp=ArBuv3cJijHVXe6w@mail.gmail.com/
> >
> > At the very least, [2] is possible, although the probability is extremely low.
> >
> > "It seems also possible for the sync zRAM device.
> >
> >  step 1: src pte points to a swap entry S without swapcache
> >  step 2: we call move_swap_pte()
> >  step 3: someone swaps in src_pte via the sync path (e.g. zRAM), no swapcache;
> >              swap slot S is freed.
> >  step 4: someone swaps out src_pte and gets exactly the same swap slot S as
> >              step 1, adding the folio to the swapcache due to the swapout;
> >  step 5: move_swap_pte() takes the ptl and finds the page tables are stable,
> >              since the swap-out happens to use the same swap slot as step 1;
> >  step 6: we clear src_pte and move src_pte to dst_pte, but miss moving the folio.
> >
> > Yep, we really need to re-check pte for swapcache after holding PTL.
> > "
> >
> > Personally, I agree that improving the changelog or the comments
> > would be more helpful. In fact, there are two bugs here, and Kairui’s
> > changelog clearly describes the first one.
> 
> Yeah, the first one is quite a long and complex race already, so I
> made the description on the second issue shorter. I thought it
> wouldn't be too difficult to understand given the first example. I can
> add some more details.

IMHO it's not how the race happens that is hard to follow.  To me, it's
harder to follow how the src_pte lock can prove the READ_ONCE() of
swap_map[] is valid.  Especially why it's okay even if the read is a false
negative (it relies on the other thread retrying properly in do_swap_page).
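
For reference, the relevant shape of that retry, heavily simplified from
do_swap_page() (most of the fault path and the error handling are elided):

	/*
	 * After allocating a folio and adding it to the swap cache,
	 * do_swap_page() re-reads the PTE under the PTE lock.  If
	 * uffdio_move already moved src_pte away, pte_same() fails, the
	 * fault backs out and the newly allocated swap cache folio is
	 * released, so a swap_map read on the uffdio_move side that misses
	 * this in-flight folio is harmless.
	 */
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
				       &vmf->ptl);
	if (unlikely(!vmf->pte ||
		     !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
		goto out_nomap;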

It'll be much appreciated if you could add something about this to either the
comments or the git commit message. Maybe a link to this specific lore
discussion (or v1's) would be helpful.

Thanks,

-- 
Peter Xu





Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-02 18:14 [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache Kairui Song
2025-06-02 20:17 ` Lokesh Gidra
2025-06-02 20:34 ` Peter Xu
2025-06-02 22:08   ` Barry Song
2025-06-03 11:48     ` Kairui Song
2025-06-03 16:42       ` Peter Xu
2025-06-03  7:03   ` Kairui Song
