* [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
@ 2025-05-30 20:17 Kairui Song
  2025-05-30 23:40 ` Lokesh Gidra
  2025-05-31  4:04 ` Barry Song
  0 siblings, 2 replies; 16+ messages in thread
From: Kairui Song @ 2025-05-30 20:17 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Barry Song, Peter Xu, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
lookup, and tries to move the found folio to the faulting vma.
Currently, it relies on checking the PTE value to ensure that the moved
folio still belongs to the src swap entry, which turns out not to be
reliable.

While working on and reviewing the swap table series with Barry, the
following pre-existing race was observed and reproduced [1]:

( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
 swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)

CPU1                               CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1
    ... < Somehow interrupted> ...
                                   <swapin src_pte, alloc and use folio A>
                                   // folio A is a newly allocated folio
                                   // and gets installed into src_pte
                                   <frees swap entry S1>
                                   // src_pte now points to folio A, S1
                                   // has swap count == 0, it can be freed
                                   // by folio_free_swap or the swap
                                   // allocator's reclaim.
                                   <try to swap out another folio B>
                                   // folio B is a folio in another VMA.
                                   <put folio B to swap cache using S1 >
                                   // S1 is freed, folio B could use it
                                   // for swap out with no problem.
                                   ...
    folio = filemap_get_folio(S1)
    // Got folio B here !!!
    ... < Somehow interrupted again> ...
                                   <swapin folio B and free S1>
                                   // Now S1 is free to be used again.
                                   <swapout src_pte & folio A using S1>
				   // Now src_pte is a swap entry pte
				   // holding S1 again.
    folio_trylock(folio)
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // Moved invalid folio B here !!!

The race window is very short and requires multiple rare events to
collide, so it's very unlikely to happen, but with a deliberately
constructed reproducer and an increased time window, it can be
reproduced [1].

It's also possible that folio (A) is swapped in, and then swapped out
again after the filemap_get_folio lookup; in that case folio (A) may stay
in the swap cache, so it needs to be moved too. In this case we should
also retry so the kernel won't miss a folio move.

Fix this by checking whether the folio is still the valid swap cache
folio after acquiring the folio lock, and by checking the swap cache
again after acquiring the src_pte lock.

The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
far we don't need to worry about it, since folios can only get exposed
to the swap cache in the swap out path, and that is also covered by this
patch by checking the swap cache again after acquiring the src_pte lock.

Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..a1564d205dfb 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -15,6 +15,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/hugetlb.h>
 #include <linux/shmem_fs.h>
+#include <linux/delay.h>
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
@@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
 			 struct folio *src_folio)
 {
+	swp_entry_t entry;
+
 	double_pt_lock(dst_ptl, src_ptl);
 
 	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 	if (src_folio) {
 		folio_move_anon_rmap(src_folio, dst_vma);
 		src_folio->index = linear_page_index(dst_vma, dst_addr);
+	} else {
+		/*
+		 * Check again after acquiring the src_pte lock. Or we might
+		 * miss a new loaded swap cache folio.
+		 */
+		entry = pte_to_swp_entry(orig_src_pte);
+		src_folio = filemap_get_folio(swap_address_space(entry),
+					      swap_cache_index(entry));
+		if (!IS_ERR_OR_NULL(src_folio)) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			folio_put(src_folio);
+			return -EAGAIN;
+		}
 	}
 
 	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 				folio_lock(src_folio);
 				goto retry;
 			}
+			/*
+			 * Check if the folio still belongs to the target swap entry after
+			 * acquiring the lock. Folio can be freed in the swap cache while
+			 * not locked.
+			 */
+			if (unlikely(!folio_test_swapcache(folio) ||
+				     entry.val != folio->swap.val)) {
+				err = -EAGAIN;
+				goto out;
+			}
 		}
 		err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
 				orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
-- 
2.49.0




* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-30 20:17 [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache Kairui Song
@ 2025-05-30 23:40 ` Lokesh Gidra
  2025-05-31  3:37   ` Barry Song
  2025-05-31  6:22   ` Kairui Song
  2025-05-31  4:04 ` Barry Song
  1 sibling, 2 replies; 16+ messages in thread
From: Lokesh Gidra @ 2025-05-30 23:40 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Barry Song, Peter Xu, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, stable, linux-kernel

On Fri, May 30, 2025 at 1:17 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> lookup, and try to move the found folio to the faulting vma when.
> Currently, it relies on the PTE value check to ensure the moved folio
> still belongs to the src swap entry, which turns out is not reliable.
>
> While working and reviewing the swap table series with Barry, following
> existing race is observed and reproduced [1]:
>
> ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
>  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
>
> CPU1                               CPU2
> userfaultfd_move
>   move_pages_pte()
>     entry = pte_to_swp_entry(orig_src_pte);
>     // Here it got entry = S1
>     ... < Somehow interrupted> ...
>                                    <swapin src_pte, alloc and use folio A>
>                                    // folio A is just a new allocated folio
>                                    // and get installed into src_pte
>                                    <frees swap entry S1>
>                                    // src_pte now points to folio A, S1
>                                    // has swap count == 0, it can be freed
>                                    // by folio_swap_swap or swap
>                                    // allocator's reclaim.
>                                    <try to swap out another folio B>
>                                    // folio B is a folio in another VMA.
>                                    <put folio B to swap cache using S1 >
>                                    // S1 is freed, folio B could use it
>                                    // for swap out with no problem.
>                                    ...
>     folio = filemap_get_folio(S1)
>     // Got folio B here !!!
>     ... < Somehow interrupted again> ...
>                                    <swapin folio B and free S1>
>                                    // Now S1 is free to be used again.
>                                    <swapout src_pte & folio A using S1>
>                                    // Now src_pte is a swap entry pte
>                                    // holding S1 again.
>     folio_trylock(folio)
>     move_swap_pte
>       double_pt_lock
>       is_pte_pages_stable
>       // Check passed because src_pte == S1
>       folio_move_anon_rmap(...)
>       // Moved invalid folio B here !!!
>
> The race window is very short and requires multiple collisions of
> multiple rare events, so it's very unlikely to happen, but with a
> deliberately constructed reproducer and increased time window, it can be
> reproduced [1].

Thanks for catching and fixing this. Just to clarify a few things
about your reproducer:
1. Is it necessary for the 'race' mapping to be MAP_SHARED, or will
MAP_PRIVATE work as well?
2. You mentioned that the 'current dir is on a block device'. Are you
indicating that if we are using zram for swap then it doesn't
reproduce?

>
> It's also possible that folio (A) is swapped in, and swapped out again
> after the filemap_get_folio lookup, in such case folio (A) may stay in
> swap cache so it needs to be moved too. In this case we should also try
> again so kernel won't miss a folio move.
>
> Fix this by checking if the folio is the valid swap cache folio after
> acquiring the folio lock, and checking the swap cache again after
> acquiring the src_pte lock.
>
> SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> we don't need to worry about that since folios only might get exposed to
> swap cache in the swap out path, and it's covered in this patch too by
> checking the swap cache again after acquiring src_pte lock.
>
> Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..a1564d205dfb 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -15,6 +15,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
>  #include <linux/shmem_fs.h>
> +#include <linux/delay.h>
I guess you mistakenly left it from your reproducer code :)
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
>                          struct folio *src_folio)
>  {
> +       swp_entry_t entry;
> +
>         double_pt_lock(dst_ptl, src_ptl);
>
>         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>         if (src_folio) {
>                 folio_move_anon_rmap(src_folio, dst_vma);
>                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> +       } else {
> +               /*
> +                * Check again after acquiring the src_pte lock. Or we might
> +                * miss a new loaded swap cache folio.
> +                */
> +               entry = pte_to_swp_entry(orig_src_pte);
> +               src_folio = filemap_get_folio(swap_address_space(entry),
> +                                             swap_cache_index(entry));

Given the non-trivial overhead of filemap_get_folio(), do you think it
will work if filemap_get_folio() were called only once, after locking
src_ptl? Please correct me if my assumption about the overhead is wrong.

> +               if (!IS_ERR_OR_NULL(src_folio)) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       folio_put(src_folio);
> +                       return -EAGAIN;
> +               }
>         }
>
>         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>                                 folio_lock(src_folio);
>                                 goto retry;
>                         }
> +                       /*
> +                        * Check if the folio still belongs to the target swap entry after
> +                        * acquiring the lock. Folio can be freed in the swap cache while
> +                        * not locked.
> +                        */
> +                       if (unlikely(!folio_test_swapcache(folio) ||
> +                                    entry.val != folio->swap.val)) {
> +                               err = -EAGAIN;
> +                               goto out;
> +                       }

To avoid further increasing move_pages_pte() size, I recommend moving
the entire 'pte not present' case into move_swap_pte(), and maybe
returning some positive integer (or something more appropriate) to
handle the retry case. And then in move_swap_pte(), as suggested
above, you can do filemap_get_folio only once after locking ptl.

I think this will fix the bug as well as improve the code's organization.
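
Roughly, the caller side I have in mind would look something like this
(hypothetical, untested sketch; the positive "retry" return value is an
invented convention, not existing code):

        /* In move_pages_pte(), for the !pte_present(orig_src_pte) case: */
        err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
                            orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
                            dst_ptl, src_ptl, src_folio);
        if (err > 0) {          /* hypothetical: positive means "retry" */
                err = -EAGAIN;  /* or jump back to the retry label */
                goto out;
        }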

>                 }
>                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
>                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> --
> 2.49.0
>



* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-30 23:40 ` Lokesh Gidra
@ 2025-05-31  3:37   ` Barry Song
  2025-05-31  6:25     ` Kairui Song
  2025-05-31  6:22   ` Kairui Song
  1 sibling, 1 reply; 16+ messages in thread
From: Barry Song @ 2025-05-31  3:37 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Kairui Song, linux-mm, Andrew Morton, Peter Xu,
	Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, stable,
	linux-kernel

On Sat, May 31, 2025 at 11:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Fri, May 30, 2025 at 1:17 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > lookup, and try to move the found folio to the faulting vma when.
> > Currently, it relies on the PTE value check to ensure the moved folio
> > still belongs to the src swap entry, which turns out is not reliable.
> >
> > While working and reviewing the swap table series with Barry, following
> > existing race is observed and reproduced [1]:
> >
> > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> >
> > CPU1                               CPU2
> > userfaultfd_move
> >   move_pages_pte()
> >     entry = pte_to_swp_entry(orig_src_pte);
> >     // Here it got entry = S1
> >     ... < Somehow interrupted> ...
> >                                    <swapin src_pte, alloc and use folio A>
> >                                    // folio A is just a new allocated folio
> >                                    // and get installed into src_pte
> >                                    <frees swap entry S1>
> >                                    // src_pte now points to folio A, S1
> >                                    // has swap count == 0, it can be freed
> >                                    // by folio_swap_swap or swap
> >                                    // allocator's reclaim.
> >                                    <try to swap out another folio B>
> >                                    // folio B is a folio in another VMA.
> >                                    <put folio B to swap cache using S1 >
> >                                    // S1 is freed, folio B could use it
> >                                    // for swap out with no problem.
> >                                    ...
> >     folio = filemap_get_folio(S1)
> >     // Got folio B here !!!
> >     ... < Somehow interrupted again> ...
> >                                    <swapin folio B and free S1>
> >                                    // Now S1 is free to be used again.
> >                                    <swapout src_pte & folio A using S1>
> >                                    // Now src_pte is a swap entry pte
> >                                    // holding S1 again.
> >     folio_trylock(folio)
> >     move_swap_pte
> >       double_pt_lock
> >       is_pte_pages_stable
> >       // Check passed because src_pte == S1
> >       folio_move_anon_rmap(...)
> >       // Moved invalid folio B here !!!
> >
> > The race window is very short and requires multiple collisions of
> > multiple rare events, so it's very unlikely to happen, but with a
> > deliberately constructed reproducer and increased time window, it can be
> > reproduced [1].
>
> Thanks for catching and fixing this. Just to clarify a few things
> about your reproducer:
> 1. Is it necessary for the 'race' mapping to be MAP_SHARED, or
> MAP_PRIVATE will work as well?
> 2. You mentioned that the 'current dir is on a block device'. Are you
> indicating that if we are using zram for swap then it doesn't
> reproduce?
>
> >
> > It's also possible that folio (A) is swapped in, and swapped out again
> > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > swap cache so it needs to be moved too. In this case we should also try
> > again so kernel won't miss a folio move.
> >
> > Fix this by checking if the folio is the valid swap cache folio after
> > acquiring the folio lock, and checking the swap cache again after
> > acquiring the src_pte lock.
> >
> > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > we don't need to worry about that since folios only might get exposed to
> > swap cache in the swap out path, and it's covered in this patch too by
> > checking the swap cache again after acquiring src_pte lock.
> >
> > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..a1564d205dfb 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/hugetlb.h>
> >  #include <linux/shmem_fs.h>
> > +#include <linux/delay.h>
> I guess you mistakenly left it from your reproducer code :)
> >  #include <asm/tlbflush.h>
> >  #include <asm/tlb.h>
> >  #include "internal.h"
> > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> >                          struct folio *src_folio)
> >  {
> > +       swp_entry_t entry;
> > +
> >         double_pt_lock(dst_ptl, src_ptl);
> >
> >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >         if (src_folio) {
> >                 folio_move_anon_rmap(src_folio, dst_vma);
> >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +       } else {
> > +               /*
> > +                * Check again after acquiring the src_pte lock. Or we might
> > +                * miss a new loaded swap cache folio.
> > +                */
> > +               entry = pte_to_swp_entry(orig_src_pte);
> > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > +                                             swap_cache_index(entry));
>
> Given the non-trivial overhead of filemap_get_folio(), do you think it
> will work if filemap_get_filio() was only once after locking src_ptl?
> Please correct me if my assumption about the overhead is wrong.

Not quite sure, as we take folio_lock(src_folio) before move_swap_pte().
Can we safely do folio_move_anon_rmap() + update src_folio->index while
not holding the folio lock?

>
> > +               if (!IS_ERR_OR_NULL(src_folio)) {
> > +                       double_pt_unlock(dst_ptl, src_ptl);
> > +                       folio_put(src_folio);
> > +                       return -EAGAIN;
> > +               }
> >         }
> >
> >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > @@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                                 folio_lock(src_folio);
> >                                 goto retry;
> >                         }
> > +                       /*
> > +                        * Check if the folio still belongs to the target swap entry after
> > +                        * acquiring the lock. Folio can be freed in the swap cache while
> > +                        * not locked.
> > +                        */
> > +                       if (unlikely(!folio_test_swapcache(folio) ||
> > +                                    entry.val != folio->swap.val)) {
> > +                               err = -EAGAIN;
> > +                               goto out;
> > +                       }
>
> To avoid further increasing move_pages_pte() size, I recommend moving
> the entire 'pte not present' case into move_swap_pte(), and maybe
> returning some positive integer (or something more appropriate) to
> handle the retry case. And then in move_swap_pte(), as suggested
> above, you can do filemap_get_folio only once after locking ptl.
>
> I think this will fix the bug as well as improve the code's organization.
>
> >                 }
> >                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> >                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > --
> > 2.49.0
> >

Thanks
Barry



* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-30 20:17 [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache Kairui Song
  2025-05-30 23:40 ` Lokesh Gidra
@ 2025-05-31  4:04 ` Barry Song
  2025-05-31  4:41   ` Barry Song
  1 sibling, 1 reply; 16+ messages in thread
From: Barry Song @ 2025-05-31  4:04 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Peter Xu, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
	linux-kernel

On Sat, May 31, 2025 at 8:17 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> lookup, and try to move the found folio to the faulting vma when.
> Currently, it relies on the PTE value check to ensure the moved folio
> still belongs to the src swap entry, which turns out is not reliable.
>
> While working and reviewing the swap table series with Barry, following
> existing race is observed and reproduced [1]:
>
> ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
>  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
>
> CPU1                               CPU2
> userfaultfd_move
>   move_pages_pte()
>     entry = pte_to_swp_entry(orig_src_pte);
>     // Here it got entry = S1
>     ... < Somehow interrupted> ...
>                                    <swapin src_pte, alloc and use folio A>
>                                    // folio A is just a new allocated folio
>                                    // and get installed into src_pte
>                                    <frees swap entry S1>
>                                    // src_pte now points to folio A, S1
>                                    // has swap count == 0, it can be freed
>                                    // by folio_swap_swap or swap
>                                    // allocator's reclaim.
>                                    <try to swap out another folio B>
>                                    // folio B is a folio in another VMA.
>                                    <put folio B to swap cache using S1 >
>                                    // S1 is freed, folio B could use it
>                                    // for swap out with no problem.
>                                    ...
>     folio = filemap_get_folio(S1)
>     // Got folio B here !!!
>     ... < Somehow interrupted again> ...
>                                    <swapin folio B and free S1>
>                                    // Now S1 is free to be used again.
>                                    <swapout src_pte & folio A using S1>
>                                    // Now src_pte is a swap entry pte
>                                    // holding S1 again.
>     folio_trylock(folio)
>     move_swap_pte
>       double_pt_lock
>       is_pte_pages_stable
>       // Check passed because src_pte == S1
>       folio_move_anon_rmap(...)
>       // Moved invalid folio B here !!!
>
> The race window is very short and requires multiple collisions of
> multiple rare events, so it's very unlikely to happen, but with a
> deliberately constructed reproducer and increased time window, it can be
> reproduced [1].
>
> It's also possible that folio (A) is swapped in, and swapped out again
> after the filemap_get_folio lookup, in such case folio (A) may stay in
> swap cache so it needs to be moved too. In this case we should also try
> again so kernel won't miss a folio move.
>
> Fix this by checking if the folio is the valid swap cache folio after
> acquiring the folio lock, and checking the swap cache again after
> acquiring the src_pte lock.
>
> SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> we don't need to worry about that since folios only might get exposed to
> swap cache in the swap out path, and it's covered in this patch too by
> checking the swap cache again after acquiring src_pte lock.
>
> Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..a1564d205dfb 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -15,6 +15,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
>  #include <linux/shmem_fs.h>
> +#include <linux/delay.h>
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
>                          struct folio *src_folio)
>  {
> +       swp_entry_t entry;
> +
>         double_pt_lock(dst_ptl, src_ptl);
>
>         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>         if (src_folio) {
>                 folio_move_anon_rmap(src_folio, dst_vma);
>                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> +       } else {
> +               /*
> +                * Check again after acquiring the src_pte lock. Or we might
> +                * miss a new loaded swap cache folio.
> +                */
> +               entry = pte_to_swp_entry(orig_src_pte);
> +               src_folio = filemap_get_folio(swap_address_space(entry),
> +                                             swap_cache_index(entry));
> +               if (!IS_ERR_OR_NULL(src_folio)) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       folio_put(src_folio);
> +                       return -EAGAIN;
> +               }
>         }

step 1: src pte points to a swap entry without swapcache
step 2: we call move_swap_pte()
step 3: someone swaps in src_pte via swapin_readahead() and makes src_pte's
            swap entry have a swapcache folio again -- for a non-sync/non-zRAM
            swap device;
step 4: move_swap_pte() gets the ptl, moves src_pte to dst_pte and *clears* src_pte;
step 5: do_swap_page() for src_pte takes the ptl and finds that the pte has
            been cleared in step 4; pte_same() returns false;
step 6: do_swap_page() won't map src_pte to the new swapcache folio from step 3;
            if that swapcache folio is dropped, it seems everything is fine.

So the real issue is that do_swap_page() doesn't drop the new swapcache folio
even when pte_same() returns false? That means a dst_pte swap-in can still hit
the swap cache entry brought in by the src_pte swap-in?
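
For reference, the relevant recheck in do_swap_page() is roughly the
following (paraphrased from mm/memory.c, not the exact code):

        /* After swap-in, do_swap_page() retakes the PTL and bails out if
         * the PTE changed under it, e.g. was cleared by UFFDIO_MOVE: */
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                                       &vmf->ptl);
        if (unlikely(!vmf->pte ||
                     !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                goto out_nomap;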

>
>         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>                                 folio_lock(src_folio);
>                                 goto retry;
>                         }
> +                       /*
> +                        * Check if the folio still belongs to the target swap entry after
> +                        * acquiring the lock. Folio can be freed in the swap cache while
> +                        * not locked.
> +                        */
> +                       if (unlikely(!folio_test_swapcache(folio) ||
> +                                    entry.val != folio->swap.val)) {
> +                               err = -EAGAIN;
> +                               goto out;
> +                       }
>                 }
>                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
>                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> --
> 2.49.0
>

Thanks
Barry



* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-31  4:04 ` Barry Song
@ 2025-05-31  4:41   ` Barry Song
  2025-05-31  6:10     ` Lokesh Gidra
  2025-05-31  6:54     ` Kairui Song
  0 siblings, 2 replies; 16+ messages in thread
From: Barry Song @ 2025-05-31  4:41 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Peter Xu, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
	linux-kernel

On Sat, May 31, 2025 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 8:17 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > lookup, and try to move the found folio to the faulting vma when.
> > Currently, it relies on the PTE value check to ensure the moved folio
> > still belongs to the src swap entry, which turns out is not reliable.
> >
> > While working and reviewing the swap table series with Barry, following
> > existing race is observed and reproduced [1]:
> >
> > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> >
> > CPU1                               CPU2
> > userfaultfd_move
> >   move_pages_pte()
> >     entry = pte_to_swp_entry(orig_src_pte);
> >     // Here it got entry = S1
> >     ... < Somehow interrupted> ...
> >                                    <swapin src_pte, alloc and use folio A>
> >                                    // folio A is just a new allocated folio
> >                                    // and get installed into src_pte
> >                                    <frees swap entry S1>
> >                                    // src_pte now points to folio A, S1
> >                                    // has swap count == 0, it can be freed
> >                                    // by folio_swap_swap or swap
> >                                    // allocator's reclaim.
> >                                    <try to swap out another folio B>
> >                                    // folio B is a folio in another VMA.
> >                                    <put folio B to swap cache using S1 >
> >                                    // S1 is freed, folio B could use it
> >                                    // for swap out with no problem.
> >                                    ...
> >     folio = filemap_get_folio(S1)
> >     // Got folio B here !!!
> >     ... < Somehow interrupted again> ...
> >                                    <swapin folio B and free S1>
> >                                    // Now S1 is free to be used again.
> >                                    <swapout src_pte & folio A using S1>
> >                                    // Now src_pte is a swap entry pte
> >                                    // holding S1 again.
> >     folio_trylock(folio)
> >     move_swap_pte
> >       double_pt_lock
> >       is_pte_pages_stable
> >       // Check passed because src_pte == S1
> >       folio_move_anon_rmap(...)
> >       // Moved invalid folio B here !!!
> >
> > The race window is very short and requires multiple collisions of
> > multiple rare events, so it's very unlikely to happen, but with a
> > deliberately constructed reproducer and increased time window, it can be
> > reproduced [1].
> >
> > It's also possible that folio (A) is swapped in, and swapped out again
> > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > swap cache so it needs to be moved too. In this case we should also try
> > again so kernel won't miss a folio move.
> >
> > Fix this by checking if the folio is the valid swap cache folio after
> > acquiring the folio lock, and checking the swap cache again after
> > acquiring the src_pte lock.
> >
> > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > we don't need to worry about that since folios only might get exposed to
> > swap cache in the swap out path, and it's covered in this patch too by
> > checking the swap cache again after acquiring src_pte lock.
> >
> > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..a1564d205dfb 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/hugetlb.h>
> >  #include <linux/shmem_fs.h>
> > +#include <linux/delay.h>
> >  #include <asm/tlbflush.h>
> >  #include <asm/tlb.h>
> >  #include "internal.h"
> > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> >                          struct folio *src_folio)
> >  {
> > +       swp_entry_t entry;
> > +
> >         double_pt_lock(dst_ptl, src_ptl);
> >
> >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >         if (src_folio) {
> >                 folio_move_anon_rmap(src_folio, dst_vma);
> >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +       } else {
> > +               /*
> > +                * Check again after acquiring the src_pte lock. Or we might
> > +                * miss a new loaded swap cache folio.
> > +                */
> > +               entry = pte_to_swp_entry(orig_src_pte);
> > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > +                                             swap_cache_index(entry));
> > +               if (!IS_ERR_OR_NULL(src_folio)) {
> > +                       double_pt_unlock(dst_ptl, src_ptl);
> > +                       folio_put(src_folio);
> > +                       return -EAGAIN;
> > +               }
> >         }
>
> step 1: src pte points to a swap entry without swapcache
> step 2: we call move_swap_pte()
> step 3: someone swap-in src_pte by swap_readhead() and make src_pte's swap entry
> have swapcache again - for non-sync/non-zRAM swap device;
> step 4: move_swap_pte() gets ptl, move src_pte to dst_pte and *clear* src_pte;
> step 5: do_swap_page() for src_pte holds the ptl and found pte has
> been cleared in
>             step 4; pte_same() returns false;
> step 6: do_swap_page() won't map src_pte to the new swapcache got from step 3;
>             if the swapcache folio is dropped, it seems everything is fine.
>
> So the real issue is that do_swap_page() doesn’t drop the new swapcache
> even when pte_same() returns false? That means the dst_pte swap-in
> can still hit the swap cache entry brought in by the src_pte's swap-in?

It also seems possible for a sync zRAM device.

 step 1: src pte points to a swap entry S without swapcache
 step 2: we call move_swap_pte()
 step 3: someone swaps in src_pte via the sync path, no swapcache; swap slot
             S is freed -- for zRAM;
 step 4: someone swaps out src_pte and gets exactly the same swap slot S as
             step 1, adding the folio to the swapcache due to the swapout;
 step 5: move_swap_pte() gets the ptl and finds the page tables are stable,
             since the swap-out happens to use the same swap slot as step 1;
 step 6: we clear src_pte and move src_pte to dst_pte, but miss moving the folio.

Yep, we really need to re-check the pte for a swapcache folio after holding the PTL.

>
> >
> >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > @@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                                 folio_lock(src_folio);
> >                                 goto retry;
> >                         }
> > +                       /*
> > +                        * Check if the folio still belongs to the target swap entry after
> > +                        * acquiring the lock. Folio can be freed in the swap cache while
> > +                        * not locked.
> > +                        */
> > +                       if (unlikely(!folio_test_swapcache(folio) ||
> > +                                    entry.val != folio->swap.val)) {
> > +                               err = -EAGAIN;
> > +                               goto out;
> > +                       }
> >                 }
> >                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> >                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > --
> > 2.49.0
> >
>

Thanks
Barry



* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-31  4:41   ` Barry Song
@ 2025-05-31  6:10     ` Lokesh Gidra
  2025-05-31  6:36       ` Kairui Song
  2025-05-31  6:54     ` Kairui Song
  1 sibling, 1 reply; 16+ messages in thread
From: Lokesh Gidra @ 2025-05-31  6:10 UTC (permalink / raw)
  To: Barry Song
  Cc: Kairui Song, linux-mm, Andrew Morton, Peter Xu,
	Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, stable,
	linux-kernel

On Fri, May 30, 2025 at 9:42 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, May 31, 2025 at 8:17 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > lookup, and try to move the found folio to the faulting vma when.
> > > Currently, it relies on the PTE value check to ensure the moved folio
> > > still belongs to the src swap entry, which turns out is not reliable.
> > >
> > > While working and reviewing the swap table series with Barry, following
> > > existing race is observed and reproduced [1]:
> > >
> > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > >
> > > CPU1                               CPU2
> > > userfaultfd_move
> > >   move_pages_pte()
> > >     entry = pte_to_swp_entry(orig_src_pte);
> > >     // Here it got entry = S1
> > >     ... < Somehow interrupted> ...
> > >                                    <swapin src_pte, alloc and use folio A>
> > >                                    // folio A is just a new allocated folio
> > >                                    // and get installed into src_pte
> > >                                    <frees swap entry S1>
> > >                                    // src_pte now points to folio A, S1
> > >                                    // has swap count == 0, it can be freed
> > >                                    // by folio_swap_swap or swap
> > >                                    // allocator's reclaim.
> > >                                    <try to swap out another folio B>
> > >                                    // folio B is a folio in another VMA.
> > >                                    <put folio B to swap cache using S1 >
> > >                                    // S1 is freed, folio B could use it
> > >                                    // for swap out with no problem.
> > >                                    ...
> > >     folio = filemap_get_folio(S1)
> > >     // Got folio B here !!!
> > >     ... < Somehow interrupted again> ...
> > >                                    <swapin folio B and free S1>
> > >                                    // Now S1 is free to be used again.
> > >                                    <swapout src_pte & folio A using S1>
> > >                                    // Now src_pte is a swap entry pte
> > >                                    // holding S1 again.
> > >     folio_trylock(folio)
> > >     move_swap_pte
> > >       double_pt_lock
> > >       is_pte_pages_stable
> > >       // Check passed because src_pte == S1
> > >       folio_move_anon_rmap(...)
> > >       // Moved invalid folio B here !!!
> > >
> > > The race window is very short and requires multiple collisions of
> > > multiple rare events, so it's very unlikely to happen, but with a
> > > deliberately constructed reproducer and increased time window, it can be
> > > reproduced [1].
> > >
> > > It's also possible that folio (A) is swapped in, and swapped out again
> > > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > > swap cache so it needs to be moved too. In this case we should also try
> > > again so kernel won't miss a folio move.
> > >
> > > Fix this by checking if the folio is the valid swap cache folio after
> > > acquiring the folio lock, and checking the swap cache again after
> > > acquiring the src_pte lock.
> > >
> > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > > we don't need to worry about that since folios only might get exposed to
> > > swap cache in the swap out path, and it's covered in this patch too by
> > > checking the swap cache again after acquiring src_pte lock.
> > >
> > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > >  1 file changed, 26 insertions(+)
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index bc473ad21202..a1564d205dfb 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -15,6 +15,7 @@
> > >  #include <linux/mmu_notifier.h>
> > >  #include <linux/hugetlb.h>
> > >  #include <linux/shmem_fs.h>
> > > +#include <linux/delay.h>
> > >  #include <asm/tlbflush.h>
> > >  #include <asm/tlb.h>
> > >  #include "internal.h"
> > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > >                          struct folio *src_folio)
> > >  {
> > > +       swp_entry_t entry;
> > > +
> > >         double_pt_lock(dst_ptl, src_ptl);
> > >
> > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >         if (src_folio) {
> > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > +       } else {
> > > +               /*
> > > +                * Check again after acquiring the src_pte lock. Or we might
> > > +                * miss a new loaded swap cache folio.
> > > +                */
> > > +               entry = pte_to_swp_entry(orig_src_pte);
> > > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > > +                                             swap_cache_index(entry));
> > > +               if (!IS_ERR_OR_NULL(src_folio)) {
> > > +                       double_pt_unlock(dst_ptl, src_ptl);
> > > +                       folio_put(src_folio);
> > > +                       return -EAGAIN;
> > > +               }
> > >         }
> >
> > step 1: src pte points to a swap entry without swapcache
> > step 2: we call move_swap_pte()
> > step 3: someone swap-in src_pte by swap_readhead() and make src_pte's swap entry
> > have swapcache again - for non-sync/non-zRAM swap device;
> > step 4: move_swap_pte() gets ptl, move src_pte to dst_pte and *clear* src_pte;
> > step 5: do_swap_page() for src_pte holds the ptl and found pte has
> > been cleared in
> >             step 4; pte_same() returns false;
> > step 6: do_swap_page() won't map src_pte to the new swapcache got from step 3;
> >             if the swapcache folio is dropped, it seems everything is fine.
> >
> > So the real issue is that do_swap_page() doesn’t drop the new swapcache
> > even when pte_same() returns false? That means the dst_pte swap-in
> > can still hit the swap cache entry brought in by the src_pte's swap-in?
>
> It seems also possible for the sync zRAM device.
>
>  step 1: src pte points to a swap entry S without swapcache
>  step 2: we call move_swap_pte()
>  step 3: someone swap-in src_pte by sync path, no swapcache; swap slot
> S is freed.
>              -- for zRAM;
>  step 4: someone swap-out src_pte, get the exactly same swap slot S as step 1,
>              adds folio to swapcache due to swapout;
>  step 5: move_swap_pte() gets ptl and finds page tables are stable
> since swap-out
>              happens to have the same swap slot as step1;
>  step 6: we clear src_pte, move src_pte to dst_pte; but miss to move the folio.
>
> Yep, we really need to re-check pte for swapcache after holding PTL.
>
Any idea what the overhead of filemap_get_folio() is? In particular,
when no folio exists for the given entry, how costly is it?

Given how rare this is, unless filemap_get_folio() is cheap in the 'no
folio' case, and if there is no way to avoid calling it after holding the
PTL, then we should do it only once at that point. If a folio is returned,
then, like in the pte_present() case, we attempt folio_trylock() with the
PTL held, otherwise do the retry dance.
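
Something like this rough sketch is what I mean (hypothetical and
untested, just to show the shape):

        /* Single lookup, done with both PTLs already held: */
        entry = pte_to_swp_entry(orig_src_pte);
        src_folio = filemap_get_folio(swap_address_space(entry),
                                      swap_cache_index(entry));
        if (!IS_ERR_OR_NULL(src_folio)) {
                if (!folio_trylock(src_folio)) {
                        /* retry dance: drop locks and the ref, return -EAGAIN */
                        double_pt_unlock(dst_ptl, src_ptl);
                        folio_put(src_folio);
                        return -EAGAIN;
                }
                /* locked under PTL: move it like the pte_present() case,
                 * i.e. folio_move_anon_rmap() + index update */
        }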
> >
> > >
> > >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > > @@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > >                                 folio_lock(src_folio);
> > >                                 goto retry;
> > >                         }
> > > +                       /*
> > > +                        * Check if the folio still belongs to the target swap entry after
> > > +                        * acquiring the lock. Folio can be freed in the swap cache while
> > > +                        * not locked.
> > > +                        */
> > > +                       if (unlikely(!folio_test_swapcache(folio) ||
> > > +                                    entry.val != folio->swap.val)) {
> > > +                               err = -EAGAIN;
> > > +                               goto out;
> > > +                       }
> > >                 }
> > >                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > >                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > --
> > > 2.49.0
> > >
> >
>
> Thanks
> Barry



* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-30 23:40 ` Lokesh Gidra
  2025-05-31  3:37   ` Barry Song
@ 2025-05-31  6:22   ` Kairui Song
  1 sibling, 0 replies; 16+ messages in thread
From: Kairui Song @ 2025-05-31  6:22 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: linux-mm, Andrew Morton, Barry Song, Peter Xu, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, stable, linux-kernel

On Sat, May 31, 2025 at 7:42 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Fri, May 30, 2025 at 1:17 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > lookup, and try to move the found folio to the faulting vma when.
> > Currently, it relies on the PTE value check to ensure the moved folio
> > still belongs to the src swap entry, which turns out is not reliable.
> >
> > While working and reviewing the swap table series with Barry, following
> > existing race is observed and reproduced [1]:
> >
> > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> >
> > CPU1                               CPU2
> > userfaultfd_move
> >   move_pages_pte()
> >     entry = pte_to_swp_entry(orig_src_pte);
> >     // Here it got entry = S1
> >     ... < Somehow interrupted> ...
> >                                    <swapin src_pte, alloc and use folio A>
> >                                    // folio A is just a new allocated folio
> >                                    // and get installed into src_pte
> >                                    <frees swap entry S1>
> >                                    // src_pte now points to folio A, S1
> >                                    // has swap count == 0, it can be freed
> >                                    // by folio_swap_swap or swap
> >                                    // allocator's reclaim.
> >                                    <try to swap out another folio B>
> >                                    // folio B is a folio in another VMA.
> >                                    <put folio B to swap cache using S1 >
> >                                    // S1 is freed, folio B could use it
> >                                    // for swap out with no problem.
> >                                    ...
> >     folio = filemap_get_folio(S1)
> >     // Got folio B here !!!
> >     ... < Somehow interrupted again> ...
> >                                    <swapin folio B and free S1>
> >                                    // Now S1 is free to be used again.
> >                                    <swapout src_pte & folio A using S1>
> >                                    // Now src_pte is a swap entry pte
> >                                    // holding S1 again.
> >     folio_trylock(folio)
> >     move_swap_pte
> >       double_pt_lock
> >       is_pte_pages_stable
> >       // Check passed because src_pte == S1
> >       folio_move_anon_rmap(...)
> >       // Moved invalid folio B here !!!
> >
> > The race window is very short and requires multiple collisions of
> > multiple rare events, so it's very unlikely to happen, but with a
> > deliberately constructed reproducer and increased time window, it can be
> > reproduced [1].
>
> Thanks for catching and fixing this. Just to clarify a few things
> about your reproducer:
> 1. Is it necessary for the 'race' mapping to be MAP_SHARED, or
> MAP_PRIVATE will work as well?

Hi, I used MAP_SHARED just to prevent VMA merging, so that folio B and
folio A belong to different VMAs and folio_move_anon_rmap will cause a
real problem. The race itself is not related to the map flag.
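
Just to illustrate the VMA-merging point (this is not the actual
reproducer from [1], only a hypothetical snippet): two adjacent anonymous
mappings with different flags won't be merged into one VMA.

        /* Illustration only, not the reproducer. */
        char *a = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        /* A MAP_SHARED area placed right after it stays a separate VMA,
         * so pages in 'a' and 'b' belong to different VMAs. */
        char *b = mmap(a + sz, sz, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);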

> 2. You mentioned that the 'current dir is on a block device'. Are you
> indicating that if we are using zram for swap then it doesn't
> reproduce?

ZRAM may also trigger the race, with a slightly different race window,
but that window is even shorter, so I can't reproduce it with ZRAM.

>
> >
> > It's also possible that folio (A) is swapped in, and swapped out again
> > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > swap cache so it needs to be moved too. In this case we should also try
> > again so kernel won't miss a folio move.
> >
> > Fix this by checking if the folio is the valid swap cache folio after
> > acquiring the folio lock, and checking the swap cache again after
> > acquiring the src_pte lock.
> >
> > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > we don't need to worry about that since folios only might get exposed to
> > swap cache in the swap out path, and it's covered in this patch too by
> > checking the swap cache again after acquiring src_pte lock.
> >
> > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..a1564d205dfb 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/hugetlb.h>
> >  #include <linux/shmem_fs.h>
> > +#include <linux/delay.h>
> I guess you mistakenly left it from your reproducer code :)

Ah, yes, I'll drop this :)

> >  #include <asm/tlbflush.h>
> >  #include <asm/tlb.h>
> >  #include "internal.h"
> > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> >                          struct folio *src_folio)
> >  {
> > +       swp_entry_t entry;
> > +
> >         double_pt_lock(dst_ptl, src_ptl);
> >
> >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >         if (src_folio) {
> >                 folio_move_anon_rmap(src_folio, dst_vma);
> >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +       } else {
> > +               /*
> > +                * Check again after acquiring the src_pte lock. Or we might
> > +                * miss a new loaded swap cache folio.
> > +                */
> > +               entry = pte_to_swp_entry(orig_src_pte);
> > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > +                                             swap_cache_index(entry));
>
> Given the non-trivial overhead of filemap_get_folio(), do you think it
> will work if filemap_get_filio() was only once after locking src_ptl?
> Please correct me if my assumption about the overhead is wrong.

No, we can't do the filemap_get_folio after locking src_ptl: moving the
folio requires locking it, and locking a folio inside a PTE lock looks
like a bad idea... So I just added a lockless lookup and fall back to
trying again if one is found.

The overhead should be low, I think: it's a lockless xarray lookup, and
from the profiling I've done for SWAP performance optimizations, an
xarray lookup is not a heavy burden.
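
Conceptually the lockless lookup boils down to something like this
(simplified sketch, not the exact filemap_get_folio() implementation):

        /* RCU-protected xarray walk plus a reference grab; the real path
         * is filemap_get_folio() -> filemap_get_entry(). */
        rcu_read_lock();
        folio = xa_load(&swap_address_space(entry)->i_pages,
                        swap_cache_index(entry));
        if (folio && !folio_try_get(folio))
                folio = NULL;   /* lost a race with the folio being freed */
        rcu_read_unlock();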

>
> > +               if (!IS_ERR_OR_NULL(src_folio)) {
> > +                       double_pt_unlock(dst_ptl, src_ptl);
> > +                       folio_put(src_folio);
> > +                       return -EAGAIN;
> > +               }
> >         }
> >
> >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > @@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                                 folio_lock(src_folio);
> >                                 goto retry;
> >                         }
> > +                       /*
> > +                        * Check if the folio still belongs to the target swap entry after
> > +                        * acquiring the lock. Folio can be freed in the swap cache while
> > +                        * not locked.
> > +                        */
> > +                       if (unlikely(!folio_test_swapcache(folio) ||
> > +                                    entry.val != folio->swap.val)) {
> > +                               err = -EAGAIN;
> > +                               goto out;
> > +                       }
>
> To avoid further increasing move_pages_pte() size, I recommend moving
> the entire 'pte not present' case into move_swap_pte(), and maybe
> returning some positive integer (or something more appropriate) to
> handle the retry case. And then in move_swap_pte(), as suggested
> above, you can do filemap_get_folio only once after locking ptl.
>
> I think this will fix the bug as well as improve the code's organization.
>
> >                 }
> >                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> >                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > --
> > 2.49.0
> >
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-31  3:37   ` Barry Song
@ 2025-05-31  6:25     ` Kairui Song
  2025-05-31  6:35       ` Barry Song
  0 siblings, 1 reply; 16+ messages in thread
From: Kairui Song @ 2025-05-31  6:25 UTC (permalink / raw)
  To: Barry Song
  Cc: Lokesh Gidra, linux-mm, Andrew Morton, Peter Xu,
	Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, stable,
	linux-kernel

On Sat, May 31, 2025 at 11:39 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 11:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > On Fri, May 30, 2025 at 1:17 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > lookup, and try to move the found folio to the faulting vma when.
> > > Currently, it relies on the PTE value check to ensure the moved folio
> > > still belongs to the src swap entry, which turns out is not reliable.
> > >
> > > While working and reviewing the swap table series with Barry, following
> > > existing race is observed and reproduced [1]:
> > >
> > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > >
> > > CPU1                               CPU2
> > > userfaultfd_move
> > >   move_pages_pte()
> > >     entry = pte_to_swp_entry(orig_src_pte);
> > >     // Here it got entry = S1
> > >     ... < Somehow interrupted> ...
> > >                                    <swapin src_pte, alloc and use folio A>
> > >                                    // folio A is just a new allocated folio
> > >                                    // and get installed into src_pte
> > >                                    <frees swap entry S1>
> > >                                    // src_pte now points to folio A, S1
> > >                                    // has swap count == 0, it can be freed
> > >                                    // by folio_swap_swap or swap
> > >                                    // allocator's reclaim.
> > >                                    <try to swap out another folio B>
> > >                                    // folio B is a folio in another VMA.
> > >                                    <put folio B to swap cache using S1 >
> > >                                    // S1 is freed, folio B could use it
> > >                                    // for swap out with no problem.
> > >                                    ...
> > >     folio = filemap_get_folio(S1)
> > >     // Got folio B here !!!
> > >     ... < Somehow interrupted again> ...
> > >                                    <swapin folio B and free S1>
> > >                                    // Now S1 is free to be used again.
> > >                                    <swapout src_pte & folio A using S1>
> > >                                    // Now src_pte is a swap entry pte
> > >                                    // holding S1 again.
> > >     folio_trylock(folio)
> > >     move_swap_pte
> > >       double_pt_lock
> > >       is_pte_pages_stable
> > >       // Check passed because src_pte == S1
> > >       folio_move_anon_rmap(...)
> > >       // Moved invalid folio B here !!!
> > >
> > > The race window is very short and requires multiple collisions of
> > > multiple rare events, so it's very unlikely to happen, but with a
> > > deliberately constructed reproducer and increased time window, it can be
> > > reproduced [1].
> >
> > Thanks for catching and fixing this. Just to clarify a few things
> > about your reproducer:
> > 1. Is it necessary for the 'race' mapping to be MAP_SHARED, or
> > MAP_PRIVATE will work as well?
> > 2. You mentioned that the 'current dir is on a block device'. Are you
> > indicating that if we are using zram for swap then it doesn't
> > reproduce?
> >
> > >
> > > It's also possible that folio (A) is swapped in, and swapped out again
> > > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > > swap cache so it needs to be moved too. In this case we should also try
> > > again so kernel won't miss a folio move.
> > >
> > > Fix this by checking if the folio is the valid swap cache folio after
> > > acquiring the folio lock, and checking the swap cache again after
> > > acquiring the src_pte lock.
> > >
> > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > > we don't need to worry about that since folios only might get exposed to
> > > swap cache in the swap out path, and it's covered in this patch too by
> > > checking the swap cache again after acquiring src_pte lock.
> > >
> > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > >  1 file changed, 26 insertions(+)
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index bc473ad21202..a1564d205dfb 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -15,6 +15,7 @@
> > >  #include <linux/mmu_notifier.h>
> > >  #include <linux/hugetlb.h>
> > >  #include <linux/shmem_fs.h>
> > > +#include <linux/delay.h>
> > I guess you mistakenly left it from your reproducer code :)
> > >  #include <asm/tlbflush.h>
> > >  #include <asm/tlb.h>
> > >  #include "internal.h"
> > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > >                          struct folio *src_folio)
> > >  {
> > > +       swp_entry_t entry;
> > > +
> > >         double_pt_lock(dst_ptl, src_ptl);
> > >
> > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >         if (src_folio) {
> > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > +       } else {
> > > +               /*
> > > +                * Check again after acquiring the src_pte lock. Or we might
> > > +                * miss a new loaded swap cache folio.
> > > +                */
> > > +               entry = pte_to_swp_entry(orig_src_pte);
> > > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > > +                                             swap_cache_index(entry));
> >
> > Given the non-trivial overhead of filemap_get_folio(), do you think it
> > will work if filemap_get_filio() was only once after locking src_ptl?
> > Please correct me if my assumption about the overhead is wrong.
>
> not quite sure as we have a folio_lock(src_folio) before move_swap_pte().
> can we safely folio_move_anon_rmap + src_folio->index while not holding
> folio lock?

I think no: without holding the folio lock we can't even be sure the folio
is still in the swap cache, so it could be a freed folio that no longer
belongs to any VMA.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-31  6:25     ` Kairui Song
@ 2025-05-31  6:35       ` Barry Song
  2025-05-31  7:00         ` Kairui Song
  0 siblings, 1 reply; 16+ messages in thread
From: Barry Song @ 2025-05-31  6:35 UTC (permalink / raw)
  To: Kairui Song
  Cc: Lokesh Gidra, linux-mm, Andrew Morton, Peter Xu,
	Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, stable,
	linux-kernel

On Sat, May 31, 2025 at 6:25 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 11:39 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, May 31, 2025 at 11:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > >
> > > On Fri, May 30, 2025 at 1:17 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > > lookup, and try to move the found folio to the faulting vma when.
> > > > Currently, it relies on the PTE value check to ensure the moved folio
> > > > still belongs to the src swap entry, which turns out is not reliable.
> > > >
> > > > While working and reviewing the swap table series with Barry, following
> > > > existing race is observed and reproduced [1]:
> > > >
> > > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > > >
> > > > CPU1                               CPU2
> > > > userfaultfd_move
> > > >   move_pages_pte()
> > > >     entry = pte_to_swp_entry(orig_src_pte);
> > > >     // Here it got entry = S1
> > > >     ... < Somehow interrupted> ...
> > > >                                    <swapin src_pte, alloc and use folio A>
> > > >                                    // folio A is just a new allocated folio
> > > >                                    // and get installed into src_pte
> > > >                                    <frees swap entry S1>
> > > >                                    // src_pte now points to folio A, S1
> > > >                                    // has swap count == 0, it can be freed
> > > >                                    // by folio_swap_swap or swap
> > > >                                    // allocator's reclaim.
> > > >                                    <try to swap out another folio B>
> > > >                                    // folio B is a folio in another VMA.
> > > >                                    <put folio B to swap cache using S1 >
> > > >                                    // S1 is freed, folio B could use it
> > > >                                    // for swap out with no problem.
> > > >                                    ...
> > > >     folio = filemap_get_folio(S1)
> > > >     // Got folio B here !!!
> > > >     ... < Somehow interrupted again> ...
> > > >                                    <swapin folio B and free S1>
> > > >                                    // Now S1 is free to be used again.
> > > >                                    <swapout src_pte & folio A using S1>
> > > >                                    // Now src_pte is a swap entry pte
> > > >                                    // holding S1 again.
> > > >     folio_trylock(folio)
> > > >     move_swap_pte
> > > >       double_pt_lock
> > > >       is_pte_pages_stable
> > > >       // Check passed because src_pte == S1
> > > >       folio_move_anon_rmap(...)
> > > >       // Moved invalid folio B here !!!
> > > >
> > > > The race window is very short and requires multiple collisions of
> > > > multiple rare events, so it's very unlikely to happen, but with a
> > > > deliberately constructed reproducer and increased time window, it can be
> > > > reproduced [1].
> > >
> > > Thanks for catching and fixing this. Just to clarify a few things
> > > about your reproducer:
> > > 1. Is it necessary for the 'race' mapping to be MAP_SHARED, or
> > > MAP_PRIVATE will work as well?
> > > 2. You mentioned that the 'current dir is on a block device'. Are you
> > > indicating that if we are using zram for swap then it doesn't
> > > reproduce?
> > >
> > > >
> > > > It's also possible that folio (A) is swapped in, and swapped out again
> > > > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > > > swap cache so it needs to be moved too. In this case we should also try
> > > > again so kernel won't miss a folio move.
> > > >
> > > > Fix this by checking if the folio is the valid swap cache folio after
> > > > acquiring the folio lock, and checking the swap cache again after
> > > > acquiring the src_pte lock.
> > > >
> > > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > > > we don't need to worry about that since folios only might get exposed to
> > > > swap cache in the swap out path, and it's covered in this patch too by
> > > > checking the swap cache again after acquiring src_pte lock.
> > > >
> > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > ---
> > > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > > >  1 file changed, 26 insertions(+)
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index bc473ad21202..a1564d205dfb 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -15,6 +15,7 @@
> > > >  #include <linux/mmu_notifier.h>
> > > >  #include <linux/hugetlb.h>
> > > >  #include <linux/shmem_fs.h>
> > > > +#include <linux/delay.h>
> > > I guess you mistakenly left it from your reproducer code :)
> > > >  #include <asm/tlbflush.h>
> > > >  #include <asm/tlb.h>
> > > >  #include "internal.h"
> > > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > >                          struct folio *src_folio)
> > > >  {
> > > > +       swp_entry_t entry;
> > > > +
> > > >         double_pt_lock(dst_ptl, src_ptl);
> > > >
> > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > >         if (src_folio) {
> > > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > +       } else {
> > > > +               /*
> > > > +                * Check again after acquiring the src_pte lock. Or we might
> > > > +                * miss a new loaded swap cache folio.
> > > > +                */
> > > > +               entry = pte_to_swp_entry(orig_src_pte);
> > > > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > > > +                                             swap_cache_index(entry));
> > >
> > > Given the non-trivial overhead of filemap_get_folio(), do you think it
> > > will work if filemap_get_filio() was only once after locking src_ptl?
> > > Please correct me if my assumption about the overhead is wrong.
> >
> > not quite sure as we have a folio_lock(src_folio) before move_swap_pte().
> > can we safely folio_move_anon_rmap + src_folio->index while not holding
> > folio lock?
>
> I think no, we can't even make sure the folio is still in the swap
> cache, so it can be a freed folio that does not belong to any VMA
> while not holding the folio lock.

Right, but will the following be sufficient, given that we don’t really
care about the folio—only whether there’s new cache?

if (READ_ONCE(si->swap_map[offset]) & SWAP_HAS_CACHE) {
        double_pt_unlock(dst_ptl, src_ptl);
        return -EAGAIN;
}
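
(Here si/offset would presumably be derived from the src swap entry,
something like the following; just an illustration, not part of the
suggestion itself:)

    swp_entry_t entry = pte_to_swp_entry(orig_src_pte);
    struct swap_info_struct *si = swp_swap_info(entry);
    pgoff_t offset = swp_offset(entry);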

Thanks
Barry


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-31  6:10     ` Lokesh Gidra
@ 2025-05-31  6:36       ` Kairui Song
  2025-05-31  6:54         ` Barry Song
  0 siblings, 1 reply; 16+ messages in thread
From: Kairui Song @ 2025-05-31  6:36 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Barry Song, linux-mm, Andrew Morton, Peter Xu, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, stable, linux-kernel

On Sat, May 31, 2025 at 2:10 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Fri, May 30, 2025 at 9:42 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, May 31, 2025 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, May 31, 2025 at 8:17 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > > lookup, and try to move the found folio to the faulting vma when.
> > > > Currently, it relies on the PTE value check to ensure the moved folio
> > > > still belongs to the src swap entry, which turns out is not reliable.
> > > >
> > > > While working and reviewing the swap table series with Barry, following
> > > > existing race is observed and reproduced [1]:
> > > >
> > > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > > >
> > > > CPU1                               CPU2
> > > > userfaultfd_move
> > > >   move_pages_pte()
> > > >     entry = pte_to_swp_entry(orig_src_pte);
> > > >     // Here it got entry = S1
> > > >     ... < Somehow interrupted> ...
> > > >                                    <swapin src_pte, alloc and use folio A>
> > > >                                    // folio A is just a new allocated folio
> > > >                                    // and get installed into src_pte
> > > >                                    <frees swap entry S1>
> > > >                                    // src_pte now points to folio A, S1
> > > >                                    // has swap count == 0, it can be freed
> > > >                                    // by folio_swap_swap or swap
> > > >                                    // allocator's reclaim.
> > > >                                    <try to swap out another folio B>
> > > >                                    // folio B is a folio in another VMA.
> > > >                                    <put folio B to swap cache using S1 >
> > > >                                    // S1 is freed, folio B could use it
> > > >                                    // for swap out with no problem.
> > > >                                    ...
> > > >     folio = filemap_get_folio(S1)
> > > >     // Got folio B here !!!
> > > >     ... < Somehow interrupted again> ...
> > > >                                    <swapin folio B and free S1>
> > > >                                    // Now S1 is free to be used again.
> > > >                                    <swapout src_pte & folio A using S1>
> > > >                                    // Now src_pte is a swap entry pte
> > > >                                    // holding S1 again.
> > > >     folio_trylock(folio)
> > > >     move_swap_pte
> > > >       double_pt_lock
> > > >       is_pte_pages_stable
> > > >       // Check passed because src_pte == S1
> > > >       folio_move_anon_rmap(...)
> > > >       // Moved invalid folio B here !!!
> > > >
> > > > The race window is very short and requires multiple collisions of
> > > > multiple rare events, so it's very unlikely to happen, but with a
> > > > deliberately constructed reproducer and increased time window, it can be
> > > > reproduced [1].
> > > >
> > > > It's also possible that folio (A) is swapped in, and swapped out again
> > > > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > > > swap cache so it needs to be moved too. In this case we should also try
> > > > again so kernel won't miss a folio move.
> > > >
> > > > Fix this by checking if the folio is the valid swap cache folio after
> > > > acquiring the folio lock, and checking the swap cache again after
> > > > acquiring the src_pte lock.
> > > >
> > > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > > > we don't need to worry about that since folios only might get exposed to
> > > > swap cache in the swap out path, and it's covered in this patch too by
> > > > checking the swap cache again after acquiring src_pte lock.
> > > >
> > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > ---
> > > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > > >  1 file changed, 26 insertions(+)
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index bc473ad21202..a1564d205dfb 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -15,6 +15,7 @@
> > > >  #include <linux/mmu_notifier.h>
> > > >  #include <linux/hugetlb.h>
> > > >  #include <linux/shmem_fs.h>
> > > > +#include <linux/delay.h>
> > > >  #include <asm/tlbflush.h>
> > > >  #include <asm/tlb.h>
> > > >  #include "internal.h"
> > > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > >                          struct folio *src_folio)
> > > >  {
> > > > +       swp_entry_t entry;
> > > > +
> > > >         double_pt_lock(dst_ptl, src_ptl);
> > > >
> > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > >         if (src_folio) {
> > > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > +       } else {
> > > > +               /*
> > > > +                * Check again after acquiring the src_pte lock. Or we might
> > > > +                * miss a new loaded swap cache folio.
> > > > +                */
> > > > +               entry = pte_to_swp_entry(orig_src_pte);
> > > > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > > > +                                             swap_cache_index(entry));
> > > > +               if (!IS_ERR_OR_NULL(src_folio)) {
> > > > +                       double_pt_unlock(dst_ptl, src_ptl);
> > > > +                       folio_put(src_folio);
> > > > +                       return -EAGAIN;
> > > > +               }
> > > >         }
> > >
> > > step 1: src pte points to a swap entry without swapcache
> > > step 2: we call move_swap_pte()
> > > step 3: someone swap-in src_pte by swap_readhead() and make src_pte's swap entry
> > > have swapcache again - for non-sync/non-zRAM swap device;
> > > step 4: move_swap_pte() gets ptl, move src_pte to dst_pte and *clear* src_pte;
> > > step 5: do_swap_page() for src_pte holds the ptl and found pte has
> > > been cleared in
> > >             step 4; pte_same() returns false;
> > > step 6: do_swap_page() won't map src_pte to the new swapcache got from step 3;
> > >             if the swapcache folio is dropped, it seems everything is fine.
> > >
> > > So the real issue is that do_swap_page() doesn’t drop the new swapcache
> > > even when pte_same() returns false? That means the dst_pte swap-in
> > > can still hit the swap cache entry brought in by the src_pte's swap-in?
> >
> > It seems also possible for the sync zRAM device.
> >
> >  step 1: src pte points to a swap entry S without swapcache
> >  step 2: we call move_swap_pte()
> >  step 3: someone swap-in src_pte by sync path, no swapcache; swap slot
> > S is freed.
> >              -- for zRAM;
> >  step 4: someone swap-out src_pte, get the exactly same swap slot S as step 1,
> >              adds folio to swapcache due to swapout;
> >  step 5: move_swap_pte() gets ptl and finds page tables are stable
> > since swap-out
> >              happens to have the same swap slot as step1;
> >  step 6: we clear src_pte, move src_pte to dst_pte; but miss to move the folio.
> >
> > Yep, we really need to re-check pte for swapcache after holding PTL.
> >
> Any idea what is the overhead of filemap_get_folio()? In particular,
> when no folio exists for the given entry, how costly is it?
>
> Given how rare it is, unless filemap_get_folio() is cheap for 'no
> folio' case, if there is no way to avoid calling it after holding PTL,
> then we should do it only once at that point. If a folio is returned,
> then like in the pte_present() case, we attempt folio_trylock() with
> PTL held, otherwise do the retry dance.

Yeah, I think filemap_get_folio is cheap: each swap cache address space
covers at most 64M, so it walks at most three xa_nodes and returns, without
involving any synchronization or write.

The swap cache lookup will be even cheaper in the future, becoming just a
check of one plain array element.

I can try to fix this with the folio_trylock-inside-the-PTL approach,
maybe the code will be cleaner that way.
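
Roughly something like the following inside move_swap_pte() (just a sketch
of the idea under discussion, untested, with error handling trimmed):

    } else {
            /* PTLs are held, so we must not sleep: only trylock the folio. */
            entry = pte_to_swp_entry(orig_src_pte);
            src_folio = filemap_get_folio(swap_address_space(entry),
                                          swap_cache_index(entry));
            if (!IS_ERR_OR_NULL(src_folio)) {
                    if (!folio_trylock(src_folio)) {
                            double_pt_unlock(dst_ptl, src_ptl);
                            folio_put(src_folio);
                            return -EAGAIN;
                    }
                    /*
                     * Locked: move it like the existing src_folio path does,
                     * then unlock and drop the reference afterwards.
                     */
            }
    }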


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-31  4:41   ` Barry Song
  2025-05-31  6:10     ` Lokesh Gidra
@ 2025-05-31  6:54     ` Kairui Song
  1 sibling, 0 replies; 16+ messages in thread
From: Kairui Song @ 2025-05-31  6:54 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-mm, Andrew Morton, Peter Xu, Suren Baghdasaryan,
	Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable,
	linux-kernel

On Sat, May 31, 2025 at 12:42 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, May 31, 2025 at 8:17 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > lookup, and try to move the found folio to the faulting vma when.
> > > Currently, it relies on the PTE value check to ensure the moved folio
> > > still belongs to the src swap entry, which turns out is not reliable.
> > >
> > > While working and reviewing the swap table series with Barry, following
> > > existing race is observed and reproduced [1]:
> > >
> > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > >
> > > CPU1                               CPU2
> > > userfaultfd_move
> > >   move_pages_pte()
> > >     entry = pte_to_swp_entry(orig_src_pte);
> > >     // Here it got entry = S1
> > >     ... < Somehow interrupted> ...
> > >                                    <swapin src_pte, alloc and use folio A>
> > >                                    // folio A is just a new allocated folio
> > >                                    // and get installed into src_pte
> > >                                    <frees swap entry S1>
> > >                                    // src_pte now points to folio A, S1
> > >                                    // has swap count == 0, it can be freed
> > >                                    // by folio_swap_swap or swap
> > >                                    // allocator's reclaim.
> > >                                    <try to swap out another folio B>
> > >                                    // folio B is a folio in another VMA.
> > >                                    <put folio B to swap cache using S1 >
> > >                                    // S1 is freed, folio B could use it
> > >                                    // for swap out with no problem.
> > >                                    ...
> > >     folio = filemap_get_folio(S1)
> > >     // Got folio B here !!!
> > >     ... < Somehow interrupted again> ...
> > >                                    <swapin folio B and free S1>
> > >                                    // Now S1 is free to be used again.
> > >                                    <swapout src_pte & folio A using S1>
> > >                                    // Now src_pte is a swap entry pte
> > >                                    // holding S1 again.
> > >     folio_trylock(folio)
> > >     move_swap_pte
> > >       double_pt_lock
> > >       is_pte_pages_stable
> > >       // Check passed because src_pte == S1
> > >       folio_move_anon_rmap(...)
> > >       // Moved invalid folio B here !!!
> > >
> > > The race window is very short and requires multiple collisions of
> > > multiple rare events, so it's very unlikely to happen, but with a
> > > deliberately constructed reproducer and increased time window, it can be
> > > reproduced [1].
> > >
> > > It's also possible that folio (A) is swapped in, and swapped out again
> > > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > > swap cache so it needs to be moved too. In this case we should also try
> > > again so kernel won't miss a folio move.
> > >
> > > Fix this by checking if the folio is the valid swap cache folio after
> > > acquiring the folio lock, and checking the swap cache again after
> > > acquiring the src_pte lock.
> > >
> > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > > we don't need to worry about that since folios only might get exposed to
> > > swap cache in the swap out path, and it's covered in this patch too by
> > > checking the swap cache again after acquiring src_pte lock.
> > >
> > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > >  1 file changed, 26 insertions(+)
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index bc473ad21202..a1564d205dfb 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -15,6 +15,7 @@
> > >  #include <linux/mmu_notifier.h>
> > >  #include <linux/hugetlb.h>
> > >  #include <linux/shmem_fs.h>
> > > +#include <linux/delay.h>
> > >  #include <asm/tlbflush.h>
> > >  #include <asm/tlb.h>
> > >  #include "internal.h"
> > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > >                          struct folio *src_folio)
> > >  {
> > > +       swp_entry_t entry;
> > > +
> > >         double_pt_lock(dst_ptl, src_ptl);
> > >
> > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >         if (src_folio) {
> > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > +       } else {
> > > +               /*
> > > +                * Check again after acquiring the src_pte lock. Or we might
> > > +                * miss a new loaded swap cache folio.
> > > +                */
> > > +               entry = pte_to_swp_entry(orig_src_pte);
> > > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > > +                                             swap_cache_index(entry));
> > > +               if (!IS_ERR_OR_NULL(src_folio)) {
> > > +                       double_pt_unlock(dst_ptl, src_ptl);
> > > +                       folio_put(src_folio);
> > > +                       return -EAGAIN;
> > > +               }
> > >         }
> >
> > step 1: src pte points to a swap entry without swapcache
> > step 2: we call move_swap_pte()
> > step 3: someone swap-in src_pte by swap_readhead() and make src_pte's swap entry
> > have swapcache again - for non-sync/non-zRAM swap device;
> > step 4: move_swap_pte() gets ptl, move src_pte to dst_pte and *clear* src_pte;
> > step 5: do_swap_page() for src_pte holds the ptl and found pte has
> > been cleared in
> >             step 4; pte_same() returns false;
> > step 6: do_swap_page() won't map src_pte to the new swapcache got from step 3;
> >             if the swapcache folio is dropped, it seems everything is fine.

Even if it's not dropped, it's fine: the folio doesn't belong to any VMA
in this case, and we don't need to move it.

The problem is when the swapin succeeds before step 4, and another swapout
also succeeds before step 4, so the src folio will be left in the swap
cache, still belonging to the src VMA.

> >
> > So the real issue is that do_swap_page() doesn’t drop the new swapcache
> > even when pte_same() returns false? That means the dst_pte swap-in
> > can still hit the swap cache entry brought in by the src_pte's swap-in?
>
> It seems also possible for the sync zRAM device.
>
>  step 1: src pte points to a swap entry S without swapcache
>  step 2: we call move_swap_pte()
>  step 3: someone swap-in src_pte by sync path, no swapcache; swap slot
> S is freed.
>              -- for zRAM;
>  step 4: someone swap-out src_pte, get the exactly same swap slot S as step 1,
>              adds folio to swapcache due to swapout;
>  step 5: move_swap_pte() gets ptl and finds page tables are stable
> since swap-out
>              happens to have the same swap slot as step1;
>  step 6: we clear src_pte, move src_pte to dst_pte; but miss to move the folio.

Yes, that's exactly the case.

>
> Yep, we really need to re-check pte for swapcache after holding PTL.
>
> >
> > >
> > >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > > @@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > >                                 folio_lock(src_folio);
> > >                                 goto retry;
> > >                         }
> > > +                       /*
> > > +                        * Check if the folio still belongs to the target swap entry after
> > > +                        * acquiring the lock. Folio can be freed in the swap cache while
> > > +                        * not locked.
> > > +                        */
> > > +                       if (unlikely(!folio_test_swapcache(folio) ||
> > > +                                    entry.val != folio->swap.val)) {
> > > +                               err = -EAGAIN;
> > > +                               goto out;
> > > +                       }
> > >                 }
> > >                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > >                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > --
> > > 2.49.0
> > >
> >
>
> Thanks
> Barry
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-31  6:36       ` Kairui Song
@ 2025-05-31  6:54         ` Barry Song
  0 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2025-05-31  6:54 UTC (permalink / raw)
  To: Kairui Song
  Cc: Lokesh Gidra, linux-mm, Andrew Morton, Peter Xu,
	Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, stable,
	linux-kernel

On Sat, May 31, 2025 at 6:36 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 2:10 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > On Fri, May 30, 2025 at 9:42 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, May 31, 2025 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sat, May 31, 2025 at 8:17 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
> > > > > From: Kairui Song <kasong@tencent.com>
> > > > >
> > > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > > > lookup, and try to move the found folio to the faulting vma when.
> > > > > Currently, it relies on the PTE value check to ensure the moved folio
> > > > > still belongs to the src swap entry, which turns out is not reliable.
> > > > >
> > > > > While working and reviewing the swap table series with Barry, following
> > > > > existing race is observed and reproduced [1]:
> > > > >
> > > > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > > > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > > > >
> > > > > CPU1                               CPU2
> > > > > userfaultfd_move
> > > > >   move_pages_pte()
> > > > >     entry = pte_to_swp_entry(orig_src_pte);
> > > > >     // Here it got entry = S1
> > > > >     ... < Somehow interrupted> ...
> > > > >                                    <swapin src_pte, alloc and use folio A>
> > > > >                                    // folio A is just a new allocated folio
> > > > >                                    // and get installed into src_pte
> > > > >                                    <frees swap entry S1>
> > > > >                                    // src_pte now points to folio A, S1
> > > > >                                    // has swap count == 0, it can be freed
> > > > >                                    // by folio_swap_swap or swap
> > > > >                                    // allocator's reclaim.
> > > > >                                    <try to swap out another folio B>
> > > > >                                    // folio B is a folio in another VMA.
> > > > >                                    <put folio B to swap cache using S1 >
> > > > >                                    // S1 is freed, folio B could use it
> > > > >                                    // for swap out with no problem.
> > > > >                                    ...
> > > > >     folio = filemap_get_folio(S1)
> > > > >     // Got folio B here !!!
> > > > >     ... < Somehow interrupted again> ...
> > > > >                                    <swapin folio B and free S1>
> > > > >                                    // Now S1 is free to be used again.
> > > > >                                    <swapout src_pte & folio A using S1>
> > > > >                                    // Now src_pte is a swap entry pte
> > > > >                                    // holding S1 again.
> > > > >     folio_trylock(folio)
> > > > >     move_swap_pte
> > > > >       double_pt_lock
> > > > >       is_pte_pages_stable
> > > > >       // Check passed because src_pte == S1
> > > > >       folio_move_anon_rmap(...)
> > > > >       // Moved invalid folio B here !!!
> > > > >
> > > > > The race window is very short and requires multiple collisions of
> > > > > multiple rare events, so it's very unlikely to happen, but with a
> > > > > deliberately constructed reproducer and increased time window, it can be
> > > > > reproduced [1].
> > > > >
> > > > > It's also possible that folio (A) is swapped in, and swapped out again
> > > > > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > > > > swap cache so it needs to be moved too. In this case we should also try
> > > > > again so kernel won't miss a folio move.
> > > > >
> > > > > Fix this by checking if the folio is the valid swap cache folio after
> > > > > acquiring the folio lock, and checking the swap cache again after
> > > > > acquiring the src_pte lock.
> > > > >
> > > > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > > > > we don't need to worry about that since folios only might get exposed to
> > > > > swap cache in the swap out path, and it's covered in this patch too by
> > > > > checking the swap cache again after acquiring src_pte lock.
> > > > >
> > > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > > ---
> > > > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > > > >  1 file changed, 26 insertions(+)
> > > > >
> > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > index bc473ad21202..a1564d205dfb 100644
> > > > > --- a/mm/userfaultfd.c
> > > > > +++ b/mm/userfaultfd.c
> > > > > @@ -15,6 +15,7 @@
> > > > >  #include <linux/mmu_notifier.h>
> > > > >  #include <linux/hugetlb.h>
> > > > >  #include <linux/shmem_fs.h>
> > > > > +#include <linux/delay.h>
> > > > >  #include <asm/tlbflush.h>
> > > > >  #include <asm/tlb.h>
> > > > >  #include "internal.h"
> > > > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > >                          struct folio *src_folio)
> > > > >  {
> > > > > +       swp_entry_t entry;
> > > > > +
> > > > >         double_pt_lock(dst_ptl, src_ptl);
> > > > >
> > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > >         if (src_folio) {
> > > > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > > > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > +       } else {
> > > > > +               /*
> > > > > +                * Check again after acquiring the src_pte lock. Or we might
> > > > > +                * miss a new loaded swap cache folio.
> > > > > +                */
> > > > > +               entry = pte_to_swp_entry(orig_src_pte);
> > > > > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > > > > +                                             swap_cache_index(entry));
> > > > > +               if (!IS_ERR_OR_NULL(src_folio)) {
> > > > > +                       double_pt_unlock(dst_ptl, src_ptl);
> > > > > +                       folio_put(src_folio);
> > > > > +                       return -EAGAIN;
> > > > > +               }
> > > > >         }
> > > >
> > > > step 1: src pte points to a swap entry without swapcache
> > > > step 2: we call move_swap_pte()
> > > > step 3: someone swap-in src_pte by swap_readhead() and make src_pte's swap entry
> > > > have swapcache again - for non-sync/non-zRAM swap device;
> > > > step 4: move_swap_pte() gets ptl, move src_pte to dst_pte and *clear* src_pte;
> > > > step 5: do_swap_page() for src_pte holds the ptl and found pte has
> > > > been cleared in
> > > >             step 4; pte_same() returns false;
> > > > step 6: do_swap_page() won't map src_pte to the new swapcache got from step 3;
> > > >             if the swapcache folio is dropped, it seems everything is fine.
> > > >
> > > > So the real issue is that do_swap_page() doesn’t drop the new swapcache
> > > > even when pte_same() returns false? That means the dst_pte swap-in
> > > > can still hit the swap cache entry brought in by the src_pte's swap-in?
> > >
> > > It seems also possible for the sync zRAM device.
> > >
> > >  step 1: src pte points to a swap entry S without swapcache
> > >  step 2: we call move_swap_pte()
> > >  step 3: someone swap-in src_pte by sync path, no swapcache; swap slot
> > > S is freed.
> > >              -- for zRAM;
> > >  step 4: someone swap-out src_pte, get the exactly same swap slot S as step 1,
> > >              adds folio to swapcache due to swapout;
> > >  step 5: move_swap_pte() gets ptl and finds page tables are stable
> > > since swap-out
> > >              happens to have the same swap slot as step1;
> > >  step 6: we clear src_pte, move src_pte to dst_pte; but miss to move the folio.
> > >
> > > Yep, we really need to re-check pte for swapcache after holding PTL.
> > >
> > Any idea what is the overhead of filemap_get_folio()? In particular,
> > when no folio exists for the given entry, how costly is it?
> >
> > Given how rare it is, unless filemap_get_folio() is cheap for 'no
> > folio' case, if there is no way to avoid calling it after holding PTL,
> > then we should do it only once at that point. If a folio is returned,
> > then like in the pte_present() case, we attempt folio_trylock() with
> > PTL held, otherwise do the retry dance.
>
> Yeah I think filemap_get_folio is cheap, each swap cache space is at
> most 64M big, so it just walks at most three xa_nodes and returns, not
> involving any synchronization or write.
>
> The swap cache lookup will be even cheaper in the future to be just
> checking one plain array element.
>
> I can try to fix this with the folio_trylock inside the PTL lock
> approach, maybe the code will be cleaner that way.

Using trylock in an atomic context and deferring to a full lock in a
non-atomic context might be considered a workaround — it feels a bit
hackish to me :-)

To follow that approach, the reader might need to understand the entire
workflow from kernel to userspace to grasp what's happening with the folio
lock, whereas the current code handles everything within the kernel, making
it more natural.

Thanks
Barry


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-31  6:35       ` Barry Song
@ 2025-05-31  7:00         ` Kairui Song
  2025-05-31  7:06           ` Barry Song
  0 siblings, 1 reply; 16+ messages in thread
From: Kairui Song @ 2025-05-31  7:00 UTC (permalink / raw)
  To: Barry Song
  Cc: Lokesh Gidra, linux-mm, Andrew Morton, Peter Xu,
	Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, stable,
	linux-kernel

On Sat, May 31, 2025 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 6:25 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Sat, May 31, 2025 at 11:39 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, May 31, 2025 at 11:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > >
> > > > On Fri, May 30, 2025 at 1:17 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
> > > > > From: Kairui Song <kasong@tencent.com>
> > > > >
> > > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > > > lookup, and try to move the found folio to the faulting vma when.
> > > > > Currently, it relies on the PTE value check to ensure the moved folio
> > > > > still belongs to the src swap entry, which turns out is not reliable.
> > > > >
> > > > > While working and reviewing the swap table series with Barry, following
> > > > > existing race is observed and reproduced [1]:
> > > > >
> > > > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > > > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > > > >
> > > > > CPU1                               CPU2
> > > > > userfaultfd_move
> > > > >   move_pages_pte()
> > > > >     entry = pte_to_swp_entry(orig_src_pte);
> > > > >     // Here it got entry = S1
> > > > >     ... < Somehow interrupted> ...
> > > > >                                    <swapin src_pte, alloc and use folio A>
> > > > >                                    // folio A is just a new allocated folio
> > > > >                                    // and get installed into src_pte
> > > > >                                    <frees swap entry S1>
> > > > >                                    // src_pte now points to folio A, S1
> > > > >                                    // has swap count == 0, it can be freed
> > > > >                                    // by folio_swap_swap or swap
> > > > >                                    // allocator's reclaim.
> > > > >                                    <try to swap out another folio B>
> > > > >                                    // folio B is a folio in another VMA.
> > > > >                                    <put folio B to swap cache using S1 >
> > > > >                                    // S1 is freed, folio B could use it
> > > > >                                    // for swap out with no problem.
> > > > >                                    ...
> > > > >     folio = filemap_get_folio(S1)
> > > > >     // Got folio B here !!!
> > > > >     ... < Somehow interrupted again> ...
> > > > >                                    <swapin folio B and free S1>
> > > > >                                    // Now S1 is free to be used again.
> > > > >                                    <swapout src_pte & folio A using S1>
> > > > >                                    // Now src_pte is a swap entry pte
> > > > >                                    // holding S1 again.
> > > > >     folio_trylock(folio)
> > > > >     move_swap_pte
> > > > >       double_pt_lock
> > > > >       is_pte_pages_stable
> > > > >       // Check passed because src_pte == S1
> > > > >       folio_move_anon_rmap(...)
> > > > >       // Moved invalid folio B here !!!
> > > > >
> > > > > The race window is very short and requires multiple collisions of
> > > > > multiple rare events, so it's very unlikely to happen, but with a
> > > > > deliberately constructed reproducer and increased time window, it can be
> > > > > reproduced [1].
> > > >
> > > > Thanks for catching and fixing this. Just to clarify a few things
> > > > about your reproducer:
> > > > 1. Is it necessary for the 'race' mapping to be MAP_SHARED, or
> > > > MAP_PRIVATE will work as well?
> > > > 2. You mentioned that the 'current dir is on a block device'. Are you
> > > > indicating that if we are using zram for swap then it doesn't
> > > > reproduce?
> > > >
> > > > >
> > > > > It's also possible that folio (A) is swapped in, and swapped out again
> > > > > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > > > > swap cache so it needs to be moved too. In this case we should also try
> > > > > again so kernel won't miss a folio move.
> > > > >
> > > > > Fix this by checking if the folio is the valid swap cache folio after
> > > > > acquiring the folio lock, and checking the swap cache again after
> > > > > acquiring the src_pte lock.
> > > > >
> > > > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > > > > we don't need to worry about that since folios only might get exposed to
> > > > > swap cache in the swap out path, and it's covered in this patch too by
> > > > > checking the swap cache again after acquiring src_pte lock.
> > > > >
> > > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > > ---
> > > > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > > > >  1 file changed, 26 insertions(+)
> > > > >
> > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > index bc473ad21202..a1564d205dfb 100644
> > > > > --- a/mm/userfaultfd.c
> > > > > +++ b/mm/userfaultfd.c
> > > > > @@ -15,6 +15,7 @@
> > > > >  #include <linux/mmu_notifier.h>
> > > > >  #include <linux/hugetlb.h>
> > > > >  #include <linux/shmem_fs.h>
> > > > > +#include <linux/delay.h>
> > > > I guess you mistakenly left it from your reproducer code :)
> > > > >  #include <asm/tlbflush.h>
> > > > >  #include <asm/tlb.h>
> > > > >  #include "internal.h"
> > > > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > >                          struct folio *src_folio)
> > > > >  {
> > > > > +       swp_entry_t entry;
> > > > > +
> > > > >         double_pt_lock(dst_ptl, src_ptl);
> > > > >
> > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > >         if (src_folio) {
> > > > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > > > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > +       } else {
> > > > > +               /*
> > > > > +                * Check again after acquiring the src_pte lock. Or we might
> > > > > +                * miss a new loaded swap cache folio.
> > > > > +                */
> > > > > +               entry = pte_to_swp_entry(orig_src_pte);
> > > > > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > > > > +                                             swap_cache_index(entry));
> > > >
> > > > Given the non-trivial overhead of filemap_get_folio(), do you think it
> > > > will work if filemap_get_filio() was only once after locking src_ptl?
> > > > Please correct me if my assumption about the overhead is wrong.
> > >
> > > not quite sure as we have a folio_lock(src_folio) before move_swap_pte().
> > > can we safely folio_move_anon_rmap + src_folio->index while not holding
> > > folio lock?
> >
> > I think no, we can't even make sure the folio is still in the swap
> > cache, so it can be a freed folio that does not belong to any VMA
> > while not holding the folio lock.
>
> Right, but will the following be sufficient, given that we don’t really
> care about the folio—only whether there’s new cache?
>
> if (READ_ONCE(si->swap_map[offset]) & SWAP_HAS_CACHE) {
>              double_pt_unlock(dst_ptl, src_ptl);
>              return -EAGAIN;
> }

The problem is that reading swap_map without locking the cluster seems
unstable and has strange false positives: a swapin will set this bit first,
before adding the folio to the swap cache, or without ever adding it when
the swap cache is skipped, which seems like it could make things more
complex.
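
For example, roughly (only to illustrate the false positive, not the exact
call chain):

    /*
     * CPU1 (uffd move, PTLs held)       CPU2 (concurrent swapin of the entry)
     *
     *                                    swapcache_prepare()
     *                                    // SWAP_HAS_CACHE is set before any
     *                                    // folio is added to the swap cache,
     *                                    // and nothing is ever added on the
     *                                    // SWP_SYNCHRONOUS_IO skip-cache path
     *  READ_ONCE(si->swap_map[offset])
     *    & SWAP_HAS_CACHE -> true
     *    => spurious -EAGAIN
     */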


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-31  7:00         ` Kairui Song
@ 2025-05-31  7:06           ` Barry Song
  2025-05-31  7:11             ` Barry Song
  2025-05-31  7:11             ` Lokesh Gidra
  0 siblings, 2 replies; 16+ messages in thread
From: Barry Song @ 2025-05-31  7:06 UTC (permalink / raw)
  To: Kairui Song
  Cc: Lokesh Gidra, linux-mm, Andrew Morton, Peter Xu,
	Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, stable,
	linux-kernel

On Sat, May 31, 2025 at 7:00 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, May 31, 2025 at 6:25 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Sat, May 31, 2025 at 11:39 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sat, May 31, 2025 at 11:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > > >
> > > > > On Fri, May 30, 2025 at 1:17 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > >
> > > > > > From: Kairui Song <kasong@tencent.com>
> > > > > >
> > > > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > > > > lookup, and try to move the found folio to the faulting vma when.
> > > > > > Currently, it relies on the PTE value check to ensure the moved folio
> > > > > > still belongs to the src swap entry, which turns out is not reliable.
> > > > > >
> > > > > > While working and reviewing the swap table series with Barry, following
> > > > > > existing race is observed and reproduced [1]:
> > > > > >
> > > > > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > > > > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > > > > >
> > > > > > CPU1                               CPU2
> > > > > > userfaultfd_move
> > > > > >   move_pages_pte()
> > > > > >     entry = pte_to_swp_entry(orig_src_pte);
> > > > > >     // Here it got entry = S1
> > > > > >     ... < Somehow interrupted> ...
> > > > > >                                    <swapin src_pte, alloc and use folio A>
> > > > > >                                    // folio A is just a new allocated folio
> > > > > >                                    // and get installed into src_pte
> > > > > >                                    <frees swap entry S1>
> > > > > >                                    // src_pte now points to folio A, S1
> > > > > >                                    // has swap count == 0, it can be freed
> > > > > >                                    // by folio_swap_swap or swap
> > > > > >                                    // allocator's reclaim.
> > > > > >                                    <try to swap out another folio B>
> > > > > >                                    // folio B is a folio in another VMA.
> > > > > >                                    <put folio B to swap cache using S1 >
> > > > > >                                    // S1 is freed, folio B could use it
> > > > > >                                    // for swap out with no problem.
> > > > > >                                    ...
> > > > > >     folio = filemap_get_folio(S1)
> > > > > >     // Got folio B here !!!
> > > > > >     ... < Somehow interrupted again> ...
> > > > > >                                    <swapin folio B and free S1>
> > > > > >                                    // Now S1 is free to be used again.
> > > > > >                                    <swapout src_pte & folio A using S1>
> > > > > >                                    // Now src_pte is a swap entry pte
> > > > > >                                    // holding S1 again.
> > > > > >     folio_trylock(folio)
> > > > > >     move_swap_pte
> > > > > >       double_pt_lock
> > > > > >       is_pte_pages_stable
> > > > > >       // Check passed because src_pte == S1
> > > > > >       folio_move_anon_rmap(...)
> > > > > >       // Moved invalid folio B here !!!
> > > > > >
> > > > > > The race window is very short and requires multiple collisions of
> > > > > > multiple rare events, so it's very unlikely to happen, but with a
> > > > > > deliberately constructed reproducer and increased time window, it can be
> > > > > > reproduced [1].
> > > > >
> > > > > Thanks for catching and fixing this. Just to clarify a few things
> > > > > about your reproducer:
> > > > > 1. Is it necessary for the 'race' mapping to be MAP_SHARED, or
> > > > > MAP_PRIVATE will work as well?
> > > > > 2. You mentioned that the 'current dir is on a block device'. Are you
> > > > > indicating that if we are using zram for swap then it doesn't
> > > > > reproduce?
> > > > >
> > > > > >
> > > > > > It's also possible that folio (A) is swapped in, and swapped out again
> > > > > > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > > > > > swap cache so it needs to be moved too. In this case we should also try
> > > > > > again so kernel won't miss a folio move.
> > > > > >
> > > > > > Fix this by checking if the folio is the valid swap cache folio after
> > > > > > acquiring the folio lock, and checking the swap cache again after
> > > > > > acquiring the src_pte lock.
> > > > > >
> > > > > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > > > > > we don't need to worry about that since folios only might get exposed to
> > > > > > swap cache in the swap out path, and it's covered in this patch too by
> > > > > > checking the swap cache again after acquiring src_pte lock.
> > > > > >
> > > > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > > > ---
> > > > > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > > > > >  1 file changed, 26 insertions(+)
> > > > > >
> > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > index bc473ad21202..a1564d205dfb 100644
> > > > > > --- a/mm/userfaultfd.c
> > > > > > +++ b/mm/userfaultfd.c
> > > > > > @@ -15,6 +15,7 @@
> > > > > >  #include <linux/mmu_notifier.h>
> > > > > >  #include <linux/hugetlb.h>
> > > > > >  #include <linux/shmem_fs.h>
> > > > > > +#include <linux/delay.h>
> > > > > I guess you mistakenly left it from your reproducer code :)
> > > > > >  #include <asm/tlbflush.h>
> > > > > >  #include <asm/tlb.h>
> > > > > >  #include "internal.h"
> > > > > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > > >                          struct folio *src_folio)
> > > > > >  {
> > > > > > +       swp_entry_t entry;
> > > > > > +
> > > > > >         double_pt_lock(dst_ptl, src_ptl);
> > > > > >
> > > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > > >         if (src_folio) {
> > > > > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > > > > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > > +       } else {
> > > > > > +               /*
> > > > > > +                * Check again after acquiring the src_pte lock. Or we might
> > > > > > +                * miss a new loaded swap cache folio.
> > > > > > +                */
> > > > > > +               entry = pte_to_swp_entry(orig_src_pte);
> > > > > > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > > > > > +                                             swap_cache_index(entry));
> > > > >
> > > > > Given the non-trivial overhead of filemap_get_folio(), do you think it
> > > > > will work if filemap_get_filio() was only once after locking src_ptl?
> > > > > Please correct me if my assumption about the overhead is wrong.
> > > >
> > > > not quite sure as we have a folio_lock(src_folio) before move_swap_pte().
> > > > can we safely folio_move_anon_rmap + src_folio->index while not holding
> > > > folio lock?
> > >
> > > I think no, we can't even make sure the folio is still in the swap
> > > cache, so it can be a freed folio that does not belong to any VMA
> > > while not holding the folio lock.
> >
> > Right, but will the following be sufficient, given that we don’t really
> > care about the folio—only whether there’s new cache?
> >
> > if (READ_ONCE(si->swap_map[offset]) & SWAP_HAS_CACHE) {
> >              double_pt_unlock(dst_ptl, src_ptl);
> >              return -EAGAIN;
> > }
>
> The problem is reading swap_map without locking the cluster map seems
> unstable, and has strange false positives, a swapin will set this bit
> first, while not adding the folio to swap cache or even when skipping
> the swap cache, that seems could make it more complex.

As long as it's a false positive and not a false negative, I think it's
acceptable, especially if we're concerned about the overhead of
filemap_get_folio. The probability of the race actually occurring is
extremely low (practically close to 0%), yet we still end up calling
filemap_get_folio for every swap PTE.

Thanks
Barry



* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-31  7:06           ` Barry Song
@ 2025-05-31  7:11             ` Barry Song
  2025-05-31  7:11             ` Lokesh Gidra
  1 sibling, 0 replies; 16+ messages in thread
From: Barry Song @ 2025-05-31  7:11 UTC (permalink / raw)
  To: Kairui Song
  Cc: Lokesh Gidra, linux-mm, Andrew Morton, Peter Xu,
	Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, stable,
	linux-kernel

On Sat, May 31, 2025 at 7:06 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 7:00 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Sat, May 31, 2025 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, May 31, 2025 at 6:25 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Sat, May 31, 2025 at 11:39 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Sat, May 31, 2025 at 11:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > > > >
> > > > > > On Fri, May 30, 2025 at 1:17 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > >
> > > > > > > From: Kairui Song <kasong@tencent.com>
> > > > > > >
> > > > > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > > > > > lookup, and try to move the found folio to the faulting vma when.
> > > > > > > Currently, it relies on the PTE value check to ensure the moved folio
> > > > > > > still belongs to the src swap entry, which turns out is not reliable.
> > > > > > >
> > > > > > > While working and reviewing the swap table series with Barry, following
> > > > > > > existing race is observed and reproduced [1]:
> > > > > > >
> > > > > > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > > > > > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > > > > > >
> > > > > > > CPU1                               CPU2
> > > > > > > userfaultfd_move
> > > > > > >   move_pages_pte()
> > > > > > >     entry = pte_to_swp_entry(orig_src_pte);
> > > > > > >     // Here it got entry = S1
> > > > > > >     ... < Somehow interrupted> ...
> > > > > > >                                    <swapin src_pte, alloc and use folio A>
> > > > > > >                                    // folio A is just a new allocated folio
> > > > > > >                                    // and get installed into src_pte
> > > > > > >                                    <frees swap entry S1>
> > > > > > >                                    // src_pte now points to folio A, S1
> > > > > > >                                    // has swap count == 0, it can be freed
> > > > > > >                                    // by folio_swap_swap or swap
> > > > > > >                                    // allocator's reclaim.
> > > > > > >                                    <try to swap out another folio B>
> > > > > > >                                    // folio B is a folio in another VMA.
> > > > > > >                                    <put folio B to swap cache using S1 >
> > > > > > >                                    // S1 is freed, folio B could use it
> > > > > > >                                    // for swap out with no problem.
> > > > > > >                                    ...
> > > > > > >     folio = filemap_get_folio(S1)
> > > > > > >     // Got folio B here !!!
> > > > > > >     ... < Somehow interrupted again> ...
> > > > > > >                                    <swapin folio B and free S1>
> > > > > > >                                    // Now S1 is free to be used again.
> > > > > > >                                    <swapout src_pte & folio A using S1>
> > > > > > >                                    // Now src_pte is a swap entry pte
> > > > > > >                                    // holding S1 again.
> > > > > > >     folio_trylock(folio)
> > > > > > >     move_swap_pte
> > > > > > >       double_pt_lock
> > > > > > >       is_pte_pages_stable
> > > > > > >       // Check passed because src_pte == S1
> > > > > > >       folio_move_anon_rmap(...)
> > > > > > >       // Moved invalid folio B here !!!
> > > > > > >
> > > > > > > The race window is very short and requires multiple collisions of
> > > > > > > multiple rare events, so it's very unlikely to happen, but with a
> > > > > > > deliberately constructed reproducer and increased time window, it can be
> > > > > > > reproduced [1].
> > > > > >
> > > > > > Thanks for catching and fixing this. Just to clarify a few things
> > > > > > about your reproducer:
> > > > > > 1. Is it necessary for the 'race' mapping to be MAP_SHARED, or
> > > > > > MAP_PRIVATE will work as well?
> > > > > > 2. You mentioned that the 'current dir is on a block device'. Are you
> > > > > > indicating that if we are using zram for swap then it doesn't
> > > > > > reproduce?
> > > > > >
> > > > > > >
> > > > > > > It's also possible that folio (A) is swapped in, and swapped out again
> > > > > > > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > > > > > > swap cache so it needs to be moved too. In this case we should also try
> > > > > > > again so kernel won't miss a folio move.
> > > > > > >
> > > > > > > Fix this by checking if the folio is the valid swap cache folio after
> > > > > > > acquiring the folio lock, and checking the swap cache again after
> > > > > > > acquiring the src_pte lock.
> > > > > > >
> > > > > > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > > > > > > we don't need to worry about that since folios only might get exposed to
> > > > > > > swap cache in the swap out path, and it's covered in this patch too by
> > > > > > > checking the swap cache again after acquiring src_pte lock.
> > > > > > >
> > > > > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > > > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > > > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > > > > ---
> > > > > > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > > > > > >  1 file changed, 26 insertions(+)
> > > > > > >
> > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > index bc473ad21202..a1564d205dfb 100644
> > > > > > > --- a/mm/userfaultfd.c
> > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > @@ -15,6 +15,7 @@
> > > > > > >  #include <linux/mmu_notifier.h>
> > > > > > >  #include <linux/hugetlb.h>
> > > > > > >  #include <linux/shmem_fs.h>
> > > > > > > +#include <linux/delay.h>
> > > > > > I guess you mistakenly left it from your reproducer code :)
> > > > > > >  #include <asm/tlbflush.h>
> > > > > > >  #include <asm/tlb.h>
> > > > > > >  #include "internal.h"
> > > > > > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > > > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > > > >                          struct folio *src_folio)
> > > > > > >  {
> > > > > > > +       swp_entry_t entry;
> > > > > > > +
> > > > > > >         double_pt_lock(dst_ptl, src_ptl);
> > > > > > >
> > > > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > > > >         if (src_folio) {
> > > > > > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > > > > > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > > > +       } else {
> > > > > > > +               /*
> > > > > > > +                * Check again after acquiring the src_pte lock. Or we might
> > > > > > > +                * miss a new loaded swap cache folio.
> > > > > > > +                */
> > > > > > > +               entry = pte_to_swp_entry(orig_src_pte);
> > > > > > > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > > > > > > +                                             swap_cache_index(entry));
> > > > > >
> > > > > > Given the non-trivial overhead of filemap_get_folio(), do you think it
> > > > > > will work if filemap_get_filio() was only once after locking src_ptl?
> > > > > > Please correct me if my assumption about the overhead is wrong.
> > > > >
> > > > > not quite sure as we have a folio_lock(src_folio) before move_swap_pte().
> > > > > can we safely folio_move_anon_rmap + src_folio->index while not holding
> > > > > folio lock?
> > > >
> > > > I think no, we can't even make sure the folio is still in the swap
> > > > cache, so it can be a freed folio that does not belong to any VMA
> > > > while not holding the folio lock.
> > >
> > > Right, but will the following be sufficient, given that we don’t really
> > > care about the folio—only whether there’s new cache?
> > >
> > > if (READ_ONCE(si->swap_map[offset]) & SWAP_HAS_CACHE) {
> > >              double_pt_unlock(dst_ptl, src_ptl);
> > >              return -EAGAIN;
> > > }
> >
> > The problem is reading swap_map without locking the cluster map seems
> > unstable, and has strange false positives, a swapin will set this bit
> > first, while not adding the folio to swap cache or even when skipping
> > the swap cache, that seems could make it more complex.
>
> As long as it's a false positive and not a false negative, I think it's
> acceptable—especially if we're concerned about the overhead of
> filemap_get_folio. The probability is extremely low (practically close
> to 0%), but we still need to call filemap_get_folio for every swap PTE.

By the way, if we do want to eliminate the false positives, we could do
the following:

if (READ_ONCE(si->swap_map[offset]) & SWAP_HAS_CACHE) {
	/*
	 * Rough sketch (si/offset derived from the src swap entry):
	 * only do the lookup when the racy swap_map hint suggests one.
	 */
	folio = filemap_get_folio(swap_address_space(entry),
				  swap_cache_index(entry));
	if (!IS_ERR(folio)) {
		folio_put(folio);
		double_pt_unlock(dst_ptl, src_ptl);
		return -EAGAIN;
	}
}
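
(In the sketch above, the reference taken by filemap_get_folio is
dropped right away with folio_put, since all we care about is whether
any swap cache folio exists at all; the actual move is then retried via
the -EAGAIN path.)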

This might eliminate 99.999999...% of filemap_get_folio calls, though my
gut feeling is that it might not be necessary :-)

Thanks
Barry



* Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
  2025-05-31  7:06           ` Barry Song
  2025-05-31  7:11             ` Barry Song
@ 2025-05-31  7:11             ` Lokesh Gidra
  1 sibling, 0 replies; 16+ messages in thread
From: Lokesh Gidra @ 2025-05-31  7:11 UTC (permalink / raw)
  To: Barry Song
  Cc: Kairui Song, linux-mm, Andrew Morton, Peter Xu,
	Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, stable,
	linux-kernel

On Sat, May 31, 2025 at 12:06 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 7:00 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Sat, May 31, 2025 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, May 31, 2025 at 6:25 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Sat, May 31, 2025 at 11:39 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Sat, May 31, 2025 at 11:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > > > >
> > > > > > On Fri, May 30, 2025 at 1:17 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > >
> > > > > > > From: Kairui Song <kasong@tencent.com>
> > > > > > >
> > > > > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > > > > > lookup, and try to move the found folio to the faulting vma when.
> > > > > > > Currently, it relies on the PTE value check to ensure the moved folio
> > > > > > > still belongs to the src swap entry, which turns out is not reliable.
> > > > > > >
> > > > > > > While working and reviewing the swap table series with Barry, following
> > > > > > > existing race is observed and reproduced [1]:
> > > > > > >
> > > > > > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > > > > > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > > > > > >
> > > > > > > CPU1                               CPU2
> > > > > > > userfaultfd_move
> > > > > > >   move_pages_pte()
> > > > > > >     entry = pte_to_swp_entry(orig_src_pte);
> > > > > > >     // Here it got entry = S1
> > > > > > >     ... < Somehow interrupted> ...
> > > > > > >                                    <swapin src_pte, alloc and use folio A>
> > > > > > >                                    // folio A is just a new allocated folio
> > > > > > >                                    // and get installed into src_pte
> > > > > > >                                    <frees swap entry S1>
> > > > > > >                                    // src_pte now points to folio A, S1
> > > > > > >                                    // has swap count == 0, it can be freed
> > > > > > >                                    // by folio_swap_swap or swap
> > > > > > >                                    // allocator's reclaim.
> > > > > > >                                    <try to swap out another folio B>
> > > > > > >                                    // folio B is a folio in another VMA.
> > > > > > >                                    <put folio B to swap cache using S1 >
> > > > > > >                                    // S1 is freed, folio B could use it
> > > > > > >                                    // for swap out with no problem.
> > > > > > >                                    ...
> > > > > > >     folio = filemap_get_folio(S1)
> > > > > > >     // Got folio B here !!!
> > > > > > >     ... < Somehow interrupted again> ...
> > > > > > >                                    <swapin folio B and free S1>
> > > > > > >                                    // Now S1 is free to be used again.
> > > > > > >                                    <swapout src_pte & folio A using S1>
> > > > > > >                                    // Now src_pte is a swap entry pte
> > > > > > >                                    // holding S1 again.
> > > > > > >     folio_trylock(folio)
> > > > > > >     move_swap_pte
> > > > > > >       double_pt_lock
> > > > > > >       is_pte_pages_stable
> > > > > > >       // Check passed because src_pte == S1
> > > > > > >       folio_move_anon_rmap(...)
> > > > > > >       // Moved invalid folio B here !!!
> > > > > > >
> > > > > > > The race window is very short and requires multiple collisions of
> > > > > > > multiple rare events, so it's very unlikely to happen, but with a
> > > > > > > deliberately constructed reproducer and increased time window, it can be
> > > > > > > reproduced [1].
> > > > > >
> > > > > > Thanks for catching and fixing this. Just to clarify a few things
> > > > > > about your reproducer:
> > > > > > 1. Is it necessary for the 'race' mapping to be MAP_SHARED, or
> > > > > > MAP_PRIVATE will work as well?
> > > > > > 2. You mentioned that the 'current dir is on a block device'. Are you
> > > > > > indicating that if we are using zram for swap then it doesn't
> > > > > > reproduce?
> > > > > >
> > > > > > >
> > > > > > > It's also possible that folio (A) is swapped in, and swapped out again
> > > > > > > after the filemap_get_folio lookup, in such case folio (A) may stay in
> > > > > > > swap cache so it needs to be moved too. In this case we should also try
> > > > > > > again so kernel won't miss a folio move.
> > > > > > >
> > > > > > > Fix this by checking if the folio is the valid swap cache folio after
> > > > > > > acquiring the folio lock, and checking the swap cache again after
> > > > > > > acquiring the src_pte lock.
> > > > > > >
> > > > > > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> > > > > > > we don't need to worry about that since folios only might get exposed to
> > > > > > > swap cache in the swap out path, and it's covered in this patch too by
> > > > > > > checking the swap cache again after acquiring src_pte lock.
> > > > > > >
> > > > > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > > > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > > > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > > > > ---
> > > > > > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > > > > > >  1 file changed, 26 insertions(+)
> > > > > > >
> > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > index bc473ad21202..a1564d205dfb 100644
> > > > > > > --- a/mm/userfaultfd.c
> > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > @@ -15,6 +15,7 @@
> > > > > > >  #include <linux/mmu_notifier.h>
> > > > > > >  #include <linux/hugetlb.h>
> > > > > > >  #include <linux/shmem_fs.h>
> > > > > > > +#include <linux/delay.h>
> > > > > > I guess you mistakenly left it from your reproducer code :)
> > > > > > >  #include <asm/tlbflush.h>
> > > > > > >  #include <asm/tlb.h>
> > > > > > >  #include "internal.h"
> > > > > > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > > > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > > > >                          struct folio *src_folio)
> > > > > > >  {
> > > > > > > +       swp_entry_t entry;
> > > > > > > +
> > > > > > >         double_pt_lock(dst_ptl, src_ptl);
> > > > > > >
> > > > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > > > >         if (src_folio) {
> > > > > > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > > > > > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > > > +       } else {
> > > > > > > +               /*
> > > > > > > +                * Check again after acquiring the src_pte lock. Or we might
> > > > > > > +                * miss a new loaded swap cache folio.
> > > > > > > +                */
> > > > > > > +               entry = pte_to_swp_entry(orig_src_pte);
> > > > > > > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > > > > > > +                                             swap_cache_index(entry));
> > > > > >
> > > > > > Given the non-trivial overhead of filemap_get_folio(), do you think it
> > > > > > will work if filemap_get_filio() was only once after locking src_ptl?
> > > > > > Please correct me if my assumption about the overhead is wrong.
> > > > >
> > > > > not quite sure as we have a folio_lock(src_folio) before move_swap_pte().
> > > > > can we safely folio_move_anon_rmap + src_folio->index while not holding
> > > > > folio lock?
> > > >
> > > > I think no, we can't even make sure the folio is still in the swap
> > > > cache, so it can be a freed folio that does not belong to any VMA
> > > > while not holding the folio lock.
> > >
> > > Right, but will the following be sufficient, given that we don’t really
> > > care about the folio—only whether there’s new cache?
> > >
> > > if (READ_ONCE(si->swap_map[offset]) & SWAP_HAS_CACHE) {
> > >              double_pt_unlock(dst_ptl, src_ptl);
> > >              return -EAGAIN;
> > > }
> >
> > The problem is reading swap_map without locking the cluster map seems
> > unstable, and has strange false positives, a swapin will set this bit
> > first, while not adding the folio to swap cache or even when skipping
> > the swap cache, that seems could make it more complex.
>
> As long as it's a false positive and not a false negative, I think it's
> acceptable—especially if we're concerned about the overhead of
> filemap_get_folio. The probability is extremely low (practically close
> to 0%), but we still need to call filemap_get_folio for every swap PTE.
>
That's exactly my concern too. A retry or -EAGAIN on rare false
positives is acceptable. But adding another filemap_get_folio call, and
doing it while the PTLs for both src and dst are held, is not cheap.
Consider that in a multi-threaded application many threads could be
blocked on the same PTL, so keeping that critical section as short as
possible is desirable.

> Thanks
> Barry




Thread overview: 16+ messages
2025-05-30 20:17 [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache Kairui Song
2025-05-30 23:40 ` Lokesh Gidra
2025-05-31  3:37   ` Barry Song
2025-05-31  6:25     ` Kairui Song
2025-05-31  6:35       ` Barry Song
2025-05-31  7:00         ` Kairui Song
2025-05-31  7:06           ` Barry Song
2025-05-31  7:11             ` Barry Song
2025-05-31  7:11             ` Lokesh Gidra
2025-05-31  6:22   ` Kairui Song
2025-05-31  4:04 ` Barry Song
2025-05-31  4:41   ` Barry Song
2025-05-31  6:10     ` Lokesh Gidra
2025-05-31  6:36       ` Kairui Song
2025-05-31  6:54         ` Barry Song
2025-05-31  6:54     ` Kairui Song
