Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Peter Xu <peterx@redhat.com>
To: Kairui Song <kasong@tencent.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	Barry Song <21cnbao@gmail.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	Lokesh Gidra <lokeshgidra@google.com>,
	stable@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache
Date: Mon, 2 Jun 2025 16:34:16 -0400	[thread overview]
Message-ID: <aD4KyHz_H5WPLLf4@x1.local> (raw)
In-Reply-To: <20250602181419.20478-1-ryncsn@gmail.com>

On Tue, Jun 03, 2025 at 02:14:19AM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> lookup, and try to move the found folio to the faulting vma when.
> Currently, it relies on the PTE value check to ensure the moved folio
> still belongs to the src swap entry, which turns out is not reliable.
> 
> While working and reviewing the swap table series with Barry, following
> existing race is observed and reproduced [1]:
> 
> ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
>  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> 
> CPU1                               CPU2
> userfaultfd_move
>   move_pages_pte()
>     entry = pte_to_swp_entry(orig_src_pte);
>     // Here it got entry = S1
>     ... < Somehow interrupted> ...
>                                    <swapin src_pte, alloc and use folio A>
>                                    // folio A is just a new allocated folio
>                                    // and get installed into src_pte
>                                    <frees swap entry S1>
>                                    // src_pte now points to folio A, S1
>                                    // has swap count == 0, it can be freed
>                                    // by folio_swap_swap or swap
>                                    // allocator's reclaim.
>                                    <try to swap out another folio B>
>                                    // folio B is a folio in another VMA.
>                                    <put folio B to swap cache using S1 >
>                                    // S1 is freed, folio B could use it
>                                    // for swap out with no problem.
>                                    ...
>     folio = filemap_get_folio(S1)
>     // Got folio B here !!!
>     ... < Somehow interrupted again> ...
>                                    <swapin folio B and free S1>
>                                    // Now S1 is free to be used again.
>                                    <swapout src_pte & folio A using S1>
>                                    // Now src_pte is a swap entry pte
>                                    // holding S1 again.
>     folio_trylock(folio)
>     move_swap_pte
>       double_pt_lock
>       is_pte_pages_stable
>       // Check passed because src_pte == S1
>       folio_move_anon_rmap(...)
>       // Moved invalid folio B here !!!
> 
> The race window is very short and requires multiple collisions of
> multiple rare events, so it's very unlikely to happen, but with a
> deliberately constructed reproducer and increased time window, it can be
> reproduced [1].
> 
> It's also possible that folio (A) is swapped in, and swapped out again
> after the filemap_get_folio lookup, in such case folio (A) may stay in
> swap cache so it needs to be moved too. In this case we should also try
> again so kernel won't miss a folio move.
> 
> Fix this by checking if the folio is the valid swap cache folio after
> acquiring the folio lock, and checking the swap cache again after
> acquiring the src_pte lock.
> 
> SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
> we don't need to worry about that since folios only might get exposed to
> swap cache in the swap out path, and it's covered in this patch too by
> checking the swap cache again after acquiring src_pte lock.

[1]

> 
> Testing with a simple C program to allocate and move several GB of memory
> did not show any observable performance change.
> 
> Cc: <stable@vger.kernel.org>
> Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> Signed-off-by: Kairui Song <kasong@tencent.com>
> 
> ---
> 
> V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> Changes:
> - Check swap_map instead of doing a filemap lookup after acquiring the
>   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
> 
> V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> Changes:
> - Move the folio and swap check inside move_swap_pte to avoid skipping
>   the check and potential overhead [ Lokesh Gidra ]
> - Add a READ_ONCE for the swap_map read to ensure it reads a up to dated
>   value.
> 
>  mm/userfaultfd.c | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..5dc05346e360 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>  			 pte_t orig_dst_pte, pte_t orig_src_pte,
>  			 pmd_t *dst_pmd, pmd_t dst_pmdval,
>  			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
> -			 struct folio *src_folio)
> +			 struct folio *src_folio,
> +			 struct swap_info_struct *si, swp_entry_t entry)
>  {
> +	/*
> +	 * Check if the folio still belongs to the target swap entry after
> +	 * acquiring the lock. Folio can be freed in the swap cache while
> +	 * not locked.
> +	 */
> +	if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> +				  entry.val != src_folio->swap.val))
> +		return -EAGAIN;
> +
>  	double_pt_lock(dst_ptl, src_ptl);
>  
>  	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>  	if (src_folio) {
>  		folio_move_anon_rmap(src_folio, dst_vma);
>  		src_folio->index = linear_page_index(dst_vma, dst_addr);
> +	} else {
> +		/*
> +		 * Check if the swap entry is cached after acquiring the src_pte
> +		 * lock. Or we might miss a new loaded swap cache folio.
> +		 */
> +		if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {

Do we need data_race() for this, if this is an intentionally lockless read?

Another pure swap question: the comment seems to imply this whole thing is
protected by src_pte lock, but is it?

I'm not familiar enough with swap code, but it looks to me the folio can be
added into swap cache and set swap_map[] with SWAP_HAS_CACHE as long as the
folio is locked.  It doesn't seem to be directly protected by pgtable lock.

Perhaps you meant this: since src_pte lock is held, then it'll serialize
with another thread B concurrently swap-in the swap entry, but only _later_
when thread B's do_swap_page() will check again on pte_same(), then it'll
see the src pte gone (after thread A uffdio_move happened releasing src_pte
lock), hence thread B will release the newly allocated swap cache folio?

There's another trivial detail that IIUC pte_same() must fail because
before/after the uffdio_move the swap entry will be occupied so no way to
have it reused, hence src_pte, even if re-populated again after uffdio_move
succeeded, cannot become the orig_pte (points to the swap entry in
question) that thread B read, hence pte_same() must check fail.

I'm not sure my understanding is correct, though.  Maybe some richer
comment would always help.

Thanks,

> +			double_pt_unlock(dst_ptl, src_ptl);
> +			return -EAGAIN;
> +		}
>  	}
>  
>  	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1412,7 +1431,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>  		}
>  		err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
>  				orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> -				dst_ptl, src_ptl, src_folio);
> +				dst_ptl, src_ptl, src_folio, si, entry);
>  	}
>  
>  out:
> -- 
> 2.49.0
> 
> 

-- 
Peter Xu

next prev parent reply	other threads:[~2025-06-02 20:34 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-02 18:14 [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache Kairui Song
2025-06-02 20:17 ` Lokesh Gidra
2025-06-02 20:34 ` Peter Xu [this message]
2025-06-02 22:08   ` Barry Song
2025-06-03 11:48     ` Kairui Song
2025-06-03 16:42       ` Peter Xu
2025-06-03  7:03   ` Kairui Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aD4KyHz_H5WPLLf4@x1.local \
    --to=peterx@redhat.com \
    --cc=21cnbao@gmail.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=kasong@tencent.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lokeshgidra@google.com \
    --cc=stable@vger.kernel.org \
    --cc=surenb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.