Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Mike Rapoport <rppt@kernel.org>
To: Lorenzo Stoakes <ljs@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Carlier <devnexen@gmail.com>,
	David Hildenbrand <david@kernel.org>,
	Heechan Kang <gganji11@naver.com>,
	"Liam R. Howlett" <liam@infradead.org>,
	Michael Bommarito <michael.bommarito@gmail.com>,
	Peter Xu <peterx@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
Date: Tue, 26 May 2026 20:30:32 +0300	[thread overview]
Message-ID: <ahXYuNSxe7FaDHE-@kernel.org> (raw)
In-Reply-To: <ahWh7jsmLfY5awPR@lucifer>

On Tue, May 26, 2026 at 04:12:56PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 19, 2026 at 08:25:16AM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
> > reacquires it afterwards. The destination VMA can be replaced during that
> > window.
> >
> > The existing check compares vma_uffd_ops() before and after the retry, but
> > if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
> > MAP_PRIVATE (or vice versa) the replacement goes undetected.
> >
> > The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
> > with shmem_alloc_folio() as anonymous and this will cause BUG() when
> > mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
> >
> > The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
> > the page cache of the original VMA.
> 
> OK this seems like a sensible subset of things to check - we are looking to
> check what might materially impact _this particular operation_, and working
> to the absolute minimum we need to check.
> 
> >
> > Introduce helpers for more comprehensive comparison of VMA state:
> > - vma_snapshot_get() to save the relevant VMA state into a struct
> >   vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
> >   vm_file and pgoff) before dropping the lock
> > - vma_snapshot_changed() to compare the saved state with the state of the
> >   VMA acquired after retaking the locks
> > - vma_snapshot_put() to release vm_file pinning.
> 
> I feel like this is possibly a little misleading, it sort of implies that
> this is a broader kind of snapshot rather than some uffd-specific thing.
> 
> This is more of a naming thing, yes things being defined in
> mm/userfaultfd.c tells you it's uffd-specific, but people are easily
> confused by stuff like this.
> 
> Will comment inline.

tl;dr:
* will update comments
* will rename vma_uffd_copy_ops() and replace NULL checks with VM_WARN
* will change some of single letter variable names.
 
> >
> > Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
> > error handling paths in mfill_copy_folio_retry().
> 
> That's nice!
> 
> >
> > Add vma_uffd_copy_ops() to avoid code duplication when original ops of
> > shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.
> 
> I think this is a bit overly brief. And is it only shmem where this might
> happen? Is MAP_PRIVATE hugetlbfs not also possible?

No, hugetlb is not applicabe here, it was not changed to use
mfill_copy_folio_retry().
 
> >
> > Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
> > Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
> 
> Hmm this is the second multiple-fixes patch I've seen, maybe that is OK to
> do :) I guess if this one, indivisble change fixes both then that's valid.
> 
> > Tested-by: Heechan Kang <gganji11@naver.com>
> > Suggested-by: Peter Xu <peterx@redhat.com>
> > Co-developed-by: David Carlier <devnexen@gmail.com>
> > Signed-off-by: David Carlier <devnexen@gmail.com>
> > Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
> > Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 79 insertions(+), 20 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 180bad42fc79..b70b84776a79 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -14,6 +14,8 @@
> >  #include <linux/userfaultfd_k.h>
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/hugetlb.h>
> > +#include <linux/file.h>
> > +#include <linux/cleanup.h>
> >  #include <asm/tlbflush.h>
> >  #include <asm/tlb.h>
> >  #include "internal.h"
> > @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> >  	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
> >  }
> >
> > +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
> 
> I find this a bit confusing, these are the uffd ops for... copy? Or you're
> copying uffd ops?
> 
> I'd maybe rename it to vma_uffd_ops_on_copy() to make it clear which it is,
> and add a comment like:

vma_uffd_ops_for_copy()?
 
> /*
>  * Determine the ops to use when performing a UFFDIO_COPY operation - this
>  * specifically handles the case of a MAP_PRIVATE file-backed mapping,
>  * which, upon copy, is CoW'd into anonymous memory.
>  */

I'll pick the large comment that got dropped and stick it here.
 
> From the snapshot point of view, I have been pondering if this was the
> clearest way of checking things because you are, in effect checking a
> specific case
> 
> it, but after much back and forth, perhaps it's not too crazy... you are in
> most cases doing a duplicate check to the ops check
> 
> 
> 
> > +{
> > +	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
> > +
> > +	if (!ops)
> > +		return NULL;
> 
> Is this correct? The code you removed from mfill_atomic_pte_copy()
> unconditionally replaces the ops with &anon_uffd_ops, regardless of whether
> any operations were specified, but now not providing ops avoids that?
> 
> I mean it's correct if we can never have !ops, and presumably we can't
> because __mfill_atomic_pte() unconditionally calls ops->alloc_folio() right
> at the top of the logic?
> 
> Doesn't uffd only support shmem, anon + hugetlbfs? So why are we adding
> !ops checks at all?
> 
> Actually claude (:>) tells me that async-WP memory mode (ugh god) can be
> used for _anything_:
> 
> 	/*
> 	 * If WP is the only mode enabled and context is wp async, allow any
> 	 * memory type.
> 	 */
> 	if (wp_async && (vm_flags == VM_UFFD_WP))
> 		return true;
> 
> But. This function is never called in anything but a copy context?

wp_async has nothing to do with copy (or zeropage).
 
> I'm not sure if you were protecting against a possible future usage? But in
> that case surely you'd just want to VM_WARN_ON_ONCE(!ops)?
> 
> We should also have a comment describing the situation like:
> 
> 	/*
> 	 * Only async WP allows arbitrary file-backed mappings for UFFD,
> 	 * which is the only case in which ops can be NULL. However,
> 	 * we are copying here which is not permitted for these
>          * mappings, so this will never be the case here.
> 	 */

I can live with one sentence tops here ;-)

> 	 VM_WARN_ON_ONCE(!ops);
> 
> Also in:
> 
> static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> {
> 	if (vma_is_anonymous(vma))
> 		return &anon_uffd_ops;
> 	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
> }
> 
> This is doing a redundant check _and_ making life confusing, as if
> !vma->vm_ops is a condition that can be reached there, it can't, as
> vma_is_anonymous() is literally a !vma->vm_ops check :)
> 
> So surely that should be:
> 
> static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> {
> 	if (vma_is_anonymous(vma))
> 		return &anon_uffd_ops;
> 
> 	return vma->vm_ops->uffd_ops;
> }
> 
> I mean obviously that can be a separate patch, but that's not helped
> make things clear here! :)
 
You are welcome to send it ;-)

> > +
> > +	/*
> > +	 * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
> > +	 * memory. This is an effective ops override, so retry validation must
> > +	 * compare the override result, not just vma->vm_ops->uffd_ops.
> > +	 */
> 
> The wording here is super confusing. 'This is an effective ops override'
> and 'retry validation' and also the 'not just'.
> 
> Also it makes zero reference to the invocation of this function from
> mfill_atomic_pte_copy() - this function is pulling double-duty so this
> comment isn't really quite correct.
> 
> I'd maybe pull up the comment from mfill_atomic_pte_copy() (and adapt it to
> suit the changes, also mentioning the snapshot use case).
> 
> > +	if (!(vma->vm_flags & VM_SHARED))
> > +		return &anon_uffd_ops;
> 
> This is inconsistent with the rest of the VMA flags API usage, you want:

This is consistent with the rest of VMA usage in uffd, we can convert it
all in one go if you'd like.
 
> 	if (!vma_test(vma, VMA_SHARED_BIT))
> 		return &anon_uffd_ops;
> 
> > +
> > +	return ops;
> > +}
> > +
> >  static __always_inline
> >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> >  {
> > @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
> >  	return ret;
> >  }
> >
> > +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
> 
> Hm it's a bit weird to have a VMA flags define outside of mm.h.
> 
> But I definitely appreciate you using the new VMA flags stuff :)
> 
> Probably just renaming to VMA_UFFD_COPY_STATE_FLAGS or something?
>
> > +
> 
> I'm a bit of a broken record I know :) but a comment here would be helpful
> describing what this is for.
> 
> > +struct vma_snapshot {
> > +	const struct vm_uffd_ops *copy_ops;
> > +	const struct vm_uffd_ops *ops;
> > +	struct file *file;
> > +	vma_flags_t flags;
> > +	pgoff_t pgoff;
> > +};
> 
> Yeah I don't love this naming, you're not snapshotting the VMA at all in
> any real sense, you're comparing a subset of things specifically for the
> UFFDIO_COPY case.
> 
> Which is fine - but I think we should just make that clear here, because
> perhaps in the future we'll want a generic version of something like this
> and this is going to be super confusing if we do...

Let's deal with it in the future then. This is a local structure, used by a
few local functions.
 
> vma_uffdio_copy_snapshot maybe ?
> 
> Maybe ok to leave the functions the same name to avoid a mouthful, the type
> name should make it clear what's going on.
> 
> > +
> > +static void vma_snapshot_get(struct vma_snapshot *s, struct vm_area_struct *vma)
> 
> No single letter var please! This isn't plan9 :)

"C is a Spartan language, and your naming conventions should follow suit"
https://docs.kernel.org/process/coding-style.html#naming

Here it's perfectly understandable what 's' means.
 
> > +{
> > +	s->flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> > +	s->copy_ops = vma_uffd_copy_ops(vma);
> > +	s->ops = vma_uffd_ops(vma);
> > +	s->pgoff = vma->vm_pgoff;
> > +
> > +	if (vma->vm_file)
> > +		s->file = get_file(vma->vm_file);
> > +}
> > +
> > +static bool vma_snapshot_changed(struct vma_snapshot *s,
> 
> > +				 struct vm_area_struct *vma)
> 
> Can the VMA potentially be any arbitrary VMA that might now be remapped in
> place? Like anon vs. file and same file and offset?
> 
> 
> > +{
> > +	vma_flags_t flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> > +
> 
> You have a comment for each of the other cases, so one here is useful I
> think, e.g.:
> 
> 	/* Have any UFFD flags (missing, WP, minor) changed? */
> 
> > +	if (!vma_flags_same_pair(&s->flags, &flags))
> > +		return true;
> 
> > +
> > +	/* VMA type or effective uffd_ops changed while the lock was dropped */
> 
> Can the 'effective' (probably better to say copy or use effective
> throughout) uffd_ops change though? You already check that VMA_SHARED_BIT
> is either set or unset across the lock drop above so is the copy_ops check
> even meaningful here?

Yes, it can change if a shmem VMA changes from SHARED to PRIVATE or vice
versa.
 
> > +	if (s->ops != vma_uffd_ops(vma) || s->copy_ops != vma_uffd_copy_ops(vma))
> > +		return true;
> Surely we should also check that the UFFD context matches (accounting for
> NULL cases)?

It's checked in mfill_get_vma().
 
> Could we not otherwise have weird situations across UFFD contexts if both
> VMAs happen to contain identitical properties otherwise?
> 
> This seems to be a canonical part of identifying a UFFD VMA so strange to
> not see it here?
> 
> > +
> > +	/* VMA was anonymous before; changed only if it no longer is */
> > +	if (!s->file)
> > +		return !vma_is_anonymous(vma);
> > +
> > +	/* VMA was file backed, but inode or offset has changed */
> > +	if (!vma->vm_file || vma->vm_file->f_inode != s->file->f_inode ||
> > +	    vma->vm_pgoff != s->pgoff)
> > +		return true;
> 
> Hm so are we OK with a VMA where the file has been closed, then the same
> inode re-opened at the exact same offset? (e.g. vma->vm_file differs)

The copy will go to exactly the same place in that case.
 
> Would this be problematic with hugetlbfs or shmem?
> 
> I guess we pin the inode so the f_inode check _should_ be valid, from that
> side.
> 
> > +
> > +	return false;
> > +}
> > +
> > +static void vma_snapshot_put(struct vma_snapshot *s)
> > +{
> > +	if (s->file)
> > +		fput(s->file);
> > +}
> > +
> > +DEFINE_FREE(snapshot_put, struct vma_snapshot *, if (_T) vma_snapshot_put(_T));
> > +
> >  static int mfill_copy_folio_retry(struct mfill_state *state,
> >  				  struct folio *folio)
> >  {
> > -	const struct vm_uffd_ops *orig_ops = vma_uffd_ops(state->vma);
> > +	struct vma_snapshot s = { 0 };
> > +	struct vma_snapshot *p __free(snapshot_put) = &s;
> 
> Could we avoid these single letter var names?

You suggest to call it a_dummy_pointer_to_allow_use_of_free_cleanup? ;-P
> 
> >  	unsigned long src_addr = state->src_addr;
> >  	void *kaddr;
> >  	int err;
> >
> > +	vma_snapshot_get(&s, state->vma);
> > +
> >  	/* retry copying with mm_lock dropped */
> >  	mfill_put_vma(state);
> >
> > @@ -467,12 +543,7 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
> >  	if (err)
> >  		return err;
> >
> > -	/*
> > -	 * The VMA type may have changed while the lock was dropped
> > -	 * (e.g. replaced with a hugetlb mapping), making the caller's
> > -	 * ops pointer stale.
> > -	 */
> 
> This comment we can lose given we are explicitly comparing snapshots, but I
> think it's not great to explicitly point out that we're doing this because
> of the lock being dropped, we should say that somewhere.
> 
> > -	if (vma_uffd_ops(state->vma) != orig_ops)
> > +	if (vma_snapshot_changed(&s, state->vma))
> >  		return -EAGAIN;
> >
> >  	err = mfill_establish_pmd(state);
> > @@ -545,19 +616,7 @@ static int __mfill_atomic_pte(struct mfill_state *state,
> >
> >  static int mfill_atomic_pte_copy(struct mfill_state *state)
> >  {
> > -	const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
> > -
> > -	/*
> > -	 * The normal page fault path for a MAP_PRIVATE mapping in a
> > -	 * file-backed VMA will invoke the fault, fill the hole in the file and
> > -	 * COW it right away. The result generates plain anonymous memory.
> > -	 * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll
> > -	 * generate anonymous memory directly without actually filling the
> > -	 * hole. For the MAP_PRIVATE case the robustness check only happens in
> > -	 * the pagetable (to verify it's still none) and not in the page cache.
> > -	 */
> 
> (As mentioned above) I hate that we lose this comment (or the gist of it at
> least), can we please move it up to the copy ops function?

Will move it before the copy_ops()
 
> > -	if (!(state->vma->vm_flags & VM_SHARED))
> > -		ops = &anon_uffd_ops;
> 
> A comment here would maybe be useful. But with a function rename and a
> comment at the function itself maybe OK without.
> 
> > +	const struct vm_uffd_ops *ops = vma_uffd_copy_ops(state->vma);
> 
> Can we please have a:
> 
> 	/* Only async W/P can have missing ops, and that can't copy. */
> 	VM_WARN_ON_ONCE(!ops);

It would be the third time :-D
 
> Check here to make it absolutely clear you're not running straight into a
> NULL pointer deref in __mfill_atomic_pte()? Maybe worth putting a similar
> assert there too.
> 
> >
> >  	return __mfill_atomic_pte(state, ops);
> >  }
> >
> > base-commit: 444fc9435e57157fcf30fc99aee44997f3458641
> > --
> > 2.53.0
> >
> 
> Thanks, Lorenzo

-- 
Sincerely yours,
Mike.

next prev parent reply	other threads:[~2026-05-26 17:30 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-19  5:25 [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry Mike Rapoport
2026-05-19  5:36 ` David CARLIER
2026-05-20 12:40   ` Mike Rapoport
2026-05-20 11:09 ` David Hildenbrand (Arm)
2026-05-20 12:53   ` Mike Rapoport
2026-05-20 13:48     ` David Hildenbrand (Arm)
2026-05-20 14:03       ` David Hildenbrand (Arm)
2026-05-26 15:23         ` Lorenzo Stoakes
2026-05-20 14:12       ` Mike Rapoport
2026-05-20 14:38         ` David Hildenbrand (Arm)
2026-05-25 17:12           ` Liam R. Howlett
2026-05-26 12:47             ` David Hildenbrand (Arm)
2026-05-26 15:25               ` Lorenzo Stoakes
2026-05-26 19:06                 ` Liam R. Howlett
2026-05-26 15:12 ` Lorenzo Stoakes
2026-05-26 17:30   ` Mike Rapoport [this message]
2026-05-27 16:08     ` Lorenzo Stoakes
2026-05-27 19:01       ` Mike Rapoport

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ahXYuNSxe7FaDHE-@kernel.org \
    --to=rppt@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=david@kernel.org \
    --cc=devnexen@gmail.com \
    --cc=gganji11@naver.com \
    --cc=liam@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=michael.bommarito@gmail.com \
    --cc=peterx@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox