Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Lorenzo Stoakes <ljs@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	 David Carlier <devnexen@gmail.com>,
	David Hildenbrand <david@kernel.org>,
	 Heechan Kang <gganji11@naver.com>,
	"Liam R. Howlett" <liam@infradead.org>,
	 Michael Bommarito <michael.bommarito@gmail.com>,
	Peter Xu <peterx@redhat.com>,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org
Subject: Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
Date: Wed, 27 May 2026 17:08:23 +0100	[thread overview]
Message-ID: <ahcUDJ-eZP2GRwOJ@lucifer> (raw)
In-Reply-To: <ahXYuNSxe7FaDHE-@kernel.org>

On Tue, May 26, 2026 at 08:30:32PM +0300, Mike Rapoport wrote:
> On Tue, May 26, 2026 at 04:12:56PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 19, 2026 at 08:25:16AM +0300, Mike Rapoport wrote:
> > > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > >
> > > mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
> > > reacquires it afterwards. The destination VMA can be replaced during that
> > > window.
> > >
> > > The existing check compares vma_uffd_ops() before and after the retry, but
> > > if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
> > > MAP_PRIVATE (or vice versa) the replacement goes undetected.
> > >
> > > The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
> > > with shmem_alloc_folio() as anonymous and this will cause BUG() when
> > > mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
> > >
> > > The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
> > > the page cache of the original VMA.
> >
> > OK this seems like a sensible subset of things to check - we are looking to
> > check what might materially impact _this particular operation_, and working
> > to the absolute minimum we need to check.
> >
> > >
> > > Introduce helpers for more comprehensive comparison of VMA state:
> > > - vma_snapshot_get() to save the relevant VMA state into a struct
> > >   vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
> > >   vm_file and pgoff) before dropping the lock
> > > - vma_snapshot_changed() to compare the saved state with the state of the
> > >   VMA acquired after retaking the locks
> > > - vma_snapshot_put() to release vm_file pinning.
> >
> > I feel like this is possibly a little misleading, it sort of implies that
> > this is a broader kind of snapshot rather than some uffd-specific thing.
> >
> > This is more of a naming thing, yes things being defined in
> > mm/userfaultfd.c tells you it's uffd-specific, but people are easily
> > confused by stuff like this.
> >
> > Will comment inline.
>
> tl;dr:
> * will update comments
> * will rename vma_uffd_copy_ops() and replace NULL checks with VM_WARN
> * will change some of single letter variable names.

Ack, thanks!

>
> > >
> > > Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
> > > error handling paths in mfill_copy_folio_retry().
> >
> > That's nice!
> >
> > >
> > > Add vma_uffd_copy_ops() to avoid code duplication when original ops of
> > > shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.
> >
> > I think this is a bit overly brief. And is it only shmem where this might
> > happen? Is MAP_PRIVATE hugetlbfs not also possible?
>
> No, hugetlb is not applicabe here, it was not changed to use
> mfill_copy_folio_retry().

OK, maybe worth a mention?

>
> > >
> > > Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
> > > Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
> >
> > Hmm this is the second multiple-fixes patch I've seen, maybe that is OK to
> > do :) I guess if this one, indivisble change fixes both then that's valid.
> >
> > > Tested-by: Heechan Kang <gganji11@naver.com>
> > > Suggested-by: Peter Xu <peterx@redhat.com>
> > > Co-developed-by: David Carlier <devnexen@gmail.com>
> > > Signed-off-by: David Carlier <devnexen@gmail.com>
> > > Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
> > > Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> > > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > > ---
> > >  mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
> > >  1 file changed, 79 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index 180bad42fc79..b70b84776a79 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -14,6 +14,8 @@
> > >  #include <linux/userfaultfd_k.h>
> > >  #include <linux/mmu_notifier.h>
> > >  #include <linux/hugetlb.h>
> > > +#include <linux/file.h>
> > > +#include <linux/cleanup.h>
> > >  #include <asm/tlbflush.h>
> > >  #include <asm/tlb.h>
> > >  #include "internal.h"
> > > @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> > >  	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
> > >  }
> > >
> > > +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
> >
> > I find this a bit confusing, these are the uffd ops for... copy? Or you're
> > copying uffd ops?
> >
> > I'd maybe rename it to vma_uffd_ops_on_copy() to make it clear which it is,
> > and add a comment like:
>
> vma_uffd_ops_for_copy()?

Works for me!

>
> > /*
> >  * Determine the ops to use when performing a UFFDIO_COPY operation - this
> >  * specifically handles the case of a MAP_PRIVATE file-backed mapping,
> >  * which, upon copy, is CoW'd into anonymous memory.
> >  */
>
> I'll pick the large comment that got dropped and stick it here.

Ack thanks.

>
> > From the snapshot point of view, I have been pondering if this was the
> > clearest way of checking things because you are, in effect checking a
> > specific case
> >
> > it, but after much back and forth, perhaps it's not too crazy... you are in
> > most cases doing a duplicate check to the ops check
> >
> >
> >
> > > +{
> > > +	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
> > > +
> > > +	if (!ops)
> > > +		return NULL;
> >
> > Is this correct? The code you removed from mfill_atomic_pte_copy()
> > unconditionally replaces the ops with &anon_uffd_ops, regardless of whether
> > any operations were specified, but now not providing ops avoids that?
> >
> > I mean it's correct if we can never have !ops, and presumably we can't
> > because __mfill_atomic_pte() unconditionally calls ops->alloc_folio() right
> > at the top of the logic?
> >
> > Doesn't uffd only support shmem, anon + hugetlbfs? So why are we adding
> > !ops checks at all?
> >
> > Actually claude (:>) tells me that async-WP memory mode (ugh god) can be
> > used for _anything_:
> >
> > 	/*
> > 	 * If WP is the only mode enabled and context is wp async, allow any
> > 	 * memory type.
> > 	 */
> > 	if (wp_async && (vm_flags == VM_UFFD_WP))
> > 		return true;
> >
> > But. This function is never called in anything but a copy context?
>
> wp_async has nothing to do with copy (or zeropage).

Yup, so ops can't be NULL, and all the NULL checks are confusing. Covered
elsewhere.

>
> > I'm not sure if you were protecting against a possible future usage? But in
> > that case surely you'd just want to VM_WARN_ON_ONCE(!ops)?
> >
> > We should also have a comment describing the situation like:
> >
> > 	/*
> > 	 * Only async WP allows arbitrary file-backed mappings for UFFD,
> > 	 * which is the only case in which ops can be NULL. However,
> > 	 * we are copying here which is not permitted for these
> >          * mappings, so this will never be the case here.
> > 	 */
>
> I can live with one sentence tops here ;-)

It took me a _long_ time to understand what you were doing here, there's a
balance to be had in commenting, but right now it's tilted towards 'nobody but
Mike understands this'.

I can tell, because nobody else bothered to review this in detail ;)

I guess I'll see what the respin looks like.

>
> > 	 VM_WARN_ON_ONCE(!ops);
> >
> > Also in:
> >
> > static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> > {
> > 	if (vma_is_anonymous(vma))
> > 		return &anon_uffd_ops;
> > 	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
> > }
> >
> > This is doing a redundant check _and_ making life confusing, as if
> > !vma->vm_ops is a condition that can be reached there, it can't, as
> > vma_is_anonymous() is literally a !vma->vm_ops check :)
> >
> > So surely that should be:
> >
> > static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> > {
> > 	if (vma_is_anonymous(vma))
> > 		return &anon_uffd_ops;
> >
> > 	return vma->vm_ops->uffd_ops;
> > }
> >
> > I mean obviously that can be a separate patch, but that's not helped
> > make things clear here! :)
>
> You are welcome to send it ;-)

Well you asked for it ;)

>
> > > +
> > > +	/*
> > > +	 * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
> > > +	 * memory. This is an effective ops override, so retry validation must
> > > +	 * compare the override result, not just vma->vm_ops->uffd_ops.
> > > +	 */
> >
> > The wording here is super confusing. 'This is an effective ops override'
> > and 'retry validation' and also the 'not just'.
> >
> > Also it makes zero reference to the invocation of this function from
> > mfill_atomic_pte_copy() - this function is pulling double-duty so this
> > comment isn't really quite correct.
> >
> > I'd maybe pull up the comment from mfill_atomic_pte_copy() (and adapt it to
> > suit the changes, also mentioning the snapshot use case).
> >
> > > +	if (!(vma->vm_flags & VM_SHARED))
> > > +		return &anon_uffd_ops;
> >
> > This is inconsistent with the rest of the VMA flags API usage, you want:
>
> This is consistent with the rest of VMA usage in uffd, we can convert it
> all in one go if you'd like.

It's new code, inconsistent with literally _the rest of the patch_. I literally
wrote the code for you and still you resist? :)

Also your patch makes uffd mix both methods so it feels like you're arguing with
yourself here? :P

>
> > 	if (!vma_test(vma, VMA_SHARED_BIT))
> > 		return &anon_uffd_ops;
> >
> > > +
> > > +	return ops;
> > > +}
> > > +
> > >  static __always_inline
> > >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > >  {
> > > @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
> > >  	return ret;
> > >  }
> > >
> > > +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
> >
> > Hm it's a bit weird to have a VMA flags define outside of mm.h.
> >
> > But I definitely appreciate you using the new VMA flags stuff :)
> >
> > Probably just renaming to VMA_UFFD_COPY_STATE_FLAGS or something?
> >
> > > +
> >
> > I'm a bit of a broken record I know :) but a comment here would be helpful
> > describing what this is for.
> >
> > > +struct vma_snapshot {
> > > +	const struct vm_uffd_ops *copy_ops;
> > > +	const struct vm_uffd_ops *ops;
> > > +	struct file *file;
> > > +	vma_flags_t flags;
> > > +	pgoff_t pgoff;
> > > +};
> >
> > Yeah I don't love this naming, you're not snapshotting the VMA at all in
> > any real sense, you're comparing a subset of things specifically for the
> > UFFDIO_COPY case.
> >
> > Which is fine - but I think we should just make that clear here, because
> > perhaps in the future we'll want a generic version of something like this
> > and this is going to be super confusing if we do...
>
> Let's deal with it in the future then. This is a local structure, used by a
> few local functions.

I'd really rather we rename it. This is asking for confusion...

>
> > vma_uffdio_copy_snapshot maybe ?
> >
> > Maybe ok to leave the functions the same name to avoid a mouthful, the type
> > name should make it clear what's going on.
> >
> > > +
> > > +static void vma_snapshot_get(struct vma_snapshot *s, struct vm_area_struct *vma)
> >
> > No single letter var please! This isn't plan9 :)
>
> "C is a Spartan language, and your naming conventions should follow suit"
> https://docs.kernel.org/process/coding-style.html#naming
>
> Here it's perfectly understandable what 's' means.

That very document goes on to give suggestions that aren't single letter :)

I'm not going to die on this hill for a small function like this though, if you
are adamant!

Hey - let's not miss the wood for the trees? ;)

>
> > > +{
> > > +	s->flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> > > +	s->copy_ops = vma_uffd_copy_ops(vma);
> > > +	s->ops = vma_uffd_ops(vma);
> > > +	s->pgoff = vma->vm_pgoff;
> > > +
> > > +	if (vma->vm_file)
> > > +		s->file = get_file(vma->vm_file);
> > > +}
> > > +
> > > +static bool vma_snapshot_changed(struct vma_snapshot *s,
> >
> > > +				 struct vm_area_struct *vma)
> >
> > Can the VMA potentially be any arbitrary VMA that might now be remapped in
> > place? Like anon vs. file and same file and offset?
> >
> >
> > > +{
> > > +	vma_flags_t flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> > > +
> >
> > You have a comment for each of the other cases, so one here is useful I
> > think, e.g.:
> >
> > 	/* Have any UFFD flags (missing, WP, minor) changed? */
> >
> > > +	if (!vma_flags_same_pair(&s->flags, &flags))
> > > +		return true;
> >
> > > +
> > > +	/* VMA type or effective uffd_ops changed while the lock was dropped */
> >
> > Can the 'effective' (probably better to say copy or use effective
> > throughout) uffd_ops change though? You already check that VMA_SHARED_BIT
> > is either set or unset across the lock drop above so is the copy_ops check
> > even meaningful here?
>
> Yes, it can change if a shmem VMA changes from SHARED to PRIVATE or vice
> versa.

How?

	if (!vma_flags_same_pair(&s->flags, &flags))
		return true;

Checks against VMA_SNAPSHOT_FLAGS which includes VMA_SHARED_BIT. So that would
catch it already no?

>
> > > +	if (s->ops != vma_uffd_ops(vma) || s->copy_ops != vma_uffd_copy_ops(vma))
> > > +		return true;
> > Surely we should also check that the UFFD context matches (accounting for
> > NULL cases)?
>
> It's checked in mfill_get_vma().

Yeah it's a bit strange not to do that on a validation step though, would prefer
we live with the duplication or at least add a comment saying it's checked.

>
> > Could we not otherwise have weird situations across UFFD contexts if both
> > VMAs happen to contain identitical properties otherwise?
> >
> > This seems to be a canonical part of identifying a UFFD VMA so strange to
> > not see it here?
> >
> > > +
> > > +	/* VMA was anonymous before; changed only if it no longer is */
> > > +	if (!s->file)
> > > +		return !vma_is_anonymous(vma);
> > > +
> > > +	/* VMA was file backed, but inode or offset has changed */
> > > +	if (!vma->vm_file || vma->vm_file->f_inode != s->file->f_inode ||
> > > +	    vma->vm_pgoff != s->pgoff)
> > > +		return true;
> >
> > Hm so are we OK with a VMA where the file has been closed, then the same
> > inode re-opened at the exact same offset? (e.g. vma->vm_file differs)
>
> The copy will go to exactly the same place in that case.

I mean OK, but maybe it's better to be conserative? Is just a bit weird to
tolerate that when we pin the file we could just check file equality?

>
> > Would this be problematic with hugetlbfs or shmem?
> >
> > I guess we pin the inode so the f_inode check _should_ be valid, from that
> > side.
> >
> > > +
> > > +	return false;
> > > +}
> > > +
> > > +static void vma_snapshot_put(struct vma_snapshot *s)
> > > +{
> > > +	if (s->file)
> > > +		fput(s->file);
> > > +}
> > > +
> > > +DEFINE_FREE(snapshot_put, struct vma_snapshot *, if (_T) vma_snapshot_put(_T));
> > > +
> > >  static int mfill_copy_folio_retry(struct mfill_state *state,
> > >  				  struct folio *folio)
> > >  {
> > > -	const struct vm_uffd_ops *orig_ops = vma_uffd_ops(state->vma);
> > > +	struct vma_snapshot s = { 0 };
> > > +	struct vma_snapshot *p __free(snapshot_put) = &s;
> >
> > Could we avoid these single letter var names?
>
> You suggest to call it a_dummy_pointer_to_allow_use_of_free_cleanup? ;-P

Or, and I know it's _crazy_, but like 'snapshot' and 'prev'(or whatever p is
supposed to be - see the problem??), or something vaguely greppable? :)

> >
> > >  	unsigned long src_addr = state->src_addr;
> > >  	void *kaddr;
> > >  	int err;
> > >
> > > +	vma_snapshot_get(&s, state->vma);
> > > +
> > >  	/* retry copying with mm_lock dropped */
> > >  	mfill_put_vma(state);
> > >
> > > @@ -467,12 +543,7 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
> > >  	if (err)
> > >  		return err;
> > >
> > > -	/*
> > > -	 * The VMA type may have changed while the lock was dropped
> > > -	 * (e.g. replaced with a hugetlb mapping), making the caller's
> > > -	 * ops pointer stale.
> > > -	 */
> >
> > This comment we can lose given we are explicitly comparing snapshots, but I
> > think it's not great to explicitly point out that we're doing this because
> > of the lock being dropped, we should say that somewhere.
> >
> > > -	if (vma_uffd_ops(state->vma) != orig_ops)
> > > +	if (vma_snapshot_changed(&s, state->vma))
> > >  		return -EAGAIN;
> > >
> > >  	err = mfill_establish_pmd(state);
> > > @@ -545,19 +616,7 @@ static int __mfill_atomic_pte(struct mfill_state *state,
> > >
> > >  static int mfill_atomic_pte_copy(struct mfill_state *state)
> > >  {
> > > -	const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
> > > -
> > > -	/*
> > > -	 * The normal page fault path for a MAP_PRIVATE mapping in a
> > > -	 * file-backed VMA will invoke the fault, fill the hole in the file and
> > > -	 * COW it right away. The result generates plain anonymous memory.
> > > -	 * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll
> > > -	 * generate anonymous memory directly without actually filling the
> > > -	 * hole. For the MAP_PRIVATE case the robustness check only happens in
> > > -	 * the pagetable (to verify it's still none) and not in the page cache.
> > > -	 */
> >
> > (As mentioned above) I hate that we lose this comment (or the gist of it at
> > least), can we please move it up to the copy ops function?
>
> Will move it before the copy_ops()

Thanks

>
> > > -	if (!(state->vma->vm_flags & VM_SHARED))
> > > -		ops = &anon_uffd_ops;
> >
> > A comment here would maybe be useful. But with a function rename and a
> > comment at the function itself maybe OK without.
> >
> > > +	const struct vm_uffd_ops *ops = vma_uffd_copy_ops(state->vma);
> >
> > Can we please have a:
> >
> > 	/* Only async W/P can have missing ops, and that can't copy. */
> > 	VM_WARN_ON_ONCE(!ops);
>
> It would be the third time :-D

<insert old addage about repetition etc. etc.>

>
> > Check here to make it absolutely clear you're not running straight into a
> > NULL pointer deref in __mfill_atomic_pte()? Maybe worth putting a similar
> > assert there too.
> >
> > >
> > >  	return __mfill_atomic_pte(state, ops);
> > >  }
> > >
> > > base-commit: 444fc9435e57157fcf30fc99aee44997f3458641
> > > --
> > > 2.53.0
> > >
> >
> > Thanks, Lorenzo
>
> --
> Sincerely yours,
> Mike.

Cheers, Lorenzo

     prev parent reply	other threads:[~2026-05-27 16:08 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-19  5:25 [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry Mike Rapoport
2026-05-19  5:36 ` David CARLIER
2026-05-20 12:40   ` Mike Rapoport
2026-05-20 11:09 ` David Hildenbrand (Arm)
2026-05-20 12:53   ` Mike Rapoport
2026-05-20 13:48     ` David Hildenbrand (Arm)
2026-05-20 14:03       ` David Hildenbrand (Arm)
2026-05-26 15:23         ` Lorenzo Stoakes
2026-05-20 14:12       ` Mike Rapoport
2026-05-20 14:38         ` David Hildenbrand (Arm)
2026-05-25 17:12           ` Liam R. Howlett
2026-05-26 12:47             ` David Hildenbrand (Arm)
2026-05-26 15:25               ` Lorenzo Stoakes
2026-05-26 19:06                 ` Liam R. Howlett
2026-05-26 15:12 ` Lorenzo Stoakes
2026-05-26 17:30   ` Mike Rapoport
2026-05-27 16:08     ` Lorenzo Stoakes [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ahcUDJ-eZP2GRwOJ@lucifer \
    --to=ljs@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=david@kernel.org \
    --cc=devnexen@gmail.com \
    --cc=gganji11@naver.com \
    --cc=liam@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=michael.bommarito@gmail.com \
    --cc=peterx@redhat.com \
    --cc=rppt@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox