Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Lorenzo Stoakes <ljs@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	 David Carlier <devnexen@gmail.com>,
	David Hildenbrand <david@kernel.org>,
	 Heechan Kang <gganji11@naver.com>,
	"Liam R. Howlett" <liam@infradead.org>,
	 Michael Bommarito <michael.bommarito@gmail.com>,
	Peter Xu <peterx@redhat.com>,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org
Subject: Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
Date: Tue, 26 May 2026 16:12:56 +0100	[thread overview]
Message-ID: <ahWh7jsmLfY5awPR@lucifer> (raw)
In-Reply-To: <20260519052516.3315196-1-rppt@kernel.org>

On Tue, May 19, 2026 at 08:25:16AM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
> reacquires it afterwards. The destination VMA can be replaced during that
> window.
>
> The existing check compares vma_uffd_ops() before and after the retry, but
> if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
> MAP_PRIVATE (or vice versa) the replacement goes undetected.
>
> The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
> with shmem_alloc_folio() as anonymous and this will cause BUG() when
> mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
>
> The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
> the page cache of the original VMA.

OK this seems like a sensible subset of things to check - we are looking to
check what might materially impact _this particular operation_, and working
to the absolute minimum we need to check.

>
> Introduce helpers for more comprehensive comparison of VMA state:
> - vma_snapshot_get() to save the relevant VMA state into a struct
>   vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
>   vm_file and pgoff) before dropping the lock
> - vma_snapshot_changed() to compare the saved state with the state of the
>   VMA acquired after retaking the locks
> - vma_snapshot_put() to release vm_file pinning.

I feel like this is possibly a little misleading, it sort of implies that
this is a broader kind of snapshot rather than some uffd-specific thing.

This is more of a naming thing, yes things being defined in
mm/userfaultfd.c tells you it's uffd-specific, but people are easily
confused by stuff like this.

Will comment inline.

>
> Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
> error handling paths in mfill_copy_folio_retry().

That's nice!

>
> Add vma_uffd_copy_ops() to avoid code duplication when original ops of
> shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.

I think this is a bit overly brief. And is it only shmem where this might
happen? Is MAP_PRIVATE hugetlbfs not also possible?

>
> Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
> Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")

Hmm this is the second multiple-fixes patch I've seen, maybe that is OK to
do :) I guess if this one, indivisble change fixes both then that's valid.

> Tested-by: Heechan Kang <gganji11@naver.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Co-developed-by: David Carlier <devnexen@gmail.com>
> Signed-off-by: David Carlier <devnexen@gmail.com>
> Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 79 insertions(+), 20 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 180bad42fc79..b70b84776a79 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -14,6 +14,8 @@
>  #include <linux/userfaultfd_k.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
> +#include <linux/file.h>
> +#include <linux/cleanup.h>
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
>  	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
>  }
>
> +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)

I find this a bit confusing, these are the uffd ops for... copy? Or you're
copying uffd ops?

I'd maybe rename it to vma_uffd_ops_on_copy() to make it clear which it is,
and add a comment like:

/*
 * Determine the ops to use when performing a UFFDIO_COPY operation - this
 * specifically handles the case of a MAP_PRIVATE file-backed mapping,
 * which, upon copy, is CoW'd into anonymous memory.
 */

From the snapshot point of view, I have been pondering if this was the
clearest way of checking things because you are, in effect checking a
specific case

it, but after much back and forth, perhaps it's not too crazy... you are in
most cases doing a duplicate check to the ops check

> +{
> +	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
> +
> +	if (!ops)
> +		return NULL;

Is this correct? The code you removed from mfill_atomic_pte_copy()
unconditionally replaces the ops with &anon_uffd_ops, regardless of whether
any operations were specified, but now not providing ops avoids that?

I mean it's correct if we can never have !ops, and presumably we can't
because __mfill_atomic_pte() unconditionally calls ops->alloc_folio() right
at the top of the logic?

Doesn't uffd only support shmem, anon + hugetlbfs? So why are we adding
!ops checks at all?

Actually claude (:>) tells me that async-WP memory mode (ugh god) can be
used for _anything_:

	/*
	 * If WP is the only mode enabled and context is wp async, allow any
	 * memory type.
	 */
	if (wp_async && (vm_flags == VM_UFFD_WP))
		return true;

But. This function is never called in anything but a copy context?

I'm not sure if you were protecting against a possible future usage? But in
that case surely you'd just want to VM_WARN_ON_ONCE(!ops)?

We should also have a comment describing the situation like:

	/*
	 * Only async WP allows arbitrary file-backed mappings for UFFD,
	 * which is the only case in which ops can be NULL. However,
	 * we are copying here which is not permitted for these
         * mappings, so this will never be the case here.
	 */
	 VM_WARN_ON_ONCE(!ops);

Also in:

static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
{
	if (vma_is_anonymous(vma))
		return &anon_uffd_ops;
	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
}

This is doing a redundant check _and_ making life confusing, as if
!vma->vm_ops is a condition that can be reached there, it can't, as
vma_is_anonymous() is literally a !vma->vm_ops check :)

So surely that should be:

static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
{
	if (vma_is_anonymous(vma))
		return &anon_uffd_ops;

	return vma->vm_ops->uffd_ops;
}

I mean obviously that can be a separate patch, but that's not helped
make things clear here! :)

> +
> +	/*
> +	 * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
> +	 * memory. This is an effective ops override, so retry validation must
> +	 * compare the override result, not just vma->vm_ops->uffd_ops.
> +	 */

The wording here is super confusing. 'This is an effective ops override'
and 'retry validation' and also the 'not just'.

Also it makes zero reference to the invocation of this function from
mfill_atomic_pte_copy() - this function is pulling double-duty so this
comment isn't really quite correct.

I'd maybe pull up the comment from mfill_atomic_pte_copy() (and adapt it to
suit the changes, also mentioning the snapshot use case).

> +	if (!(vma->vm_flags & VM_SHARED))
> +		return &anon_uffd_ops;

This is inconsistent with the rest of the VMA flags API usage, you want:

	if (!vma_test(vma, VMA_SHARED_BIT))
		return &anon_uffd_ops;

> +
> +	return ops;
> +}
> +
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
>  {
> @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
>  	return ret;
>  }
>
> +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)

Hm it's a bit weird to have a VMA flags define outside of mm.h.

But I definitely appreciate you using the new VMA flags stuff :)

Probably just renaming to VMA_UFFD_COPY_STATE_FLAGS or something?

> +

I'm a bit of a broken record I know :) but a comment here would be helpful
describing what this is for.

> +struct vma_snapshot {
> +	const struct vm_uffd_ops *copy_ops;
> +	const struct vm_uffd_ops *ops;
> +	struct file *file;
> +	vma_flags_t flags;
> +	pgoff_t pgoff;
> +};

Yeah I don't love this naming, you're not snapshotting the VMA at all in
any real sense, you're comparing a subset of things specifically for the
UFFDIO_COPY case.

Which is fine - but I think we should just make that clear here, because
perhaps in the future we'll want a generic version of something like this
and this is going to be super confusing if we do...

vma_uffdio_copy_snapshot maybe ?

Maybe ok to leave the functions the same name to avoid a mouthful, the type
name should make it clear what's going on.

> +
> +static void vma_snapshot_get(struct vma_snapshot *s, struct vm_area_struct *vma)

No single letter var please! This isn't plan9 :)

> +{
> +	s->flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> +	s->copy_ops = vma_uffd_copy_ops(vma);
> +	s->ops = vma_uffd_ops(vma);
> +	s->pgoff = vma->vm_pgoff;
> +
> +	if (vma->vm_file)
> +		s->file = get_file(vma->vm_file);
> +}
> +
> +static bool vma_snapshot_changed(struct vma_snapshot *s,

> +				 struct vm_area_struct *vma)

Can the VMA potentially be any arbitrary VMA that might now be remapped in
place? Like anon vs. file and same file and offset?

> +{
> +	vma_flags_t flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> +

You have a comment for each of the other cases, so one here is useful I
think, e.g.:

	/* Have any UFFD flags (missing, WP, minor) changed? */

> +	if (!vma_flags_same_pair(&s->flags, &flags))
> +		return true;

> +
> +	/* VMA type or effective uffd_ops changed while the lock was dropped */

Can the 'effective' (probably better to say copy or use effective
throughout) uffd_ops change though? You already check that VMA_SHARED_BIT
is either set or unset across the lock drop above so is the copy_ops check
even meaningful here?

> +	if (s->ops != vma_uffd_ops(vma) || s->copy_ops != vma_uffd_copy_ops(vma))
> +		return true;

Surely we should also check that the UFFD context matches (accounting for
NULL cases)?

Could we not otherwise have weird situations across UFFD contexts if both
VMAs happen to contain identitical properties otherwise?

This seems to be a canonical part of identifying a UFFD VMA so strange to
not see it here?

> +
> +	/* VMA was anonymous before; changed only if it no longer is */
> +	if (!s->file)
> +		return !vma_is_anonymous(vma);
> +
> +	/* VMA was file backed, but inode or offset has changed */
> +	if (!vma->vm_file || vma->vm_file->f_inode != s->file->f_inode ||
> +	    vma->vm_pgoff != s->pgoff)
> +		return true;

Hm so are we OK with a VMA where the file has been closed, then the same
inode re-opened at the exact same offset? (e.g. vma->vm_file differs)

Would this be problematic with hugetlbfs or shmem?

I guess we pin the inode so the f_inode check _should_ be valid, from that
side.

> +
> +	return false;
> +}
> +
> +static void vma_snapshot_put(struct vma_snapshot *s)
> +{
> +	if (s->file)
> +		fput(s->file);
> +}
> +
> +DEFINE_FREE(snapshot_put, struct vma_snapshot *, if (_T) vma_snapshot_put(_T));
> +
>  static int mfill_copy_folio_retry(struct mfill_state *state,
>  				  struct folio *folio)
>  {
> -	const struct vm_uffd_ops *orig_ops = vma_uffd_ops(state->vma);
> +	struct vma_snapshot s = { 0 };
> +	struct vma_snapshot *p __free(snapshot_put) = &s;

Could we avoid these single letter var names?

>  	unsigned long src_addr = state->src_addr;
>  	void *kaddr;
>  	int err;
>
> +	vma_snapshot_get(&s, state->vma);
> +
>  	/* retry copying with mm_lock dropped */
>  	mfill_put_vma(state);
>
> @@ -467,12 +543,7 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
>  	if (err)
>  		return err;
>
> -	/*
> -	 * The VMA type may have changed while the lock was dropped
> -	 * (e.g. replaced with a hugetlb mapping), making the caller's
> -	 * ops pointer stale.
> -	 */

This comment we can lose given we are explicitly comparing snapshots, but I
think it's not great to explicitly point out that we're doing this because
of the lock being dropped, we should say that somewhere.

> -	if (vma_uffd_ops(state->vma) != orig_ops)
> +	if (vma_snapshot_changed(&s, state->vma))
>  		return -EAGAIN;
>
>  	err = mfill_establish_pmd(state);
> @@ -545,19 +616,7 @@ static int __mfill_atomic_pte(struct mfill_state *state,
>
>  static int mfill_atomic_pte_copy(struct mfill_state *state)
>  {
> -	const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
> -
> -	/*
> -	 * The normal page fault path for a MAP_PRIVATE mapping in a
> -	 * file-backed VMA will invoke the fault, fill the hole in the file and
> -	 * COW it right away. The result generates plain anonymous memory.
> -	 * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll
> -	 * generate anonymous memory directly without actually filling the
> -	 * hole. For the MAP_PRIVATE case the robustness check only happens in
> -	 * the pagetable (to verify it's still none) and not in the page cache.
> -	 */

(As mentioned above) I hate that we lose this comment (or the gist of it at
least), can we please move it up to the copy ops function?

> -	if (!(state->vma->vm_flags & VM_SHARED))
> -		ops = &anon_uffd_ops;

A comment here would maybe be useful. But with a function rename and a
comment at the function itself maybe OK without.

> +	const struct vm_uffd_ops *ops = vma_uffd_copy_ops(state->vma);

Can we please have a:

	/* Only async W/P can have missing ops, and that can't copy. */
	VM_WARN_ON_ONCE(!ops);

Check here to make it absolutely clear you're not running straight into a
NULL pointer deref in __mfill_atomic_pte()? Maybe worth putting a similar
assert there too.

>
>  	return __mfill_atomic_pte(state, ops);
>  }
>
> base-commit: 444fc9435e57157fcf30fc99aee44997f3458641
> --
> 2.53.0
>

Thanks, Lorenzo

next prev parent reply	other threads:[~2026-05-26 15:13 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-19  5:25 [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry Mike Rapoport
2026-05-19  5:36 ` David CARLIER
2026-05-20 12:40   ` Mike Rapoport
2026-05-20 11:09 ` David Hildenbrand (Arm)
2026-05-20 12:53   ` Mike Rapoport
2026-05-20 13:48     ` David Hildenbrand (Arm)
2026-05-20 14:03       ` David Hildenbrand (Arm)
2026-05-26 15:23         ` Lorenzo Stoakes
2026-05-20 14:12       ` Mike Rapoport
2026-05-20 14:38         ` David Hildenbrand (Arm)
2026-05-25 17:12           ` Liam R. Howlett
2026-05-26 12:47             ` David Hildenbrand (Arm)
2026-05-26 15:25               ` Lorenzo Stoakes
2026-05-26 19:06                 ` Liam R. Howlett
2026-05-26 15:12 ` Lorenzo Stoakes [this message]
2026-05-26 17:30   ` Mike Rapoport
2026-05-27 16:08     ` Lorenzo Stoakes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ahWh7jsmLfY5awPR@lucifer \
    --to=ljs@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=david@kernel.org \
    --cc=devnexen@gmail.com \
    --cc=gganji11@naver.com \
    --cc=liam@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=michael.bommarito@gmail.com \
    --cc=peterx@redhat.com \
    --cc=rppt@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox