[PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
@ 2026-05-19  5:25 Mike Rapoport
  2026-05-19  5:36 ` David CARLIER
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Mike Rapoport @ 2026-05-19  5:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Carlier, David Hildenbrand, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Mike Rapoport, Peter Xu,
	linux-mm, linux-kernel

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
reacquires it afterwards. The destination VMA can be replaced during that
window.

The existing check compares vma_uffd_ops() before and after the retry, but
if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
MAP_PRIVATE (or vice versa) the replacement goes undetected.

The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
with shmem_alloc_folio() as anonymous and this will cause BUG() when
mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().

The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
the page cache of the original VMA.

Introduce helpers for more comprehensive comparison of VMA state:
- vma_snapshot_get() to save the relevant VMA state into a struct
  vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
  vm_file and pgoff) before dropping the lock
- vma_snapshot_changed() to compare the saved state with the state of the
  VMA acquired after retaking the locks
- vma_snapshot_put() to release vm_file pinning.

Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
error handling paths in mfill_copy_folio_retry().

Add vma_uffd_copy_ops() to avoid code duplication when original ops of
shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.

Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
Tested-by: Heechan Kang <gganji11@naver.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Co-developed-by: David Carlier <devnexen@gmail.com>
Signed-off-by: David Carlier <devnexen@gmail.com>
Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 79 insertions(+), 20 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 180bad42fc79..b70b84776a79 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -14,6 +14,8 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/mmu_notifier.h>
 #include <linux/hugetlb.h>
+#include <linux/file.h>
+#include <linux/cleanup.h>
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
@@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
 	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
 }
 
+static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
+{
+	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
+
+	if (!ops)
+		return NULL;
+
+	/*
+	 * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
+	 * memory. This is an effective ops override, so retry validation must
+	 * compare the override result, not just vma->vm_ops->uffd_ops.
+	 */
+	if (!(vma->vm_flags & VM_SHARED))
+		return &anon_uffd_ops;
+
+	return ops;
+}
+
 static __always_inline
 bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
 {
@@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
 	return ret;
 }
 
+#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
+
+struct vma_snapshot {
+	const struct vm_uffd_ops *copy_ops;
+	const struct vm_uffd_ops *ops;
+	struct file *file;
+	vma_flags_t flags;
+	pgoff_t pgoff;
+};
+
+static void vma_snapshot_get(struct vma_snapshot *s, struct vm_area_struct *vma)
+{
+	s->flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
+	s->copy_ops = vma_uffd_copy_ops(vma);
+	s->ops = vma_uffd_ops(vma);
+	s->pgoff = vma->vm_pgoff;
+
+	if (vma->vm_file)
+		s->file = get_file(vma->vm_file);
+}
+
+static bool vma_snapshot_changed(struct vma_snapshot *s,
+				 struct vm_area_struct *vma)
+{
+	vma_flags_t flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
+
+	if (!vma_flags_same_pair(&s->flags, &flags))
+		return true;
+
+	/* VMA type or effective uffd_ops changed while the lock was dropped */
+	if (s->ops != vma_uffd_ops(vma) || s->copy_ops != vma_uffd_copy_ops(vma))
+		return true;
+
+	/* VMA was anonymous before; changed only if it no longer is */
+	if (!s->file)
+		return !vma_is_anonymous(vma);
+
+	/* VMA was file backed, but inode or offset has changed */
+	if (!vma->vm_file || vma->vm_file->f_inode != s->file->f_inode ||
+	    vma->vm_pgoff != s->pgoff)
+		return true;
+
+	return false;
+}
+
+static void vma_snapshot_put(struct vma_snapshot *s)
+{
+	if (s->file)
+		fput(s->file);
+}
+
+DEFINE_FREE(snapshot_put, struct vma_snapshot *, if (_T) vma_snapshot_put(_T));
+
 static int mfill_copy_folio_retry(struct mfill_state *state,
 				  struct folio *folio)
 {
-	const struct vm_uffd_ops *orig_ops = vma_uffd_ops(state->vma);
+	struct vma_snapshot s = { 0 };
+	struct vma_snapshot *p __free(snapshot_put) = &s;
 	unsigned long src_addr = state->src_addr;
 	void *kaddr;
 	int err;
 
+	vma_snapshot_get(&s, state->vma);
+
 	/* retry copying with mm_lock dropped */
 	mfill_put_vma(state);
 
@@ -467,12 +543,7 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
 	if (err)
 		return err;
 
-	/*
-	 * The VMA type may have changed while the lock was dropped
-	 * (e.g. replaced with a hugetlb mapping), making the caller's
-	 * ops pointer stale.
-	 */
-	if (vma_uffd_ops(state->vma) != orig_ops)
+	if (vma_snapshot_changed(&s, state->vma))
 		return -EAGAIN;
 
 	err = mfill_establish_pmd(state);
@@ -545,19 +616,7 @@ static int __mfill_atomic_pte(struct mfill_state *state,
 
 static int mfill_atomic_pte_copy(struct mfill_state *state)
 {
-	const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
-
-	/*
-	 * The normal page fault path for a MAP_PRIVATE mapping in a
-	 * file-backed VMA will invoke the fault, fill the hole in the file and
-	 * COW it right away. The result generates plain anonymous memory.
-	 * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll
-	 * generate anonymous memory directly without actually filling the
-	 * hole. For the MAP_PRIVATE case the robustness check only happens in
-	 * the pagetable (to verify it's still none) and not in the page cache.
-	 */
-	if (!(state->vma->vm_flags & VM_SHARED))
-		ops = &anon_uffd_ops;
+	const struct vm_uffd_ops *ops = vma_uffd_copy_ops(state->vma);
 
 	return __mfill_atomic_pte(state, ops);
 }

base-commit: 444fc9435e57157fcf30fc99aee44997f3458641
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-19  5:25 [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry Mike Rapoport
@ 2026-05-19  5:36 ` David CARLIER
  2026-05-20 12:40   ` Mike Rapoport
  2026-05-20 11:09 ` David Hildenbrand (Arm)
  2026-05-26 15:12 ` Lorenzo Stoakes
  2 siblings, 1 reply; 16+ messages in thread
From: David CARLIER @ 2026-05-19  5:36 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, David Hildenbrand, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

Hi Mike,

On Tue, 19 May 2026 at 06:25, Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
> reacquires it afterwards. The destination VMA can be replaced during that
> window.
>
> The existing check compares vma_uffd_ops() before and after the retry, but
> if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
> MAP_PRIVATE (or vice versa) the replacement goes undetected.
>
> The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
> with shmem_alloc_folio() as anonymous and this will cause BUG() when
> mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
>
> The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
> the page cache of the original VMA.
>
> Introduce helpers for more comprehensive comparison of VMA state:
> - vma_snapshot_get() to save the relevant VMA state into a struct
>   vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
>   vm_file and pgoff) before dropping the lock
> - vma_snapshot_changed() to compare the saved state with the state of the
>   VMA acquired after retaking the locks
> - vma_snapshot_put() to release vm_file pinning.
>
> Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
> error handling paths in mfill_copy_folio_retry().
>
> Add vma_uffd_copy_ops() to avoid code duplication when original ops of
> shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.
>
> Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
> Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
> Tested-by: Heechan Kang <gganji11@naver.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Co-developed-by: David Carlier <devnexen@gmail.com>
> Signed-off-by: David Carlier <devnexen@gmail.com>
> Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 79 insertions(+), 20 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 180bad42fc79..b70b84776a79 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -14,6 +14,8 @@
>  #include <linux/userfaultfd_k.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
> +#include <linux/file.h>
> +#include <linux/cleanup.h>
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
>         return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
>  }
>
> +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)


My only 2 cent, I would name it vma_uffd_effective_copy_ops() instead or
a comment to highlight it is about "UFFDIO_COPY into a MAP_PRIVATE file-backed"


> +{
> +       const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
> +
> +       if (!ops)
> +               return NULL;
> +
> +       /*
> +        * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
> +        * memory. This is an effective ops override, so retry validation must
> +        * compare the override result, not just vma->vm_ops->uffd_ops.
> +        */
> +       if (!(vma->vm_flags & VM_SHARED))
> +               return &anon_uffd_ops;
> +
> +       return ops;
> +}
> +
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
>  {
> @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
>         return ret;
>  }
>
> +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
> +
> +struct vma_snapshot {
> +       const struct vm_uffd_ops *copy_ops;
> +       const struct vm_uffd_ops *ops;
> +       struct file *file;
> +       vma_flags_t flags;
> +       pgoff_t pgoff;
> +};
> +
> +static void vma_snapshot_get(struct vma_snapshot *s, struct vm_area_struct *vma)
> +{
> +       s->flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> +       s->copy_ops = vma_uffd_copy_ops(vma);
> +       s->ops = vma_uffd_ops(vma);
> +       s->pgoff = vma->vm_pgoff;
> +
> +       if (vma->vm_file)
> +               s->file = get_file(vma->vm_file);
> +}
> +
> +static bool vma_snapshot_changed(struct vma_snapshot *s,
> +                                struct vm_area_struct *vma)
> +{
> +       vma_flags_t flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> +
> +       if (!vma_flags_same_pair(&s->flags, &flags))
> +               return true;
> +
> +       /* VMA type or effective uffd_ops changed while the lock was dropped */
> +       if (s->ops != vma_uffd_ops(vma) || s->copy_ops != vma_uffd_copy_ops(vma))
> +               return true;
> +
> +       /* VMA was anonymous before; changed only if it no longer is */
> +       if (!s->file)
> +               return !vma_is_anonymous(vma);
> +
> +       /* VMA was file backed, but inode or offset has changed */
> +       if (!vma->vm_file || vma->vm_file->f_inode != s->file->f_inode ||
> +           vma->vm_pgoff != s->pgoff)
> +               return true;
> +
> +       return false;
> +}
> +
> +static void vma_snapshot_put(struct vma_snapshot *s)
> +{
> +       if (s->file)
> +               fput(s->file);
> +}
> +
> +DEFINE_FREE(snapshot_put, struct vma_snapshot *, if (_T) vma_snapshot_put(_T));
> +
>  static int mfill_copy_folio_retry(struct mfill_state *state,
>                                   struct folio *folio)
>  {
> -       const struct vm_uffd_ops *orig_ops = vma_uffd_ops(state->vma);
> +       struct vma_snapshot s = { 0 };
> +       struct vma_snapshot *p __free(snapshot_put) = &s;
>         unsigned long src_addr = state->src_addr;
>         void *kaddr;
>         int err;
>
> +       vma_snapshot_get(&s, state->vma);
> +
>         /* retry copying with mm_lock dropped */
>         mfill_put_vma(state);
>
> @@ -467,12 +543,7 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
>         if (err)
>                 return err;
>
> -       /*
> -        * The VMA type may have changed while the lock was dropped
> -        * (e.g. replaced with a hugetlb mapping), making the caller's
> -        * ops pointer stale.
> -        */
> -       if (vma_uffd_ops(state->vma) != orig_ops)
> +       if (vma_snapshot_changed(&s, state->vma))
>                 return -EAGAIN;
>
>         err = mfill_establish_pmd(state);
> @@ -545,19 +616,7 @@ static int __mfill_atomic_pte(struct mfill_state *state,
>
>  static int mfill_atomic_pte_copy(struct mfill_state *state)
>  {
> -       const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
> -
> -       /*
> -        * The normal page fault path for a MAP_PRIVATE mapping in a
> -        * file-backed VMA will invoke the fault, fill the hole in the file and
> -        * COW it right away. The result generates plain anonymous memory.
> -        * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll
> -        * generate anonymous memory directly without actually filling the
> -        * hole. For the MAP_PRIVATE case the robustness check only happens in
> -        * the pagetable (to verify it's still none) and not in the page cache.
> -        */
> -       if (!(state->vma->vm_flags & VM_SHARED))
> -               ops = &anon_uffd_ops;
> +       const struct vm_uffd_ops *ops = vma_uffd_copy_ops(state->vma);
>
>         return __mfill_atomic_pte(state, ops);
>  }
>
> base-commit: 444fc9435e57157fcf30fc99aee44997f3458641
> --
> 2.53.0
>

Cheers.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-19  5:25 [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry Mike Rapoport
  2026-05-19  5:36 ` David CARLIER
@ 2026-05-20 11:09 ` David Hildenbrand (Arm)
  2026-05-20 12:53   ` Mike Rapoport
  2026-05-26 15:12 ` Lorenzo Stoakes
  2 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20 11:09 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: David Carlier, Heechan Kang, Liam R. Howlett, Lorenzo Stoakes,
	Michael Bommarito, Peter Xu, linux-mm, linux-kernel

On 5/19/26 07:25, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
> reacquires it afterwards. The destination VMA can be replaced during that
> window.
> 
> The existing check compares vma_uffd_ops() before and after the retry, but
> if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
> MAP_PRIVATE (or vice versa) the replacement goes undetected.
> 
> The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
> with shmem_alloc_folio() as anonymous and this will cause BUG() when
> mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
> 
> The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
> the page cache of the original VMA.
> 
> Introduce helpers for more comprehensive comparison of VMA state:
> - vma_snapshot_get() to save the relevant VMA state into a struct
>   vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
>   vm_file and pgoff) before dropping the lock
> - vma_snapshot_changed() to compare the saved state with the state of the
>   VMA acquired after retaking the locks
> - vma_snapshot_put() to release vm_file pinning.
> 
> Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
> error handling paths in mfill_copy_folio_retry().
> 
> Add vma_uffd_copy_ops() to avoid code duplication when original ops of
> shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.
> 
> Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
> Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
> Tested-by: Heechan Kang <gganji11@naver.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Co-developed-by: David Carlier <devnexen@gmail.com>
> Signed-off-by: David Carlier <devnexen@gmail.com>
> Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 79 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 180bad42fc79..b70b84776a79 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -14,6 +14,8 @@
>  #include <linux/userfaultfd_k.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
> +#include <linux/file.h>
> +#include <linux/cleanup.h>
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
>  	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
>  }
>  
> +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
> +{
> +	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
> +
> +	if (!ops)
> +		return NULL;
> +
> +	/*
> +	 * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
> +	 * memory. This is an effective ops override, so retry validation must
> +	 * compare the override result, not just vma->vm_ops->uffd_ops.
> +	 */
> +	if (!(vma->vm_flags & VM_SHARED))
> +		return &anon_uffd_ops;
> +
> +	return ops;
> +}
> +
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
>  {
> @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
>  	return ret;
>  }
>  
> +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
> +
> +struct vma_snapshot {
> +	const struct vm_uffd_ops *copy_ops;
> +	const struct vm_uffd_ops *ops;
> +	struct file *file;
> +	vma_flags_t flags;
> +	pgoff_t pgoff;
> +};

As this is all uffd specific, I wonder whether that should be "struct
uffd_vma_snapshot"/"vma_uffd_snapshot" etc.

From a high level, this LGTM.

I wish we could identify relevant VMA changes more easily. Like, using a per-MM
sequence counter that we simply increment on any VMA changes.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-19  5:36 ` David CARLIER
@ 2026-05-20 12:40   ` Mike Rapoport
  0 siblings, 0 replies; 16+ messages in thread
From: Mike Rapoport @ 2026-05-20 12:40 UTC (permalink / raw)
  To: David CARLIER
  Cc: Andrew Morton, David Hildenbrand, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

Hi David,

On Tue, May 19, 2026 at 06:36:23AM +0100, David CARLIER wrote:
> On Tue, 19 May 2026 at 06:25, Mike Rapoport <rppt@kernel.org> wrote:
>
> > @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> >         return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
> >  }
> >
> > +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
> 
> 
> My only 2 cent, I would name it vma_uffd_effective_copy_ops() instead or
> a comment to highlight it is about "UFFDIO_COPY into a MAP_PRIVATE file-backed"
 
Too long for my taste :)
And comment is useful anyway as it explains why we override the ops at all.
 
> > +{
> > +       const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
> > +
> > +       if (!ops)
> > +               return NULL;
> > +
> > +       /*
> > +        * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
> > +        * memory. This is an effective ops override, so retry validation must
> > +        * compare the override result, not just vma->vm_ops->uffd_ops.
> > +        */
> > +       if (!(vma->vm_flags & VM_SHARED))
> > +               return &anon_uffd_ops;
> > +
> > +       return ops;
> > +}

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-20 11:09 ` David Hildenbrand (Arm)
@ 2026-05-20 12:53   ` Mike Rapoport
  2026-05-20 13:48     ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Rapoport @ 2026-05-20 12:53 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, David Carlier, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On Wed, May 20, 2026 at 01:09:04PM +0200, David Hildenbrand (Arm) wrote:
> On 5/19/26 07:25, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
> > reacquires it afterwards. The destination VMA can be replaced during that
> > window.
> > 
> > The existing check compares vma_uffd_ops() before and after the retry, but
> > if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
> > MAP_PRIVATE (or vice versa) the replacement goes undetected.
> > 
> > The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
> > with shmem_alloc_folio() as anonymous and this will cause BUG() when
> > mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
> > 
> > The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
> > the page cache of the original VMA.
> > 
> > Introduce helpers for more comprehensive comparison of VMA state:
> > - vma_snapshot_get() to save the relevant VMA state into a struct
> >   vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
> >   vm_file and pgoff) before dropping the lock
> > - vma_snapshot_changed() to compare the saved state with the state of the
> >   VMA acquired after retaking the locks
> > - vma_snapshot_put() to release vm_file pinning.
> > 
> > Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
> > error handling paths in mfill_copy_folio_retry().
> > 
> > Add vma_uffd_copy_ops() to avoid code duplication when original ops of
> > shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.
> > 
> > Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
> > Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
> > Tested-by: Heechan Kang <gganji11@naver.com>
> > Suggested-by: Peter Xu <peterx@redhat.com>
> > Co-developed-by: David Carlier <devnexen@gmail.com>
> > Signed-off-by: David Carlier <devnexen@gmail.com>
> > Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
> > Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 79 insertions(+), 20 deletions(-)
> > 
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 180bad42fc79..b70b84776a79 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -14,6 +14,8 @@
> >  #include <linux/userfaultfd_k.h>
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/hugetlb.h>
> > +#include <linux/file.h>
> > +#include <linux/cleanup.h>
> >  #include <asm/tlbflush.h>
> >  #include <asm/tlb.h>
> >  #include "internal.h"
> > @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> >  	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
> >  }
> >  
> > +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
> > +{
> > +	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
> > +
> > +	if (!ops)
> > +		return NULL;
> > +
> > +	/*
> > +	 * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
> > +	 * memory. This is an effective ops override, so retry validation must
> > +	 * compare the override result, not just vma->vm_ops->uffd_ops.
> > +	 */
> > +	if (!(vma->vm_flags & VM_SHARED))
> > +		return &anon_uffd_ops;
> > +
> > +	return ops;
> > +}
> > +
> >  static __always_inline
> >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> >  {
> > @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
> >  	return ret;
> >  }
> >  
> > +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
> > +
> > +struct vma_snapshot {
> > +	const struct vm_uffd_ops *copy_ops;
> > +	const struct vm_uffd_ops *ops;
> > +	struct file *file;
> > +	vma_flags_t flags;
> > +	pgoff_t pgoff;
> > +};
> 
> As this is all uffd specific, I wonder whether that should be "struct
> uffd_vma_snapshot"/"vma_uffd_snapshot" etc.

It's local to this file, I wouldn't worry about namespacing until there's
an actual need if it will ever arise.
 
> From a high level, this LGTM.
> 
> I wish we could identify relevant VMA changes more easily. Like, using a per-MM
> sequence counter that we simply increment on any VMA changes.

Do you mean per-VMA?
Per-MM counter would capture unrelated changes, e.g. an masvise() for
unrelated range.

We kinda have an infrastructure to detect VMA changes that affect uffd
operation for non-cooperative, we can also hook on that even if
UFFD_EVENT_* are not requested, but that's way more involved that this VMA
snapshot.
 
> -- 
> Cheers,
> 
> David

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-20 12:53   ` Mike Rapoport
@ 2026-05-20 13:48     ` David Hildenbrand (Arm)
  2026-05-20 14:03       ` David Hildenbrand (Arm)
  2026-05-20 14:12       ` Mike Rapoport
  0 siblings, 2 replies; 16+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20 13:48 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, David Carlier, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On 5/20/26 14:53, Mike Rapoport wrote:
> On Wed, May 20, 2026 at 01:09:04PM +0200, David Hildenbrand (Arm) wrote:
>> On 5/19/26 07:25, Mike Rapoport wrote:
>>> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>>>
>>> mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
>>> reacquires it afterwards. The destination VMA can be replaced during that
>>> window.
>>>
>>> The existing check compares vma_uffd_ops() before and after the retry, but
>>> if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
>>> MAP_PRIVATE (or vice versa) the replacement goes undetected.
>>>
>>> The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
>>> with shmem_alloc_folio() as anonymous and this will cause BUG() when
>>> mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
>>>
>>> The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
>>> the page cache of the original VMA.
>>>
>>> Introduce helpers for more comprehensive comparison of VMA state:
>>> - vma_snapshot_get() to save the relevant VMA state into a struct
>>>   vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
>>>   vm_file and pgoff) before dropping the lock
>>> - vma_snapshot_changed() to compare the saved state with the state of the
>>>   VMA acquired after retaking the locks
>>> - vma_snapshot_put() to release vm_file pinning.
>>>
>>> Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
>>> error handling paths in mfill_copy_folio_retry().
>>>
>>> Add vma_uffd_copy_ops() to avoid code duplication when original ops of
>>> shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.
>>>
>>> Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
>>> Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
>>> Tested-by: Heechan Kang <gganji11@naver.com>
>>> Suggested-by: Peter Xu <peterx@redhat.com>
>>> Co-developed-by: David Carlier <devnexen@gmail.com>
>>> Signed-off-by: David Carlier <devnexen@gmail.com>
>>> Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
>>> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
>>> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>>> ---
>>>  mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
>>>  1 file changed, 79 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>>> index 180bad42fc79..b70b84776a79 100644
>>> --- a/mm/userfaultfd.c
>>> +++ b/mm/userfaultfd.c
>>> @@ -14,6 +14,8 @@
>>>  #include <linux/userfaultfd_k.h>
>>>  #include <linux/mmu_notifier.h>
>>>  #include <linux/hugetlb.h>
>>> +#include <linux/file.h>
>>> +#include <linux/cleanup.h>
>>>  #include <asm/tlbflush.h>
>>>  #include <asm/tlb.h>
>>>  #include "internal.h"
>>> @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
>>>  	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
>>>  }
>>>  
>>> +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
>>> +{
>>> +	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
>>> +
>>> +	if (!ops)
>>> +		return NULL;
>>> +
>>> +	/*
>>> +	 * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
>>> +	 * memory. This is an effective ops override, so retry validation must
>>> +	 * compare the override result, not just vma->vm_ops->uffd_ops.
>>> +	 */
>>> +	if (!(vma->vm_flags & VM_SHARED))
>>> +		return &anon_uffd_ops;
>>> +
>>> +	return ops;
>>> +}
>>> +
>>>  static __always_inline
>>>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
>>>  {
>>> @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
>>>  	return ret;
>>>  }
>>>  
>>> +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
>>> +
>>> +struct vma_snapshot {
>>> +	const struct vm_uffd_ops *copy_ops;
>>> +	const struct vm_uffd_ops *ops;
>>> +	struct file *file;
>>> +	vma_flags_t flags;
>>> +	pgoff_t pgoff;
>>> +};
>>
>> As this is all uffd specific, I wonder whether that should be "struct
>> uffd_vma_snapshot"/"vma_uffd_snapshot" etc.
> 
> It's local to this file, I wouldn't worry about namespacing until there's
> an actual need if it will ever arise.

Indeed, fair enough.

>  
>> From a high level, this LGTM.
>>
>> I wish we could identify relevant VMA changes more easily. Like, using a per-MM
>> sequence counter that we simply increment on any VMA changes.
> 
> Do you mean per-VMA?
> Per-MM counter would capture unrelated changes, e.g. an masvise() for
> unrelated range.

I was thinking about a per-MM thing for simplicity. If there were any changes,
we'd retry (-EAGAIN).

IOW, something like &mm->mm_lock_seq, which we have for per-vma locks already.

Not sure if unrelated changes would really be a problem in practice (another
thread gabbing the mmap_lock in write mode until we serviced the fault).

> 
> We kinda have an infrastructure to detect VMA changes that affect uffd
> operation for non-cooperative, we can also hook on that even if
> UFFD_EVENT_* are not requested, but that's way more involved that this VMA
> snapshot.

Yeah, that sounds too complicated and what you have here is simpler.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-20 13:48     ` David Hildenbrand (Arm)
@ 2026-05-20 14:03       ` David Hildenbrand (Arm)
  2026-05-26 15:23         ` Lorenzo Stoakes
  2026-05-20 14:12       ` Mike Rapoport
  1 sibling, 1 reply; 16+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20 14:03 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, David Carlier, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On 5/20/26 15:48, David Hildenbrand (Arm) wrote:
> On 5/20/26 14:53, Mike Rapoport wrote:
>> On Wed, May 20, 2026 at 01:09:04PM +0200, David Hildenbrand (Arm) wrote:
>>>
>>> As this is all uffd specific, I wonder whether that should be "struct
>>> uffd_vma_snapshot"/"vma_uffd_snapshot" etc.
>>
>> It's local to this file, I wouldn't worry about namespacing until there's
>> an actual need if it will ever arise.
> 
> Indeed, fair enough.
> 
>>  
>>> From a high level, this LGTM.
>>>
>>> I wish we could identify relevant VMA changes more easily. Like, using a per-MM
>>> sequence counter that we simply increment on any VMA changes.
>>
>> Do you mean per-VMA?
>> Per-MM counter would capture unrelated changes, e.g. an masvise() for
>> unrelated range.
> 
> I was thinking about a per-MM thing for simplicity. If there were any changes,
> we'd retry (-EAGAIN).
> 
> IOW, something like &mm->mm_lock_seq, which we have for per-vma locks already.
> 
> Not sure if unrelated changes would really be a problem in practice (another
> thread gabbing the mmap_lock in write mode until we serviced the fault).

Assume something like the following:


diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 180bad42fc79..608bb3c3a613 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -448,9 +448,12 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
 {
        const struct vm_uffd_ops *orig_ops = vma_uffd_ops(state->vma);
        unsigned long src_addr = state->src_addr;
+       unsigned int seq;
        void *kaddr;
        int err;

+       seq = raw_read_seqcount(&mm->mm_lock_seq);
+
        /* retry copying with mm_lock dropped */
        mfill_put_vma(state);

@@ -472,7 +475,7 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
         * (e.g. replaced with a hugetlb mapping), making the caller's
         * ops pointer stale.
         */
-       if (vma_uffd_ops(state->vma) != orig_ops)
+       if (read_seqcount_retry(&mm->mm_lock_seq, seq))
                return -EAGAIN;

        err = mfill_establish_pmd(state);


In case there is concurrent mmap_write lock, we'd retry. But we triggered and
resolved the fault. So the retry would just succeed in copying from the page,
not triggering another fault here.

So I'd not expect this to matter in practice?


We'd have to implement &mm->mm_lock_seq also without per-VMA locks. Which
wouldn't be too bad given that Dave proposed a patch set to just enable per-VMA
locks unconditionally.

As a hotfix, what you have is probably better, but a simplification like this
might later make sense.

-- 
Cheers,

David


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-20 13:48     ` David Hildenbrand (Arm)
  2026-05-20 14:03       ` David Hildenbrand (Arm)
@ 2026-05-20 14:12       ` Mike Rapoport
  2026-05-20 14:38         ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 16+ messages in thread
From: Mike Rapoport @ 2026-05-20 14:12 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, David Carlier, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On Wed, May 20, 2026 at 03:48:02PM +0200, David Hildenbrand (Arm) wrote:
> On 5/20/26 14:53, Mike Rapoport wrote:
> > On Wed, May 20, 2026 at 01:09:04PM +0200, David Hildenbrand (Arm) wrote:
> >> On 5/19/26 07:25, Mike Rapoport wrote:
> > 
> > Do you mean per-VMA?
> > Per-MM counter would capture unrelated changes, e.g. an masvise() for
> > unrelated range.
> 
> I was thinking about a per-MM thing for simplicity. If there were any changes,
> we'd retry (-EAGAIN).
> 
> IOW, something like &mm->mm_lock_seq, which we have for per-vma locks already.
> 
> Not sure if unrelated changes would really be a problem in practice (another
> thread gabbing the mmap_lock in write mode until we serviced the fault).

Let me reiterate:

A thread doing UFFDIO_COPY releases the VMA in mfill_copy_folio_retry(),
re-gets the VMA and checks if the per-MM counter stayed the same.

If another thread makes any change to VMA while mfill_copy_folio_retry()
waits to re-get the VMA, the counter would be incremented by another
thread. mfill_copy_folio_retry() will see the change after mfill_get_vma()
and will bail out with -EAGAIN.

So if another thread does, e.g. MADV_DONTFORK on completely unrelated VMA,
mfill_copy_folio_retry() will return -EAGAIN and I'm not sure we'll not
break the existing userspace by this.

> -- 
> Cheers,
> 
> David

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-20 14:12       ` Mike Rapoport
@ 2026-05-20 14:38         ` David Hildenbrand (Arm)
  2026-05-25 17:12           ` Liam R. Howlett
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20 14:38 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, David Carlier, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On 5/20/26 16:12, Mike Rapoport wrote:
> On Wed, May 20, 2026 at 03:48:02PM +0200, David Hildenbrand (Arm) wrote:
>> On 5/20/26 14:53, Mike Rapoport wrote:
>>>
>>> Do you mean per-VMA?
>>> Per-MM counter would capture unrelated changes, e.g. an masvise() for
>>> unrelated range.
>>
>> I was thinking about a per-MM thing for simplicity. If there were any changes,
>> we'd retry (-EAGAIN).
>>
>> IOW, something like &mm->mm_lock_seq, which we have for per-vma locks already.
>>
>> Not sure if unrelated changes would really be a problem in practice (another
>> thread gabbing the mmap_lock in write mode until we serviced the fault).
> 
> Let me reiterate:
> 
> A thread doing UFFDIO_COPY releases the VMA in mfill_copy_folio_retry(),
> re-gets the VMA and checks if the per-MM counter stayed the same.
> 
> If another thread makes any change to VMA while mfill_copy_folio_retry()
> waits to re-get the VMA, the counter would be incremented by another
> thread. mfill_copy_folio_retry() will see the change after mfill_get_vma()
> and will bail out with -EAGAIN.
> 

Yeah.

> So if another thread does, e.g. MADV_DONTFORK on completely unrelated VMA,
> mfill_copy_folio_retry() will return -EAGAIN and I'm not sure we'll not
> break the existing userspace by this.

Ah, you mean that -EAGAIN would not be handled by user space already even though
documented in the man page ... because it relies on -EAGAIN not happening in
specific situations (i.e., copying a single page).

That's very likely true and we'd have to retry internally, which makes it more
complicated indeed.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-20 14:38         ` David Hildenbrand (Arm)
@ 2026-05-25 17:12           ` Liam R. Howlett
  2026-05-26 12:47             ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 16+ messages in thread
From: Liam R. Howlett @ 2026-05-25 17:12 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Mike Rapoport, Andrew Morton, David Carlier, Heechan Kang,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On 26/05/20 04:38PM, David Hildenbrand (Arm) wrote:
> On 5/20/26 16:12, Mike Rapoport wrote:
> > On Wed, May 20, 2026 at 03:48:02PM +0200, David Hildenbrand (Arm) wrote:
> >> On 5/20/26 14:53, Mike Rapoport wrote:
> >>>
> >>> Do you mean per-VMA?
> >>> Per-MM counter would capture unrelated changes, e.g. an masvise() for
> >>> unrelated range.
> >>
> >> I was thinking about a per-MM thing for simplicity. If there were any changes,
> >> we'd retry (-EAGAIN).
> >>
> >> IOW, something like &mm->mm_lock_seq, which we have for per-vma locks already.
> >>
> >> Not sure if unrelated changes would really be a problem in practice (another
> >> thread gabbing the mmap_lock in write mode until we serviced the fault).
> > 
> > Let me reiterate:
> > 
> > A thread doing UFFDIO_COPY releases the VMA in mfill_copy_folio_retry(),
> > re-gets the VMA and checks if the per-MM counter stayed the same.
> > 
> > If another thread makes any change to VMA while mfill_copy_folio_retry()
> > waits to re-get the VMA, the counter would be incremented by another
> > thread. mfill_copy_folio_retry() will see the change after mfill_get_vma()
> > and will bail out with -EAGAIN.
> > 
> 
> Yeah.

This isn't bulletproof anyways.  The sequence count can wrap.  So, if
someone can replace the vma then cause the sequence counter wrap, then
you can be fooled into thinking it's okay (we had this problem years ago
with the vmacache using a 32 bit counter, iirc).

> 
> > So if another thread does, e.g. MADV_DONTFORK on completely unrelated VMA,
> > mfill_copy_folio_retry() will return -EAGAIN and I'm not sure we'll not
> > break the existing userspace by this.
> 
> Ah, you mean that -EAGAIN would not be handled by user space already even though
> documented in the man page ... because it relies on -EAGAIN not happening in
> specific situations (i.e., copying a single page).
> 
> That's very likely true and we'd have to retry internally, which makes it more
> complicated indeed.
> 
> -- 
> Cheers,
> 
> David


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-25 17:12           ` Liam R. Howlett
@ 2026-05-26 12:47             ` David Hildenbrand (Arm)
  2026-05-26 15:25               ` Lorenzo Stoakes
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-26 12:47 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: Mike Rapoport, Andrew Morton, David Carlier, Heechan Kang,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On 5/25/26 19:12, Liam R. Howlett wrote:
> On 26/05/20 04:38PM, David Hildenbrand (Arm) wrote:
>> On 5/20/26 16:12, Mike Rapoport wrote:
>>>
>>> Let me reiterate:
>>>
>>> A thread doing UFFDIO_COPY releases the VMA in mfill_copy_folio_retry(),
>>> re-gets the VMA and checks if the per-MM counter stayed the same.
>>>
>>> If another thread makes any change to VMA while mfill_copy_folio_retry()
>>> waits to re-get the VMA, the counter would be incremented by another
>>> thread. mfill_copy_folio_retry() will see the change after mfill_get_vma()
>>> and will bail out with -EAGAIN.
>>>
>>
>> Yeah.
> 
> This isn't bulletproof anyways.  The sequence count can wrap.  So, if
> someone can replace the vma then cause the sequence counter wrap, then
> you can be fooled into thinking it's okay (we had this problem years ago
> with the vmacache using a 32 bit counter, iirc).

If you can get it to wrap for such short durations, then how would sequence
counters possibly work in any reasonable context?

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-19  5:25 [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry Mike Rapoport
  2026-05-19  5:36 ` David CARLIER
  2026-05-20 11:09 ` David Hildenbrand (Arm)
@ 2026-05-26 15:12 ` Lorenzo Stoakes
  2026-05-26 17:30   ` Mike Rapoport
  2 siblings, 1 reply; 16+ messages in thread
From: Lorenzo Stoakes @ 2026-05-26 15:12 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, David Carlier, David Hildenbrand, Heechan Kang,
	Liam R. Howlett, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On Tue, May 19, 2026 at 08:25:16AM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
> reacquires it afterwards. The destination VMA can be replaced during that
> window.
>
> The existing check compares vma_uffd_ops() before and after the retry, but
> if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
> MAP_PRIVATE (or vice versa) the replacement goes undetected.
>
> The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
> with shmem_alloc_folio() as anonymous and this will cause BUG() when
> mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
>
> The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
> the page cache of the original VMA.

OK this seems like a sensible subset of things to check - we are looking to
check what might materially impact _this particular operation_, and working
to the absolute minimum we need to check.

>
> Introduce helpers for more comprehensive comparison of VMA state:
> - vma_snapshot_get() to save the relevant VMA state into a struct
>   vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
>   vm_file and pgoff) before dropping the lock
> - vma_snapshot_changed() to compare the saved state with the state of the
>   VMA acquired after retaking the locks
> - vma_snapshot_put() to release vm_file pinning.

I feel like this is possibly a little misleading, it sort of implies that
this is a broader kind of snapshot rather than some uffd-specific thing.

This is more of a naming thing, yes things being defined in
mm/userfaultfd.c tells you it's uffd-specific, but people are easily
confused by stuff like this.

Will comment inline.

>
> Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
> error handling paths in mfill_copy_folio_retry().

That's nice!

>
> Add vma_uffd_copy_ops() to avoid code duplication when original ops of
> shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.

I think this is a bit overly brief. And is it only shmem where this might
happen? Is MAP_PRIVATE hugetlbfs not also possible?

>
> Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
> Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")

Hmm this is the second multiple-fixes patch I've seen, maybe that is OK to
do :) I guess if this one, indivisble change fixes both then that's valid.

> Tested-by: Heechan Kang <gganji11@naver.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Co-developed-by: David Carlier <devnexen@gmail.com>
> Signed-off-by: David Carlier <devnexen@gmail.com>
> Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 79 insertions(+), 20 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 180bad42fc79..b70b84776a79 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -14,6 +14,8 @@
>  #include <linux/userfaultfd_k.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
> +#include <linux/file.h>
> +#include <linux/cleanup.h>
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
>  	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
>  }
>
> +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)

I find this a bit confusing, these are the uffd ops for... copy? Or you're
copying uffd ops?

I'd maybe rename it to vma_uffd_ops_on_copy() to make it clear which it is,
and add a comment like:

/*
 * Determine the ops to use when performing a UFFDIO_COPY operation - this
 * specifically handles the case of a MAP_PRIVATE file-backed mapping,
 * which, upon copy, is CoW'd into anonymous memory.
 */

From the snapshot point of view, I have been pondering if this was the
clearest way of checking things because you are, in effect checking a
specific case

it, but after much back and forth, perhaps it's not too crazy... you are in
most cases doing a duplicate check to the ops check

> +{
> +	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
> +
> +	if (!ops)
> +		return NULL;

Is this correct? The code you removed from mfill_atomic_pte_copy()
unconditionally replaces the ops with &anon_uffd_ops, regardless of whether
any operations were specified, but now not providing ops avoids that?

I mean it's correct if we can never have !ops, and presumably we can't
because __mfill_atomic_pte() unconditionally calls ops->alloc_folio() right
at the top of the logic?

Doesn't uffd only support shmem, anon + hugetlbfs? So why are we adding
!ops checks at all?

Actually claude (:>) tells me that async-WP memory mode (ugh god) can be
used for _anything_:

	/*
	 * If WP is the only mode enabled and context is wp async, allow any
	 * memory type.
	 */
	if (wp_async && (vm_flags == VM_UFFD_WP))
		return true;

But. This function is never called in anything but a copy context?

I'm not sure if you were protecting against a possible future usage? But in
that case surely you'd just want to VM_WARN_ON_ONCE(!ops)?

We should also have a comment describing the situation like:

	/*
	 * Only async WP allows arbitrary file-backed mappings for UFFD,
	 * which is the only case in which ops can be NULL. However,
	 * we are copying here which is not permitted for these
         * mappings, so this will never be the case here.
	 */
	 VM_WARN_ON_ONCE(!ops);

Also in:

static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
{
	if (vma_is_anonymous(vma))
		return &anon_uffd_ops;
	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
}

This is doing a redundant check _and_ making life confusing, as if
!vma->vm_ops is a condition that can be reached there, it can't, as
vma_is_anonymous() is literally a !vma->vm_ops check :)

So surely that should be:

static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
{
	if (vma_is_anonymous(vma))
		return &anon_uffd_ops;

	return vma->vm_ops->uffd_ops;
}

I mean obviously that can be a separate patch, but that's not helped
make things clear here! :)

> +
> +	/*
> +	 * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
> +	 * memory. This is an effective ops override, so retry validation must
> +	 * compare the override result, not just vma->vm_ops->uffd_ops.
> +	 */

The wording here is super confusing. 'This is an effective ops override'
and 'retry validation' and also the 'not just'.

Also it makes zero reference to the invocation of this function from
mfill_atomic_pte_copy() - this function is pulling double-duty so this
comment isn't really quite correct.

I'd maybe pull up the comment from mfill_atomic_pte_copy() (and adapt it to
suit the changes, also mentioning the snapshot use case).

> +	if (!(vma->vm_flags & VM_SHARED))
> +		return &anon_uffd_ops;

This is inconsistent with the rest of the VMA flags API usage, you want:

	if (!vma_test(vma, VMA_SHARED_BIT))
		return &anon_uffd_ops;

> +
> +	return ops;
> +}
> +
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
>  {
> @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
>  	return ret;
>  }
>
> +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)

Hm it's a bit weird to have a VMA flags define outside of mm.h.

But I definitely appreciate you using the new VMA flags stuff :)

Probably just renaming to VMA_UFFD_COPY_STATE_FLAGS or something?

> +

I'm a bit of a broken record I know :) but a comment here would be helpful
describing what this is for.

> +struct vma_snapshot {
> +	const struct vm_uffd_ops *copy_ops;
> +	const struct vm_uffd_ops *ops;
> +	struct file *file;
> +	vma_flags_t flags;
> +	pgoff_t pgoff;
> +};

Yeah I don't love this naming, you're not snapshotting the VMA at all in
any real sense, you're comparing a subset of things specifically for the
UFFDIO_COPY case.

Which is fine - but I think we should just make that clear here, because
perhaps in the future we'll want a generic version of something like this
and this is going to be super confusing if we do...

vma_uffdio_copy_snapshot maybe ?

Maybe ok to leave the functions the same name to avoid a mouthful, the type
name should make it clear what's going on.

> +
> +static void vma_snapshot_get(struct vma_snapshot *s, struct vm_area_struct *vma)

No single letter var please! This isn't plan9 :)

> +{
> +	s->flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> +	s->copy_ops = vma_uffd_copy_ops(vma);
> +	s->ops = vma_uffd_ops(vma);
> +	s->pgoff = vma->vm_pgoff;
> +
> +	if (vma->vm_file)
> +		s->file = get_file(vma->vm_file);
> +}
> +
> +static bool vma_snapshot_changed(struct vma_snapshot *s,

> +				 struct vm_area_struct *vma)

Can the VMA potentially be any arbitrary VMA that might now be remapped in
place? Like anon vs. file and same file and offset?

> +{
> +	vma_flags_t flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> +

You have a comment for each of the other cases, so one here is useful I
think, e.g.:

	/* Have any UFFD flags (missing, WP, minor) changed? */

> +	if (!vma_flags_same_pair(&s->flags, &flags))
> +		return true;

> +
> +	/* VMA type or effective uffd_ops changed while the lock was dropped */

Can the 'effective' (probably better to say copy or use effective
throughout) uffd_ops change though? You already check that VMA_SHARED_BIT
is either set or unset across the lock drop above so is the copy_ops check
even meaningful here?

> +	if (s->ops != vma_uffd_ops(vma) || s->copy_ops != vma_uffd_copy_ops(vma))
> +		return true;

Surely we should also check that the UFFD context matches (accounting for
NULL cases)?

Could we not otherwise have weird situations across UFFD contexts if both
VMAs happen to contain identitical properties otherwise?

This seems to be a canonical part of identifying a UFFD VMA so strange to
not see it here?

> +
> +	/* VMA was anonymous before; changed only if it no longer is */
> +	if (!s->file)
> +		return !vma_is_anonymous(vma);
> +
> +	/* VMA was file backed, but inode or offset has changed */
> +	if (!vma->vm_file || vma->vm_file->f_inode != s->file->f_inode ||
> +	    vma->vm_pgoff != s->pgoff)
> +		return true;

Hm so are we OK with a VMA where the file has been closed, then the same
inode re-opened at the exact same offset? (e.g. vma->vm_file differs)

Would this be problematic with hugetlbfs or shmem?

I guess we pin the inode so the f_inode check _should_ be valid, from that
side.

> +
> +	return false;
> +}
> +
> +static void vma_snapshot_put(struct vma_snapshot *s)
> +{
> +	if (s->file)
> +		fput(s->file);
> +}
> +
> +DEFINE_FREE(snapshot_put, struct vma_snapshot *, if (_T) vma_snapshot_put(_T));
> +
>  static int mfill_copy_folio_retry(struct mfill_state *state,
>  				  struct folio *folio)
>  {
> -	const struct vm_uffd_ops *orig_ops = vma_uffd_ops(state->vma);
> +	struct vma_snapshot s = { 0 };
> +	struct vma_snapshot *p __free(snapshot_put) = &s;

Could we avoid these single letter var names?

>  	unsigned long src_addr = state->src_addr;
>  	void *kaddr;
>  	int err;
>
> +	vma_snapshot_get(&s, state->vma);
> +
>  	/* retry copying with mm_lock dropped */
>  	mfill_put_vma(state);
>
> @@ -467,12 +543,7 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
>  	if (err)
>  		return err;
>
> -	/*
> -	 * The VMA type may have changed while the lock was dropped
> -	 * (e.g. replaced with a hugetlb mapping), making the caller's
> -	 * ops pointer stale.
> -	 */

This comment we can lose given we are explicitly comparing snapshots, but I
think it's not great to explicitly point out that we're doing this because
of the lock being dropped, we should say that somewhere.

> -	if (vma_uffd_ops(state->vma) != orig_ops)
> +	if (vma_snapshot_changed(&s, state->vma))
>  		return -EAGAIN;
>
>  	err = mfill_establish_pmd(state);
> @@ -545,19 +616,7 @@ static int __mfill_atomic_pte(struct mfill_state *state,
>
>  static int mfill_atomic_pte_copy(struct mfill_state *state)
>  {
> -	const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
> -
> -	/*
> -	 * The normal page fault path for a MAP_PRIVATE mapping in a
> -	 * file-backed VMA will invoke the fault, fill the hole in the file and
> -	 * COW it right away. The result generates plain anonymous memory.
> -	 * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll
> -	 * generate anonymous memory directly without actually filling the
> -	 * hole. For the MAP_PRIVATE case the robustness check only happens in
> -	 * the pagetable (to verify it's still none) and not in the page cache.
> -	 */

(As mentioned above) I hate that we lose this comment (or the gist of it at
least), can we please move it up to the copy ops function?

> -	if (!(state->vma->vm_flags & VM_SHARED))
> -		ops = &anon_uffd_ops;

A comment here would maybe be useful. But with a function rename and a
comment at the function itself maybe OK without.

> +	const struct vm_uffd_ops *ops = vma_uffd_copy_ops(state->vma);

Can we please have a:

	/* Only async W/P can have missing ops, and that can't copy. */
	VM_WARN_ON_ONCE(!ops);

Check here to make it absolutely clear you're not running straight into a
NULL pointer deref in __mfill_atomic_pte()? Maybe worth putting a similar
assert there too.

>
>  	return __mfill_atomic_pte(state, ops);
>  }
>
> base-commit: 444fc9435e57157fcf30fc99aee44997f3458641
> --
> 2.53.0
>

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-20 14:03       ` David Hildenbrand (Arm)
@ 2026-05-26 15:23         ` Lorenzo Stoakes
  0 siblings, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2026-05-26 15:23 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Mike Rapoport, Andrew Morton, David Carlier, Heechan Kang,
	Liam R. Howlett, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On Wed, May 20, 2026 at 04:03:03PM +0200, David Hildenbrand (Arm) wrote:
> On 5/20/26 15:48, David Hildenbrand (Arm) wrote:
> > On 5/20/26 14:53, Mike Rapoport wrote:
> >> On Wed, May 20, 2026 at 01:09:04PM +0200, David Hildenbrand (Arm) wrote:
> >>>
> >>> As this is all uffd specific, I wonder whether that should be "struct
> >>> uffd_vma_snapshot"/"vma_uffd_snapshot" etc.
> >>
> >> It's local to this file, I wouldn't worry about namespacing until there's
> >> an actual need if it will ever arise.
> >
> > Indeed, fair enough.
> >
> >>
> >>> From a high level, this LGTM.
> >>>
> >>> I wish we could identify relevant VMA changes more easily. Like, using a per-MM
> >>> sequence counter that we simply increment on any VMA changes.
> >>
> >> Do you mean per-VMA?
> >> Per-MM counter would capture unrelated changes, e.g. an masvise() for
> >> unrelated range.
> >
> > I was thinking about a per-MM thing for simplicity. If there were any changes,
> > we'd retry (-EAGAIN).
> >
> > IOW, something like &mm->mm_lock_seq, which we have for per-vma locks already.
> >
> > Not sure if unrelated changes would really be a problem in practice (another
> > thread gabbing the mmap_lock in write mode until we serviced the fault).
>
> Assume something like the following:
>
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 180bad42fc79..608bb3c3a613 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -448,9 +448,12 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
>  {
>         const struct vm_uffd_ops *orig_ops = vma_uffd_ops(state->vma);
>         unsigned long src_addr = state->src_addr;
> +       unsigned int seq;
>         void *kaddr;
>         int err;
>
> +       seq = raw_read_seqcount(&mm->mm_lock_seq);
> +
>         /* retry copying with mm_lock dropped */
>         mfill_put_vma(state);
>
> @@ -472,7 +475,7 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
>          * (e.g. replaced with a hugetlb mapping), making the caller's
>          * ops pointer stale.
>          */
> -       if (vma_uffd_ops(state->vma) != orig_ops)
> +       if (read_seqcount_retry(&mm->mm_lock_seq, seq))
>                 return -EAGAIN;
>
>         err = mfill_establish_pmd(state);
>
>
> In case there is concurrent mmap_write lock, we'd retry. But we triggered and
> resolved the fault. So the retry would just succeed in copying from the page,
> not triggering another fault here.
>
> So I'd not expect this to matter in practice?
>
>
> We'd have to implement &mm->mm_lock_seq also without per-VMA locks. Which
> wouldn't be too bad given that Dave proposed a patch set to just enable per-VMA
> locks unconditionally.

That set unfortunately is broken for its ostensible purpose AFAICT, but the
idea of making per-VMA locks permanently available is good and I hope that
we land that at some point.

>
> As a hotfix, what you have is probably better, but a simplification like this
> might later make sense.

Hmmm, but the proposed logic absolutely tolerates changes to the VMA that
does not alter the fundamental properties being checked for, so this is an
entirely different check no?

I don't think we can both say that we only care about properties X, Y, and
Z changing and simultaneously tracking whether a VMA write lock was applied
since we dropped the lock?

Also a VMA write lock having been obtained doesn't mean the VMA was necessarily
changed?

And then couldn't -EAGAIN cause spurious write locks to abort the operation
altogether?

Given reacquisition of the lock will fall back to waiting on the
mmap_read_lock() if the VMA was write locked, that makes a race very
possible.

I mean maybe we want to tolerate all of the above and just move to a safer
world where we conservatively consider any VMA write lock acquisition to be
one in which we abort the operation, which I think would be nicer...

>
> --
> Cheers,
>
> David

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-26 12:47             ` David Hildenbrand (Arm)
@ 2026-05-26 15:25               ` Lorenzo Stoakes
  2026-05-26 19:06                 ` Liam R. Howlett
  0 siblings, 1 reply; 16+ messages in thread
From: Lorenzo Stoakes @ 2026-05-26 15:25 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Liam R. Howlett, Mike Rapoport, Andrew Morton, David Carlier,
	Heechan Kang, Michael Bommarito, Peter Xu, linux-mm, linux-kernel

On Tue, May 26, 2026 at 02:47:45PM +0200, David Hildenbrand (Arm) wrote:
> On 5/25/26 19:12, Liam R. Howlett wrote:
> > On 26/05/20 04:38PM, David Hildenbrand (Arm) wrote:
> >> On 5/20/26 16:12, Mike Rapoport wrote:
> >>>
> >>> Let me reiterate:
> >>>
> >>> A thread doing UFFDIO_COPY releases the VMA in mfill_copy_folio_retry(),
> >>> re-gets the VMA and checks if the per-MM counter stayed the same.
> >>>
> >>> If another thread makes any change to VMA while mfill_copy_folio_retry()
> >>> waits to re-get the VMA, the counter would be incremented by another
> >>> thread. mfill_copy_folio_retry() will see the change after mfill_get_vma()
> >>> and will bail out with -EAGAIN.
> >>>
> >>
> >> Yeah.
> >
> > This isn't bulletproof anyways.  The sequence count can wrap.  So, if
> > someone can replace the vma then cause the sequence counter wrap, then
> > you can be fooled into thinking it's okay (we had this problem years ago
> > with the vmacache using a 32 bit counter, iirc).
>
> If you can get it to wrap for such short durations, then how would sequence
> counters possibly work in any reasonable context?

Surely even for a 32-bit value, we can be pretty confident we're not going to
see a wrap that matters (the seqnum will != the prev seqnum unless 4 billion VMA
write locks were obtained)?

I may be missing something though!

>
> --
> Cheers,
>
> David

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-26 15:12 ` Lorenzo Stoakes
@ 2026-05-26 17:30   ` Mike Rapoport
  0 siblings, 0 replies; 16+ messages in thread
From: Mike Rapoport @ 2026-05-26 17:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, David Carlier, David Hildenbrand, Heechan Kang,
	Liam R. Howlett, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On Tue, May 26, 2026 at 04:12:56PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 19, 2026 at 08:25:16AM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
> > reacquires it afterwards. The destination VMA can be replaced during that
> > window.
> >
> > The existing check compares vma_uffd_ops() before and after the retry, but
> > if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
> > MAP_PRIVATE (or vice versa) the replacement goes undetected.
> >
> > The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
> > with shmem_alloc_folio() as anonymous and this will cause BUG() when
> > mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
> >
> > The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
> > the page cache of the original VMA.
> 
> OK this seems like a sensible subset of things to check - we are looking to
> check what might materially impact _this particular operation_, and working
> to the absolute minimum we need to check.
> 
> >
> > Introduce helpers for more comprehensive comparison of VMA state:
> > - vma_snapshot_get() to save the relevant VMA state into a struct
> >   vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
> >   vm_file and pgoff) before dropping the lock
> > - vma_snapshot_changed() to compare the saved state with the state of the
> >   VMA acquired after retaking the locks
> > - vma_snapshot_put() to release vm_file pinning.
> 
> I feel like this is possibly a little misleading, it sort of implies that
> this is a broader kind of snapshot rather than some uffd-specific thing.
> 
> This is more of a naming thing, yes things being defined in
> mm/userfaultfd.c tells you it's uffd-specific, but people are easily
> confused by stuff like this.
> 
> Will comment inline.

tl;dr:
* will update comments
* will rename vma_uffd_copy_ops() and replace NULL checks with VM_WARN
* will change some of single letter variable names.
 
> >
> > Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
> > error handling paths in mfill_copy_folio_retry().
> 
> That's nice!
> 
> >
> > Add vma_uffd_copy_ops() to avoid code duplication when original ops of
> > shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.
> 
> I think this is a bit overly brief. And is it only shmem where this might
> happen? Is MAP_PRIVATE hugetlbfs not also possible?

No, hugetlb is not applicabe here, it was not changed to use
mfill_copy_folio_retry().
 
> >
> > Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
> > Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
> 
> Hmm this is the second multiple-fixes patch I've seen, maybe that is OK to
> do :) I guess if this one, indivisble change fixes both then that's valid.
> 
> > Tested-by: Heechan Kang <gganji11@naver.com>
> > Suggested-by: Peter Xu <peterx@redhat.com>
> > Co-developed-by: David Carlier <devnexen@gmail.com>
> > Signed-off-by: David Carlier <devnexen@gmail.com>
> > Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
> > Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 79 insertions(+), 20 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 180bad42fc79..b70b84776a79 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -14,6 +14,8 @@
> >  #include <linux/userfaultfd_k.h>
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/hugetlb.h>
> > +#include <linux/file.h>
> > +#include <linux/cleanup.h>
> >  #include <asm/tlbflush.h>
> >  #include <asm/tlb.h>
> >  #include "internal.h"
> > @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> >  	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
> >  }
> >
> > +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
> 
> I find this a bit confusing, these are the uffd ops for... copy? Or you're
> copying uffd ops?
> 
> I'd maybe rename it to vma_uffd_ops_on_copy() to make it clear which it is,
> and add a comment like:

vma_uffd_ops_for_copy()?
 
> /*
>  * Determine the ops to use when performing a UFFDIO_COPY operation - this
>  * specifically handles the case of a MAP_PRIVATE file-backed mapping,
>  * which, upon copy, is CoW'd into anonymous memory.
>  */

I'll pick the large comment that got dropped and stick it here.
 
> From the snapshot point of view, I have been pondering if this was the
> clearest way of checking things because you are, in effect checking a
> specific case
> 
> it, but after much back and forth, perhaps it's not too crazy... you are in
> most cases doing a duplicate check to the ops check
> 
> 
> 
> > +{
> > +	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
> > +
> > +	if (!ops)
> > +		return NULL;
> 
> Is this correct? The code you removed from mfill_atomic_pte_copy()
> unconditionally replaces the ops with &anon_uffd_ops, regardless of whether
> any operations were specified, but now not providing ops avoids that?
> 
> I mean it's correct if we can never have !ops, and presumably we can't
> because __mfill_atomic_pte() unconditionally calls ops->alloc_folio() right
> at the top of the logic?
> 
> Doesn't uffd only support shmem, anon + hugetlbfs? So why are we adding
> !ops checks at all?
> 
> Actually claude (:>) tells me that async-WP memory mode (ugh god) can be
> used for _anything_:
> 
> 	/*
> 	 * If WP is the only mode enabled and context is wp async, allow any
> 	 * memory type.
> 	 */
> 	if (wp_async && (vm_flags == VM_UFFD_WP))
> 		return true;
> 
> But. This function is never called in anything but a copy context?

wp_async has nothing to do with copy (or zeropage).
 
> I'm not sure if you were protecting against a possible future usage? But in
> that case surely you'd just want to VM_WARN_ON_ONCE(!ops)?
> 
> We should also have a comment describing the situation like:
> 
> 	/*
> 	 * Only async WP allows arbitrary file-backed mappings for UFFD,
> 	 * which is the only case in which ops can be NULL. However,
> 	 * we are copying here which is not permitted for these
>          * mappings, so this will never be the case here.
> 	 */

I can live with one sentence tops here ;-)

> 	 VM_WARN_ON_ONCE(!ops);
> 
> Also in:
> 
> static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> {
> 	if (vma_is_anonymous(vma))
> 		return &anon_uffd_ops;
> 	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
> }
> 
> This is doing a redundant check _and_ making life confusing, as if
> !vma->vm_ops is a condition that can be reached there, it can't, as
> vma_is_anonymous() is literally a !vma->vm_ops check :)
> 
> So surely that should be:
> 
> static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> {
> 	if (vma_is_anonymous(vma))
> 		return &anon_uffd_ops;
> 
> 	return vma->vm_ops->uffd_ops;
> }
> 
> I mean obviously that can be a separate patch, but that's not helped
> make things clear here! :)
 
You are welcome to send it ;-)

> > +
> > +	/*
> > +	 * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
> > +	 * memory. This is an effective ops override, so retry validation must
> > +	 * compare the override result, not just vma->vm_ops->uffd_ops.
> > +	 */
> 
> The wording here is super confusing. 'This is an effective ops override'
> and 'retry validation' and also the 'not just'.
> 
> Also it makes zero reference to the invocation of this function from
> mfill_atomic_pte_copy() - this function is pulling double-duty so this
> comment isn't really quite correct.
> 
> I'd maybe pull up the comment from mfill_atomic_pte_copy() (and adapt it to
> suit the changes, also mentioning the snapshot use case).
> 
> > +	if (!(vma->vm_flags & VM_SHARED))
> > +		return &anon_uffd_ops;
> 
> This is inconsistent with the rest of the VMA flags API usage, you want:

This is consistent with the rest of VMA usage in uffd, we can convert it
all in one go if you'd like.
 
> 	if (!vma_test(vma, VMA_SHARED_BIT))
> 		return &anon_uffd_ops;
> 
> > +
> > +	return ops;
> > +}
> > +
> >  static __always_inline
> >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> >  {
> > @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
> >  	return ret;
> >  }
> >
> > +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
> 
> Hm it's a bit weird to have a VMA flags define outside of mm.h.
> 
> But I definitely appreciate you using the new VMA flags stuff :)
> 
> Probably just renaming to VMA_UFFD_COPY_STATE_FLAGS or something?
>
> > +
> 
> I'm a bit of a broken record I know :) but a comment here would be helpful
> describing what this is for.
> 
> > +struct vma_snapshot {
> > +	const struct vm_uffd_ops *copy_ops;
> > +	const struct vm_uffd_ops *ops;
> > +	struct file *file;
> > +	vma_flags_t flags;
> > +	pgoff_t pgoff;
> > +};
> 
> Yeah I don't love this naming, you're not snapshotting the VMA at all in
> any real sense, you're comparing a subset of things specifically for the
> UFFDIO_COPY case.
> 
> Which is fine - but I think we should just make that clear here, because
> perhaps in the future we'll want a generic version of something like this
> and this is going to be super confusing if we do...

Let's deal with it in the future then. This is a local structure, used by a
few local functions.
 
> vma_uffdio_copy_snapshot maybe ?
> 
> Maybe ok to leave the functions the same name to avoid a mouthful, the type
> name should make it clear what's going on.
> 
> > +
> > +static void vma_snapshot_get(struct vma_snapshot *s, struct vm_area_struct *vma)
> 
> No single letter var please! This isn't plan9 :)

"C is a Spartan language, and your naming conventions should follow suit"
https://docs.kernel.org/process/coding-style.html#naming

Here it's perfectly understandable what 's' means.
 
> > +{
> > +	s->flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> > +	s->copy_ops = vma_uffd_copy_ops(vma);
> > +	s->ops = vma_uffd_ops(vma);
> > +	s->pgoff = vma->vm_pgoff;
> > +
> > +	if (vma->vm_file)
> > +		s->file = get_file(vma->vm_file);
> > +}
> > +
> > +static bool vma_snapshot_changed(struct vma_snapshot *s,
> 
> > +				 struct vm_area_struct *vma)
> 
> Can the VMA potentially be any arbitrary VMA that might now be remapped in
> place? Like anon vs. file and same file and offset?
> 
> 
> > +{
> > +	vma_flags_t flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> > +
> 
> You have a comment for each of the other cases, so one here is useful I
> think, e.g.:
> 
> 	/* Have any UFFD flags (missing, WP, minor) changed? */
> 
> > +	if (!vma_flags_same_pair(&s->flags, &flags))
> > +		return true;
> 
> > +
> > +	/* VMA type or effective uffd_ops changed while the lock was dropped */
> 
> Can the 'effective' (probably better to say copy or use effective
> throughout) uffd_ops change though? You already check that VMA_SHARED_BIT
> is either set or unset across the lock drop above so is the copy_ops check
> even meaningful here?

Yes, it can change if a shmem VMA changes from SHARED to PRIVATE or vice
versa.
 
> > +	if (s->ops != vma_uffd_ops(vma) || s->copy_ops != vma_uffd_copy_ops(vma))
> > +		return true;
> Surely we should also check that the UFFD context matches (accounting for
> NULL cases)?

It's checked in mfill_get_vma().
 
> Could we not otherwise have weird situations across UFFD contexts if both
> VMAs happen to contain identitical properties otherwise?
> 
> This seems to be a canonical part of identifying a UFFD VMA so strange to
> not see it here?
> 
> > +
> > +	/* VMA was anonymous before; changed only if it no longer is */
> > +	if (!s->file)
> > +		return !vma_is_anonymous(vma);
> > +
> > +	/* VMA was file backed, but inode or offset has changed */
> > +	if (!vma->vm_file || vma->vm_file->f_inode != s->file->f_inode ||
> > +	    vma->vm_pgoff != s->pgoff)
> > +		return true;
> 
> Hm so are we OK with a VMA where the file has been closed, then the same
> inode re-opened at the exact same offset? (e.g. vma->vm_file differs)

The copy will go to exactly the same place in that case.
 
> Would this be problematic with hugetlbfs or shmem?
> 
> I guess we pin the inode so the f_inode check _should_ be valid, from that
> side.
> 
> > +
> > +	return false;
> > +}
> > +
> > +static void vma_snapshot_put(struct vma_snapshot *s)
> > +{
> > +	if (s->file)
> > +		fput(s->file);
> > +}
> > +
> > +DEFINE_FREE(snapshot_put, struct vma_snapshot *, if (_T) vma_snapshot_put(_T));
> > +
> >  static int mfill_copy_folio_retry(struct mfill_state *state,
> >  				  struct folio *folio)
> >  {
> > -	const struct vm_uffd_ops *orig_ops = vma_uffd_ops(state->vma);
> > +	struct vma_snapshot s = { 0 };
> > +	struct vma_snapshot *p __free(snapshot_put) = &s;
> 
> Could we avoid these single letter var names?

You suggest to call it a_dummy_pointer_to_allow_use_of_free_cleanup? ;-P
> 
> >  	unsigned long src_addr = state->src_addr;
> >  	void *kaddr;
> >  	int err;
> >
> > +	vma_snapshot_get(&s, state->vma);
> > +
> >  	/* retry copying with mm_lock dropped */
> >  	mfill_put_vma(state);
> >
> > @@ -467,12 +543,7 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
> >  	if (err)
> >  		return err;
> >
> > -	/*
> > -	 * The VMA type may have changed while the lock was dropped
> > -	 * (e.g. replaced with a hugetlb mapping), making the caller's
> > -	 * ops pointer stale.
> > -	 */
> 
> This comment we can lose given we are explicitly comparing snapshots, but I
> think it's not great to explicitly point out that we're doing this because
> of the lock being dropped, we should say that somewhere.
> 
> > -	if (vma_uffd_ops(state->vma) != orig_ops)
> > +	if (vma_snapshot_changed(&s, state->vma))
> >  		return -EAGAIN;
> >
> >  	err = mfill_establish_pmd(state);
> > @@ -545,19 +616,7 @@ static int __mfill_atomic_pte(struct mfill_state *state,
> >
> >  static int mfill_atomic_pte_copy(struct mfill_state *state)
> >  {
> > -	const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
> > -
> > -	/*
> > -	 * The normal page fault path for a MAP_PRIVATE mapping in a
> > -	 * file-backed VMA will invoke the fault, fill the hole in the file and
> > -	 * COW it right away. The result generates plain anonymous memory.
> > -	 * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll
> > -	 * generate anonymous memory directly without actually filling the
> > -	 * hole. For the MAP_PRIVATE case the robustness check only happens in
> > -	 * the pagetable (to verify it's still none) and not in the page cache.
> > -	 */
> 
> (As mentioned above) I hate that we lose this comment (or the gist of it at
> least), can we please move it up to the copy ops function?

Will move it before the copy_ops()
 
> > -	if (!(state->vma->vm_flags & VM_SHARED))
> > -		ops = &anon_uffd_ops;
> 
> A comment here would maybe be useful. But with a function rename and a
> comment at the function itself maybe OK without.
> 
> > +	const struct vm_uffd_ops *ops = vma_uffd_copy_ops(state->vma);
> 
> Can we please have a:
> 
> 	/* Only async W/P can have missing ops, and that can't copy. */
> 	VM_WARN_ON_ONCE(!ops);

It would be the third time :-D
 
> Check here to make it absolutely clear you're not running straight into a
> NULL pointer deref in __mfill_atomic_pte()? Maybe worth putting a similar
> assert there too.
> 
> >
> >  	return __mfill_atomic_pte(state, ops);
> >  }
> >
> > base-commit: 444fc9435e57157fcf30fc99aee44997f3458641
> > --
> > 2.53.0
> >
> 
> Thanks, Lorenzo

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-26 15:25               ` Lorenzo Stoakes
@ 2026-05-26 19:06                 ` Liam R. Howlett
  0 siblings, 0 replies; 16+ messages in thread
From: Liam R. Howlett @ 2026-05-26 19:06 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand (Arm), Mike Rapoport, Andrew Morton,
	David Carlier, Heechan Kang, Michael Bommarito, Peter Xu,
	linux-mm, linux-kernel

On 26/05/26 04:25PM, Lorenzo Stoakes wrote:
> On Tue, May 26, 2026 at 02:47:45PM +0200, David Hildenbrand (Arm) wrote:
> > On 5/25/26 19:12, Liam R. Howlett wrote:
> > > On 26/05/20 04:38PM, David Hildenbrand (Arm) wrote:
> > >> On 5/20/26 16:12, Mike Rapoport wrote:
> > >>>
> > >>> Let me reiterate:
> > >>>
> > >>> A thread doing UFFDIO_COPY releases the VMA in mfill_copy_folio_retry(),
> > >>> re-gets the VMA and checks if the per-MM counter stayed the same.
> > >>>
> > >>> If another thread makes any change to VMA while mfill_copy_folio_retry()
> > >>> waits to re-get the VMA, the counter would be incremented by another
> > >>> thread. mfill_copy_folio_retry() will see the change after mfill_get_vma()
> > >>> and will bail out with -EAGAIN.
> > >>>
> > >>
> > >> Yeah.
> > >
> > > This isn't bulletproof anyways.  The sequence count can wrap.  So, if
> > > someone can replace the vma then cause the sequence counter wrap, then
> > > you can be fooled into thinking it's okay (we had this problem years ago
> > > with the vmacache using a 32 bit counter, iirc).
> >
> > If you can get it to wrap for such short durations, then how would sequence
> > counters possibly work in any reasonable context?

I think the answer is because 1. it may not be a short duration once
we're asking userspace to do things and 2. because we aren't depending
on the sequence number alone, usually.

I don't think we should trust it if we're dropping other
locks/guarantees and waiting on userspace.

> 
> Surely even for a 32-bit value, we can be pretty confident we're not going to
> see a wrap that matters (the seqnum will != the prev seqnum unless 4 billion VMA
> write locks were obtained)?

The vmacache had a similar issue and it was reported as a potential
security problem, which caused it to switch to a 64bit value. [1]  I
mean, it took 40 minutes, but it was possible..

Note that the vmacache was a bit different in that it was invalidating
the cache when the number wrapped.

And the vmacache was mm wide, while the per-vma sequence is mm and
per-vma specific, so it's very much harder to hit a false positive.
But, I guess if you can register a bunch of UFFDs and fork many
processes, you can increase your odds?

If you recall, we did try a 64 bit early on, but switched to a 32bit to
avoid a performance regression.

I think we should take a really hard look at the locking mechanics here
if we're going to try and depend on the sequence number before trusting
it as a correctness guarantee, especially when involving uffd copy.

[1]. https://www.exploit-db.com/exploits/45497

Thanks,
Liam

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2026-05-26 19:06 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-19  5:25 [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry Mike Rapoport
2026-05-19  5:36 ` David CARLIER
2026-05-20 12:40   ` Mike Rapoport
2026-05-20 11:09 ` David Hildenbrand (Arm)
2026-05-20 12:53   ` Mike Rapoport
2026-05-20 13:48     ` David Hildenbrand (Arm)
2026-05-20 14:03       ` David Hildenbrand (Arm)
2026-05-26 15:23         ` Lorenzo Stoakes
2026-05-20 14:12       ` Mike Rapoport
2026-05-20 14:38         ` David Hildenbrand (Arm)
2026-05-25 17:12           ` Liam R. Howlett
2026-05-26 12:47             ` David Hildenbrand (Arm)
2026-05-26 15:25               ` Lorenzo Stoakes
2026-05-26 19:06                 ` Liam R. Howlett
2026-05-26 15:12 ` Lorenzo Stoakes
2026-05-26 17:30   ` Mike Rapoport

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox