Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
@ 2026-05-19  5:25 Mike Rapoport
  2026-05-19  5:36 ` David CARLIER
  2026-05-20 11:09 ` David Hildenbrand (Arm)
  0 siblings, 2 replies; 9+ messages in thread
From: Mike Rapoport @ 2026-05-19  5:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Carlier, David Hildenbrand, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Mike Rapoport, Peter Xu,
	linux-mm, linux-kernel

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
reacquires it afterwards. The destination VMA can be replaced during that
window.

The existing check compares vma_uffd_ops() before and after the retry, but
if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
MAP_PRIVATE (or vice versa) the replacement goes undetected.

The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
with shmem_alloc_folio() as anonymous and this will cause BUG() when
mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().

The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
the page cache of the original VMA.

Introduce helpers for more comprehensive comparison of VMA state:
- vma_snapshot_get() to save the relevant VMA state into a struct
  vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
  vm_file and pgoff) before dropping the lock
- vma_snapshot_changed() to compare the saved state with the state of the
  VMA acquired after retaking the locks
- vma_snapshot_put() to release vm_file pinning.

Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
error handling paths in mfill_copy_folio_retry().

Add vma_uffd_copy_ops() to avoid code duplication when original ops of
shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.

Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
Tested-by: Heechan Kang <gganji11@naver.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Co-developed-by: David Carlier <devnexen@gmail.com>
Signed-off-by: David Carlier <devnexen@gmail.com>
Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 79 insertions(+), 20 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 180bad42fc79..b70b84776a79 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -14,6 +14,8 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/mmu_notifier.h>
 #include <linux/hugetlb.h>
+#include <linux/file.h>
+#include <linux/cleanup.h>
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
@@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
 	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
 }
 
+static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
+{
+	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
+
+	if (!ops)
+		return NULL;
+
+	/*
+	 * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
+	 * memory. This is an effective ops override, so retry validation must
+	 * compare the override result, not just vma->vm_ops->uffd_ops.
+	 */
+	if (!(vma->vm_flags & VM_SHARED))
+		return &anon_uffd_ops;
+
+	return ops;
+}
+
 static __always_inline
 bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
 {
@@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
 	return ret;
 }
 
+#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
+
+struct vma_snapshot {
+	const struct vm_uffd_ops *copy_ops;
+	const struct vm_uffd_ops *ops;
+	struct file *file;
+	vma_flags_t flags;
+	pgoff_t pgoff;
+};
+
+static void vma_snapshot_get(struct vma_snapshot *s, struct vm_area_struct *vma)
+{
+	s->flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
+	s->copy_ops = vma_uffd_copy_ops(vma);
+	s->ops = vma_uffd_ops(vma);
+	s->pgoff = vma->vm_pgoff;
+
+	if (vma->vm_file)
+		s->file = get_file(vma->vm_file);
+}
+
+static bool vma_snapshot_changed(struct vma_snapshot *s,
+				 struct vm_area_struct *vma)
+{
+	vma_flags_t flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
+
+	if (!vma_flags_same_pair(&s->flags, &flags))
+		return true;
+
+	/* VMA type or effective uffd_ops changed while the lock was dropped */
+	if (s->ops != vma_uffd_ops(vma) || s->copy_ops != vma_uffd_copy_ops(vma))
+		return true;
+
+	/* VMA was anonymous before; changed only if it no longer is */
+	if (!s->file)
+		return !vma_is_anonymous(vma);
+
+	/* VMA was file backed, but inode or offset has changed */
+	if (!vma->vm_file || vma->vm_file->f_inode != s->file->f_inode ||
+	    vma->vm_pgoff != s->pgoff)
+		return true;
+
+	return false;
+}
+
+static void vma_snapshot_put(struct vma_snapshot *s)
+{
+	if (s->file)
+		fput(s->file);
+}
+
+DEFINE_FREE(snapshot_put, struct vma_snapshot *, if (_T) vma_snapshot_put(_T));
+
 static int mfill_copy_folio_retry(struct mfill_state *state,
 				  struct folio *folio)
 {
-	const struct vm_uffd_ops *orig_ops = vma_uffd_ops(state->vma);
+	struct vma_snapshot s = { 0 };
+	struct vma_snapshot *p __free(snapshot_put) = &s;
 	unsigned long src_addr = state->src_addr;
 	void *kaddr;
 	int err;
 
+	vma_snapshot_get(&s, state->vma);
+
 	/* retry copying with mm_lock dropped */
 	mfill_put_vma(state);
 
@@ -467,12 +543,7 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
 	if (err)
 		return err;
 
-	/*
-	 * The VMA type may have changed while the lock was dropped
-	 * (e.g. replaced with a hugetlb mapping), making the caller's
-	 * ops pointer stale.
-	 */
-	if (vma_uffd_ops(state->vma) != orig_ops)
+	if (vma_snapshot_changed(&s, state->vma))
 		return -EAGAIN;
 
 	err = mfill_establish_pmd(state);
@@ -545,19 +616,7 @@ static int __mfill_atomic_pte(struct mfill_state *state,
 
 static int mfill_atomic_pte_copy(struct mfill_state *state)
 {
-	const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
-
-	/*
-	 * The normal page fault path for a MAP_PRIVATE mapping in a
-	 * file-backed VMA will invoke the fault, fill the hole in the file and
-	 * COW it right away. The result generates plain anonymous memory.
-	 * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll
-	 * generate anonymous memory directly without actually filling the
-	 * hole. For the MAP_PRIVATE case the robustness check only happens in
-	 * the pagetable (to verify it's still none) and not in the page cache.
-	 */
-	if (!(state->vma->vm_flags & VM_SHARED))
-		ops = &anon_uffd_ops;
+	const struct vm_uffd_ops *ops = vma_uffd_copy_ops(state->vma);
 
 	return __mfill_atomic_pte(state, ops);
 }

base-commit: 444fc9435e57157fcf30fc99aee44997f3458641
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-19  5:25 [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry Mike Rapoport
@ 2026-05-19  5:36 ` David CARLIER
  2026-05-20 12:40   ` Mike Rapoport
  2026-05-20 11:09 ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 9+ messages in thread
From: David CARLIER @ 2026-05-19  5:36 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, David Hildenbrand, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

Hi Mike,

On Tue, 19 May 2026 at 06:25, Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
> reacquires it afterwards. The destination VMA can be replaced during that
> window.
>
> The existing check compares vma_uffd_ops() before and after the retry, but
> if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
> MAP_PRIVATE (or vice versa) the replacement goes undetected.
>
> The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
> with shmem_alloc_folio() as anonymous and this will cause BUG() when
> mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
>
> The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
> the page cache of the original VMA.
>
> Introduce helpers for more comprehensive comparison of VMA state:
> - vma_snapshot_get() to save the relevant VMA state into a struct
>   vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
>   vm_file and pgoff) before dropping the lock
> - vma_snapshot_changed() to compare the saved state with the state of the
>   VMA acquired after retaking the locks
> - vma_snapshot_put() to release vm_file pinning.
>
> Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
> error handling paths in mfill_copy_folio_retry().
>
> Add vma_uffd_copy_ops() to avoid code duplication when original ops of
> shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.
>
> Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
> Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
> Tested-by: Heechan Kang <gganji11@naver.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Co-developed-by: David Carlier <devnexen@gmail.com>
> Signed-off-by: David Carlier <devnexen@gmail.com>
> Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 79 insertions(+), 20 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 180bad42fc79..b70b84776a79 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -14,6 +14,8 @@
>  #include <linux/userfaultfd_k.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
> +#include <linux/file.h>
> +#include <linux/cleanup.h>
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
>         return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
>  }
>
> +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)


My only 2 cent, I would name it vma_uffd_effective_copy_ops() instead or
a comment to highlight it is about "UFFDIO_COPY into a MAP_PRIVATE file-backed"


> +{
> +       const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
> +
> +       if (!ops)
> +               return NULL;
> +
> +       /*
> +        * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
> +        * memory. This is an effective ops override, so retry validation must
> +        * compare the override result, not just vma->vm_ops->uffd_ops.
> +        */
> +       if (!(vma->vm_flags & VM_SHARED))
> +               return &anon_uffd_ops;
> +
> +       return ops;
> +}
> +
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
>  {
> @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
>         return ret;
>  }
>
> +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
> +
> +struct vma_snapshot {
> +       const struct vm_uffd_ops *copy_ops;
> +       const struct vm_uffd_ops *ops;
> +       struct file *file;
> +       vma_flags_t flags;
> +       pgoff_t pgoff;
> +};
> +
> +static void vma_snapshot_get(struct vma_snapshot *s, struct vm_area_struct *vma)
> +{
> +       s->flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> +       s->copy_ops = vma_uffd_copy_ops(vma);
> +       s->ops = vma_uffd_ops(vma);
> +       s->pgoff = vma->vm_pgoff;
> +
> +       if (vma->vm_file)
> +               s->file = get_file(vma->vm_file);
> +}
> +
> +static bool vma_snapshot_changed(struct vma_snapshot *s,
> +                                struct vm_area_struct *vma)
> +{
> +       vma_flags_t flags = vma_flags_and_mask(&vma->flags, VMA_SNAPSHOT_FLAGS);
> +
> +       if (!vma_flags_same_pair(&s->flags, &flags))
> +               return true;
> +
> +       /* VMA type or effective uffd_ops changed while the lock was dropped */
> +       if (s->ops != vma_uffd_ops(vma) || s->copy_ops != vma_uffd_copy_ops(vma))
> +               return true;
> +
> +       /* VMA was anonymous before; changed only if it no longer is */
> +       if (!s->file)
> +               return !vma_is_anonymous(vma);
> +
> +       /* VMA was file backed, but inode or offset has changed */
> +       if (!vma->vm_file || vma->vm_file->f_inode != s->file->f_inode ||
> +           vma->vm_pgoff != s->pgoff)
> +               return true;
> +
> +       return false;
> +}
> +
> +static void vma_snapshot_put(struct vma_snapshot *s)
> +{
> +       if (s->file)
> +               fput(s->file);
> +}
> +
> +DEFINE_FREE(snapshot_put, struct vma_snapshot *, if (_T) vma_snapshot_put(_T));
> +
>  static int mfill_copy_folio_retry(struct mfill_state *state,
>                                   struct folio *folio)
>  {
> -       const struct vm_uffd_ops *orig_ops = vma_uffd_ops(state->vma);
> +       struct vma_snapshot s = { 0 };
> +       struct vma_snapshot *p __free(snapshot_put) = &s;
>         unsigned long src_addr = state->src_addr;
>         void *kaddr;
>         int err;
>
> +       vma_snapshot_get(&s, state->vma);
> +
>         /* retry copying with mm_lock dropped */
>         mfill_put_vma(state);
>
> @@ -467,12 +543,7 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
>         if (err)
>                 return err;
>
> -       /*
> -        * The VMA type may have changed while the lock was dropped
> -        * (e.g. replaced with a hugetlb mapping), making the caller's
> -        * ops pointer stale.
> -        */
> -       if (vma_uffd_ops(state->vma) != orig_ops)
> +       if (vma_snapshot_changed(&s, state->vma))
>                 return -EAGAIN;
>
>         err = mfill_establish_pmd(state);
> @@ -545,19 +616,7 @@ static int __mfill_atomic_pte(struct mfill_state *state,
>
>  static int mfill_atomic_pte_copy(struct mfill_state *state)
>  {
> -       const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
> -
> -       /*
> -        * The normal page fault path for a MAP_PRIVATE mapping in a
> -        * file-backed VMA will invoke the fault, fill the hole in the file and
> -        * COW it right away. The result generates plain anonymous memory.
> -        * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll
> -        * generate anonymous memory directly without actually filling the
> -        * hole. For the MAP_PRIVATE case the robustness check only happens in
> -        * the pagetable (to verify it's still none) and not in the page cache.
> -        */
> -       if (!(state->vma->vm_flags & VM_SHARED))
> -               ops = &anon_uffd_ops;
> +       const struct vm_uffd_ops *ops = vma_uffd_copy_ops(state->vma);
>
>         return __mfill_atomic_pte(state, ops);
>  }
>
> base-commit: 444fc9435e57157fcf30fc99aee44997f3458641
> --
> 2.53.0
>

Cheers.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-19  5:25 [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry Mike Rapoport
  2026-05-19  5:36 ` David CARLIER
@ 2026-05-20 11:09 ` David Hildenbrand (Arm)
  2026-05-20 12:53   ` Mike Rapoport
  1 sibling, 1 reply; 9+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20 11:09 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: David Carlier, Heechan Kang, Liam R. Howlett, Lorenzo Stoakes,
	Michael Bommarito, Peter Xu, linux-mm, linux-kernel

On 5/19/26 07:25, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
> reacquires it afterwards. The destination VMA can be replaced during that
> window.
> 
> The existing check compares vma_uffd_ops() before and after the retry, but
> if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
> MAP_PRIVATE (or vice versa) the replacement goes undetected.
> 
> The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
> with shmem_alloc_folio() as anonymous and this will cause BUG() when
> mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
> 
> The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
> the page cache of the original VMA.
> 
> Introduce helpers for more comprehensive comparison of VMA state:
> - vma_snapshot_get() to save the relevant VMA state into a struct
>   vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
>   vm_file and pgoff) before dropping the lock
> - vma_snapshot_changed() to compare the saved state with the state of the
>   VMA acquired after retaking the locks
> - vma_snapshot_put() to release vm_file pinning.
> 
> Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
> error handling paths in mfill_copy_folio_retry().
> 
> Add vma_uffd_copy_ops() to avoid code duplication when original ops of
> shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.
> 
> Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
> Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
> Tested-by: Heechan Kang <gganji11@naver.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Co-developed-by: David Carlier <devnexen@gmail.com>
> Signed-off-by: David Carlier <devnexen@gmail.com>
> Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 79 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 180bad42fc79..b70b84776a79 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -14,6 +14,8 @@
>  #include <linux/userfaultfd_k.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
> +#include <linux/file.h>
> +#include <linux/cleanup.h>
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
>  	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
>  }
>  
> +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
> +{
> +	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
> +
> +	if (!ops)
> +		return NULL;
> +
> +	/*
> +	 * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
> +	 * memory. This is an effective ops override, so retry validation must
> +	 * compare the override result, not just vma->vm_ops->uffd_ops.
> +	 */
> +	if (!(vma->vm_flags & VM_SHARED))
> +		return &anon_uffd_ops;
> +
> +	return ops;
> +}
> +
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
>  {
> @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
>  	return ret;
>  }
>  
> +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
> +
> +struct vma_snapshot {
> +	const struct vm_uffd_ops *copy_ops;
> +	const struct vm_uffd_ops *ops;
> +	struct file *file;
> +	vma_flags_t flags;
> +	pgoff_t pgoff;
> +};

As this is all uffd specific, I wonder whether that should be "struct
uffd_vma_snapshot"/"vma_uffd_snapshot" etc.

From a high level, this LGTM.

I wish we could identify relevant VMA changes more easily. Like, using a per-MM
sequence counter that we simply increment on any VMA changes.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-19  5:36 ` David CARLIER
@ 2026-05-20 12:40   ` Mike Rapoport
  0 siblings, 0 replies; 9+ messages in thread
From: Mike Rapoport @ 2026-05-20 12:40 UTC (permalink / raw)
  To: David CARLIER
  Cc: Andrew Morton, David Hildenbrand, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

Hi David,

On Tue, May 19, 2026 at 06:36:23AM +0100, David CARLIER wrote:
> On Tue, 19 May 2026 at 06:25, Mike Rapoport <rppt@kernel.org> wrote:
>
> > @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> >         return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
> >  }
> >
> > +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
> 
> 
> My only 2 cent, I would name it vma_uffd_effective_copy_ops() instead or
> a comment to highlight it is about "UFFDIO_COPY into a MAP_PRIVATE file-backed"
 
Too long for my taste :)
And comment is useful anyway as it explains why we override the ops at all.
 
> > +{
> > +       const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
> > +
> > +       if (!ops)
> > +               return NULL;
> > +
> > +       /*
> > +        * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
> > +        * memory. This is an effective ops override, so retry validation must
> > +        * compare the override result, not just vma->vm_ops->uffd_ops.
> > +        */
> > +       if (!(vma->vm_flags & VM_SHARED))
> > +               return &anon_uffd_ops;
> > +
> > +       return ops;
> > +}

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-20 11:09 ` David Hildenbrand (Arm)
@ 2026-05-20 12:53   ` Mike Rapoport
  2026-05-20 13:48     ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Rapoport @ 2026-05-20 12:53 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, David Carlier, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On Wed, May 20, 2026 at 01:09:04PM +0200, David Hildenbrand (Arm) wrote:
> On 5/19/26 07:25, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
> > reacquires it afterwards. The destination VMA can be replaced during that
> > window.
> > 
> > The existing check compares vma_uffd_ops() before and after the retry, but
> > if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
> > MAP_PRIVATE (or vice versa) the replacement goes undetected.
> > 
> > The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
> > with shmem_alloc_folio() as anonymous and this will cause BUG() when
> > mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
> > 
> > The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
> > the page cache of the original VMA.
> > 
> > Introduce helpers for more comprehensive comparison of VMA state:
> > - vma_snapshot_get() to save the relevant VMA state into a struct
> >   vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
> >   vm_file and pgoff) before dropping the lock
> > - vma_snapshot_changed() to compare the saved state with the state of the
> >   VMA acquired after retaking the locks
> > - vma_snapshot_put() to release vm_file pinning.
> > 
> > Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
> > error handling paths in mfill_copy_folio_retry().
> > 
> > Add vma_uffd_copy_ops() to avoid code duplication when original ops of
> > shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.
> > 
> > Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
> > Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
> > Tested-by: Heechan Kang <gganji11@naver.com>
> > Suggested-by: Peter Xu <peterx@redhat.com>
> > Co-developed-by: David Carlier <devnexen@gmail.com>
> > Signed-off-by: David Carlier <devnexen@gmail.com>
> > Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
> > Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 79 insertions(+), 20 deletions(-)
> > 
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 180bad42fc79..b70b84776a79 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -14,6 +14,8 @@
> >  #include <linux/userfaultfd_k.h>
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/hugetlb.h>
> > +#include <linux/file.h>
> > +#include <linux/cleanup.h>
> >  #include <asm/tlbflush.h>
> >  #include <asm/tlb.h>
> >  #include "internal.h"
> > @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> >  	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
> >  }
> >  
> > +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
> > +{
> > +	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
> > +
> > +	if (!ops)
> > +		return NULL;
> > +
> > +	/*
> > +	 * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
> > +	 * memory. This is an effective ops override, so retry validation must
> > +	 * compare the override result, not just vma->vm_ops->uffd_ops.
> > +	 */
> > +	if (!(vma->vm_flags & VM_SHARED))
> > +		return &anon_uffd_ops;
> > +
> > +	return ops;
> > +}
> > +
> >  static __always_inline
> >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> >  {
> > @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
> >  	return ret;
> >  }
> >  
> > +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
> > +
> > +struct vma_snapshot {
> > +	const struct vm_uffd_ops *copy_ops;
> > +	const struct vm_uffd_ops *ops;
> > +	struct file *file;
> > +	vma_flags_t flags;
> > +	pgoff_t pgoff;
> > +};
> 
> As this is all uffd specific, I wonder whether that should be "struct
> uffd_vma_snapshot"/"vma_uffd_snapshot" etc.

It's local to this file, I wouldn't worry about namespacing until there's
an actual need if it will ever arise.
 
> From a high level, this LGTM.
> 
> I wish we could identify relevant VMA changes more easily. Like, using a per-MM
> sequence counter that we simply increment on any VMA changes.

Do you mean per-VMA?
Per-MM counter would capture unrelated changes, e.g. an masvise() for
unrelated range.

We kinda have an infrastructure to detect VMA changes that affect uffd
operation for non-cooperative, we can also hook on that even if
UFFD_EVENT_* are not requested, but that's way more involved that this VMA
snapshot.
 
> -- 
> Cheers,
> 
> David

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-20 12:53   ` Mike Rapoport
@ 2026-05-20 13:48     ` David Hildenbrand (Arm)
  2026-05-20 14:03       ` David Hildenbrand (Arm)
  2026-05-20 14:12       ` Mike Rapoport
  0 siblings, 2 replies; 9+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20 13:48 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, David Carlier, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On 5/20/26 14:53, Mike Rapoport wrote:
> On Wed, May 20, 2026 at 01:09:04PM +0200, David Hildenbrand (Arm) wrote:
>> On 5/19/26 07:25, Mike Rapoport wrote:
>>> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>>>
>>> mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
>>> reacquires it afterwards. The destination VMA can be replaced during that
>>> window.
>>>
>>> The existing check compares vma_uffd_ops() before and after the retry, but
>>> if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
>>> MAP_PRIVATE (or vice versa) the replacement goes undetected.
>>>
>>> The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
>>> with shmem_alloc_folio() as anonymous and this will cause BUG() when
>>> mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
>>>
>>> The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
>>> the page cache of the original VMA.
>>>
>>> Introduce helpers for more comprehensive comparison of VMA state:
>>> - vma_snapshot_get() to save the relevant VMA state into a struct
>>>   vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
>>>   vm_file and pgoff) before dropping the lock
>>> - vma_snapshot_changed() to compare the saved state with the state of the
>>>   VMA acquired after retaking the locks
>>> - vma_snapshot_put() to release vm_file pinning.
>>>
>>> Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
>>> error handling paths in mfill_copy_folio_retry().
>>>
>>> Add vma_uffd_copy_ops() to avoid code duplication when original ops of
>>> shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.
>>>
>>> Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
>>> Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
>>> Tested-by: Heechan Kang <gganji11@naver.com>
>>> Suggested-by: Peter Xu <peterx@redhat.com>
>>> Co-developed-by: David Carlier <devnexen@gmail.com>
>>> Signed-off-by: David Carlier <devnexen@gmail.com>
>>> Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
>>> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
>>> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>>> ---
>>>  mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
>>>  1 file changed, 79 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>>> index 180bad42fc79..b70b84776a79 100644
>>> --- a/mm/userfaultfd.c
>>> +++ b/mm/userfaultfd.c
>>> @@ -14,6 +14,8 @@
>>>  #include <linux/userfaultfd_k.h>
>>>  #include <linux/mmu_notifier.h>
>>>  #include <linux/hugetlb.h>
>>> +#include <linux/file.h>
>>> +#include <linux/cleanup.h>
>>>  #include <asm/tlbflush.h>
>>>  #include <asm/tlb.h>
>>>  #include "internal.h"
>>> @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
>>>  	return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
>>>  }
>>>  
>>> +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
>>> +{
>>> +	const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
>>> +
>>> +	if (!ops)
>>> +		return NULL;
>>> +
>>> +	/*
>>> +	 * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
>>> +	 * memory. This is an effective ops override, so retry validation must
>>> +	 * compare the override result, not just vma->vm_ops->uffd_ops.
>>> +	 */
>>> +	if (!(vma->vm_flags & VM_SHARED))
>>> +		return &anon_uffd_ops;
>>> +
>>> +	return ops;
>>> +}
>>> +
>>>  static __always_inline
>>>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
>>>  {
>>> @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
>>>  	return ret;
>>>  }
>>>  
>>> +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
>>> +
>>> +struct vma_snapshot {
>>> +	const struct vm_uffd_ops *copy_ops;
>>> +	const struct vm_uffd_ops *ops;
>>> +	struct file *file;
>>> +	vma_flags_t flags;
>>> +	pgoff_t pgoff;
>>> +};
>>
>> As this is all uffd specific, I wonder whether that should be "struct
>> uffd_vma_snapshot"/"vma_uffd_snapshot" etc.
> 
> It's local to this file, I wouldn't worry about namespacing until there's
> an actual need if it will ever arise.

Indeed, fair enough.

>  
>> From a high level, this LGTM.
>>
>> I wish we could identify relevant VMA changes more easily. Like, using a per-MM
>> sequence counter that we simply increment on any VMA changes.
> 
> Do you mean per-VMA?
> Per-MM counter would capture unrelated changes, e.g. an masvise() for
> unrelated range.

I was thinking about a per-MM thing for simplicity. If there were any changes,
we'd retry (-EAGAIN).

IOW, something like &mm->mm_lock_seq, which we have for per-vma locks already.

Not sure if unrelated changes would really be a problem in practice (another
thread gabbing the mmap_lock in write mode until we serviced the fault).

> 
> We kinda have an infrastructure to detect VMA changes that affect uffd
> operation for non-cooperative, we can also hook on that even if
> UFFD_EVENT_* are not requested, but that's way more involved that this VMA
> snapshot.

Yeah, that sounds too complicated and what you have here is simpler.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-20 13:48     ` David Hildenbrand (Arm)
@ 2026-05-20 14:03       ` David Hildenbrand (Arm)
  2026-05-20 14:12       ` Mike Rapoport
  1 sibling, 0 replies; 9+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20 14:03 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, David Carlier, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On 5/20/26 15:48, David Hildenbrand (Arm) wrote:
> On 5/20/26 14:53, Mike Rapoport wrote:
>> On Wed, May 20, 2026 at 01:09:04PM +0200, David Hildenbrand (Arm) wrote:
>>>
>>> As this is all uffd specific, I wonder whether that should be "struct
>>> uffd_vma_snapshot"/"vma_uffd_snapshot" etc.
>>
>> It's local to this file, I wouldn't worry about namespacing until there's
>> an actual need if it will ever arise.
> 
> Indeed, fair enough.
> 
>>  
>>> From a high level, this LGTM.
>>>
>>> I wish we could identify relevant VMA changes more easily. Like, using a per-MM
>>> sequence counter that we simply increment on any VMA changes.
>>
>> Do you mean per-VMA?
>> Per-MM counter would capture unrelated changes, e.g. an masvise() for
>> unrelated range.
> 
> I was thinking about a per-MM thing for simplicity. If there were any changes,
> we'd retry (-EAGAIN).
> 
> IOW, something like &mm->mm_lock_seq, which we have for per-vma locks already.
> 
> Not sure if unrelated changes would really be a problem in practice (another
> thread gabbing the mmap_lock in write mode until we serviced the fault).

Assume something like the following:


diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 180bad42fc79..608bb3c3a613 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -448,9 +448,12 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
 {
        const struct vm_uffd_ops *orig_ops = vma_uffd_ops(state->vma);
        unsigned long src_addr = state->src_addr;
+       unsigned int seq;
        void *kaddr;
        int err;

+       seq = raw_read_seqcount(&mm->mm_lock_seq);
+
        /* retry copying with mm_lock dropped */
        mfill_put_vma(state);

@@ -472,7 +475,7 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
         * (e.g. replaced with a hugetlb mapping), making the caller's
         * ops pointer stale.
         */
-       if (vma_uffd_ops(state->vma) != orig_ops)
+       if (read_seqcount_retry(&mm->mm_lock_seq, seq))
                return -EAGAIN;

        err = mfill_establish_pmd(state);


In case there is concurrent mmap_write lock, we'd retry. But we triggered and
resolved the fault. So the retry would just succeed in copying from the page,
not triggering another fault here.

So I'd not expect this to matter in practice?


We'd have to implement &mm->mm_lock_seq also without per-VMA locks. Which
wouldn't be too bad given that Dave proposed a patch set to just enable per-VMA
locks unconditionally.

As a hotfix, what you have is probably better, but a simplification like this
might later make sense.

-- 
Cheers,

David


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-20 13:48     ` David Hildenbrand (Arm)
  2026-05-20 14:03       ` David Hildenbrand (Arm)
@ 2026-05-20 14:12       ` Mike Rapoport
  2026-05-20 14:38         ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 9+ messages in thread
From: Mike Rapoport @ 2026-05-20 14:12 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, David Carlier, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On Wed, May 20, 2026 at 03:48:02PM +0200, David Hildenbrand (Arm) wrote:
> On 5/20/26 14:53, Mike Rapoport wrote:
> > On Wed, May 20, 2026 at 01:09:04PM +0200, David Hildenbrand (Arm) wrote:
> >> On 5/19/26 07:25, Mike Rapoport wrote:
> > 
> > Do you mean per-VMA?
> > Per-MM counter would capture unrelated changes, e.g. an masvise() for
> > unrelated range.
> 
> I was thinking about a per-MM thing for simplicity. If there were any changes,
> we'd retry (-EAGAIN).
> 
> IOW, something like &mm->mm_lock_seq, which we have for per-vma locks already.
> 
> Not sure if unrelated changes would really be a problem in practice (another
> thread gabbing the mmap_lock in write mode until we serviced the fault).

Let me reiterate:

A thread doing UFFDIO_COPY releases the VMA in mfill_copy_folio_retry(),
re-gets the VMA and checks if the per-MM counter stayed the same.

If another thread makes any change to VMA while mfill_copy_folio_retry()
waits to re-get the VMA, the counter would be incremented by another
thread. mfill_copy_folio_retry() will see the change after mfill_get_vma()
and will bail out with -EAGAIN.

So if another thread does, e.g. MADV_DONTFORK on completely unrelated VMA,
mfill_copy_folio_retry() will return -EAGAIN and I'm not sure we'll not
break the existing userspace by this.

> -- 
> Cheers,
> 
> David

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry
  2026-05-20 14:12       ` Mike Rapoport
@ 2026-05-20 14:38         ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20 14:38 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, David Carlier, Heechan Kang, Liam R. Howlett,
	Lorenzo Stoakes, Michael Bommarito, Peter Xu, linux-mm,
	linux-kernel

On 5/20/26 16:12, Mike Rapoport wrote:
> On Wed, May 20, 2026 at 03:48:02PM +0200, David Hildenbrand (Arm) wrote:
>> On 5/20/26 14:53, Mike Rapoport wrote:
>>>
>>> Do you mean per-VMA?
>>> Per-MM counter would capture unrelated changes, e.g. an masvise() for
>>> unrelated range.
>>
>> I was thinking about a per-MM thing for simplicity. If there were any changes,
>> we'd retry (-EAGAIN).
>>
>> IOW, something like &mm->mm_lock_seq, which we have for per-vma locks already.
>>
>> Not sure if unrelated changes would really be a problem in practice (another
>> thread gabbing the mmap_lock in write mode until we serviced the fault).
> 
> Let me reiterate:
> 
> A thread doing UFFDIO_COPY releases the VMA in mfill_copy_folio_retry(),
> re-gets the VMA and checks if the per-MM counter stayed the same.
> 
> If another thread makes any change to VMA while mfill_copy_folio_retry()
> waits to re-get the VMA, the counter would be incremented by another
> thread. mfill_copy_folio_retry() will see the change after mfill_get_vma()
> and will bail out with -EAGAIN.
> 

Yeah.

> So if another thread does, e.g. MADV_DONTFORK on completely unrelated VMA,
> mfill_copy_folio_retry() will return -EAGAIN and I'm not sure we'll not
> break the existing userspace by this.

Ah, you mean that -EAGAIN would not be handled by user space already even though
documented in the man page ... because it relies on -EAGAIN not happening in
specific situations (i.e., copying a single page).

That's very likely true and we'd have to retry internally, which makes it more
complicated indeed.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-05-20 14:38 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-19  5:25 [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry Mike Rapoport
2026-05-19  5:36 ` David CARLIER
2026-05-20 12:40   ` Mike Rapoport
2026-05-20 11:09 ` David Hildenbrand (Arm)
2026-05-20 12:53   ` Mike Rapoport
2026-05-20 13:48     ` David Hildenbrand (Arm)
2026-05-20 14:03       ` David Hildenbrand (Arm)
2026-05-20 14:12       ` Mike Rapoport
2026-05-20 14:38         ` David Hildenbrand (Arm)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox