[PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio

public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed

* [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
@ 2026-03-31 13:41 David Carlier
  2026-04-01  3:01 ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: David Carlier @ 2026-03-31 13:41 UTC (permalink / raw)
  To: Peter Xu, Andrew Morton
  Cc: Mike Rapoport, linux-mm, linux-kernel, David Carlier

In mfill_copy_folio_retry(), all locks are dropped to retry
copy_from_user() with page faults enabled. During this window, the VMA
can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
another thread), but the caller proceeds with a folio allocated from the
original VMA's backing store.

Checking ops alone is insufficient: the replacement VMA could be the
same type (e.g. shmem -> shmem) with identical flags but a different
backing inode. Take a snapshot of the VMA's file and flags before
dropping locks, and compare after re-acquiring them. If anything
changed, bail out with -EINVAL.

Use get_file()/fput() rather than ihold()/iput() to hold the file
reference across the lock-dropped window, avoiding potential deadlocks
from filesystem eviction under mmap_lock.

Fixes: 56a3706fd7f9 ("shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops")
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Carlier <devnexen@gmail.com>
---
 mm/userfaultfd.c | 63 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 56 insertions(+), 7 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 481ec7eb4442..93d6a954e659 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -443,33 +443,82 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
 	return ret;
 }
 
+struct vma_snapshot {
+	struct file *file;
+	vma_flags_t flags;
+};
+
+static void vma_snapshot_take(struct vm_area_struct *vma,
+			      struct vma_snapshot *s)
+{
+	memcpy(&s->flags, &vma->flags, sizeof(s->flags));
+	if (vma->vm_file)
+		s->file = get_file(vma->vm_file);
+	else
+		s->file = NULL;
+}
+
+static bool vma_snapshot_changed(struct vm_area_struct *vma,
+				 struct vma_snapshot *s)
+{
+	if (memcmp(&s->flags, &vma->flags, sizeof(s->flags)))
+		return true;
+
+	if (s->file && (!vma->vm_file ||
+	    vma->vm_file->f_inode != s->file->f_inode))
+		return true;
+
+	if (!s->file && !vma_is_anonymous(vma))
+		return true;
+
+	return false;
+}
+
+static void vma_snapshot_release(struct vma_snapshot *s)
+{
+	if (s->file) {
+		fput(s->file);
+		s->file = NULL;
+	}
+}
+
 static int mfill_copy_folio_retry(struct mfill_state *state, struct folio *folio)
 {
 	unsigned long src_addr = state->src_addr;
+	struct vma_snapshot s;
 	void *kaddr;
 	int err;
 
+	/* Take a quick snapshot of the current vma */
+	vma_snapshot_take(state->vma, &s);
+
 	/* retry copying with mm_lock dropped */
 	mfill_put_vma(state);
 
 	kaddr = kmap_local_folio(folio, 0);
 	err = copy_from_user(kaddr, (const void __user *) src_addr, PAGE_SIZE);
 	kunmap_local(kaddr);
-	if (unlikely(err))
-		return -EFAULT;
+	if (unlikely(err)) {
+		err = -EFAULT;
+		goto out;
+	}
 
 	flush_dcache_folio(folio);
 
 	/* reget VMA and PMD, they could change underneath us */
 	err = mfill_get_vma(state);
 	if (err)
-		return err;
+		goto out;
 
-	err = mfill_establish_pmd(state);
-	if (err)
-		return err;
+	if (vma_snapshot_changed(state->vma, &s)) {
+		err = -EINVAL;
+		goto out;
+	}
 
-	return 0;
+	err = mfill_establish_pmd(state);
+out:
+	vma_snapshot_release(&s);
+	return err;
 }
 
 static int __mfill_atomic_pte(struct mfill_state *state,
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
  2026-03-31 13:41 [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry() David Carlier
@ 2026-04-01  3:01 ` Andrew Morton
  2026-04-01  7:49   ` Mike Rapoport
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2026-04-01  3:01 UTC (permalink / raw)
  To: David Carlier; +Cc: Peter Xu, Mike Rapoport, linux-mm, linux-kernel

On Tue, 31 Mar 2026 14:41:58 +0100 David Carlier <devnexen@gmail.com> wrote:

> In mfill_copy_folio_retry(), all locks are dropped to retry
> copy_from_user() with page faults enabled. During this window, the VMA
> can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
> another thread), but the caller proceeds with a folio allocated from the
> original VMA's backing store.
> 
> Checking ops alone is insufficient: the replacement VMA could be the
> same type (e.g. shmem -> shmem) with identical flags but a different
> backing inode. Take a snapshot of the VMA's file and flags before
> dropping locks, and compare after re-acquiring them. If anything
> changed, bail out with -EINVAL.
> 
> Use get_file()/fput() rather than ihold()/iput() to hold the file
> reference across the lock-dropped window, avoiding potential deadlocks
> from filesystem eviction under mmap_lock.

Thanks, I've queued this as a squashable fix against mm-unstable's
"shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
ongoing".

I've fumbled the ball on your [2/2] unlikely() fix ;).  Please resend that
after -rc1.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
  2026-04-01  3:01 ` Andrew Morton
@ 2026-04-01  7:49   ` Mike Rapoport
  2026-04-01  8:06     ` David CARLIER
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2026-04-01  7:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Carlier, Peter Xu, linux-mm, linux-kernel, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka

Hi Andrew,

On Tue, Mar 31, 2026 at 08:01:48PM -0700, Andrew Morton wrote:
> On Tue, 31 Mar 2026 14:41:58 +0100 David Carlier <devnexen@gmail.com> wrote:
> 
> > In mfill_copy_folio_retry(), all locks are dropped to retry
> > copy_from_user() with page faults enabled. During this window, the VMA
> > can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
> > another thread), but the caller proceeds with a folio allocated from the
> > original VMA's backing store.

What does "folio allocated from the original VMA's backing store" exactly
mean? Why is this a problem?
 
> > Checking ops alone is insufficient: the replacement VMA could be the
> > same type (e.g. shmem -> shmem) with identical flags but a different
> > backing inode. Take a snapshot of the VMA's file and flags before
> > dropping locks, and compare after re-acquiring them. If anything
> > changed, bail out with -EINVAL.
> > 
> > Use get_file()/fput() rather than ihold()/iput() to hold the file
> > reference across the lock-dropped window, avoiding potential deadlocks
> > from filesystem eviction under mmap_lock.
> 
> Thanks, I've queued this as a squashable fix against mm-unstable's
> "shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
> ongoing".

First, this a pre-existing and TBH quite theoretical bug and it was there
since the very beginning, so it should not be added as a fixup for the
uffd+guestmemfd series.

Second, I have reservations about vma_snapshot implementation. What
invariant does it exactly enforce? 
 
> I've fumbled the ball on your [2/2] unlikely() fix ;).  Please resend that
> after -rc1.

This one should go the same route IMO. 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
  2026-04-01  7:49   ` Mike Rapoport
@ 2026-04-01  8:06     ` David CARLIER
  2026-04-01 15:23       ` Peter Xu
  2026-04-02  3:58       ` Mike Rapoport
  0 siblings, 2 replies; 13+ messages in thread
From: David CARLIER @ 2026-04-01  8:06 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Peter Xu, linux-mm, linux-kernel, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka

Hi Mike,

  On Tue, Apr 01, 2026 at 08:49:00AM +0300, Mike Rapoport wrote:
  > What does "folio allocated from the original VMA's backing store" exactly
  > mean? Why is this a problem?

  Fair point, the commit message was vague here. What I meant is:

  mfill_atomic_pte_copy() captures ops = vma_uffd_ops(state->vma) and
  passes it to __mfill_atomic_pte(). There, ops->alloc_folio() allocates
  a folio for the original VMA's inode (e.g. a shmem folio for that
  specific shmem inode). Then mfill_copy_folio_retry() drops all locks for
  the copy_from_user retry. After mfill_get_vma() re-acquires them,
  state->vma may now point to a replacement VMA, but ops is still the
  stale pointer from before the drop.

  The code then calls ops->filemap_add(folio, state->vma, ...) which
  would insert a folio allocated for the old inode into the new VMA's
  backing store. If the VMA changed type entirely (e.g. shmem -> anon),
  ops->filemap_add could be operating on a VMA that has no business
  receiving this folio.

  > First, this a pre-existing and TBH quite theoretical bug and it was there
  > since the very beginning, so it should not be added as a fixup for the
  > uffd+guestmemfd series.

  You're right. The race window (VMA replacement during the lock-dropped
  copy retry) existed in the original mcopy_atomic_pte() code long before
  the vm_uffd_ops refactoring. The Fixes tag pointing at 56a3706fd7f9 was
  wrong. I'll drop it and resend as a standalone fix against the original
  retry logic.

  > Second, I have reservations about vma_snapshot implementation. What
  > invariant does it exactly enforce?

  The invariant I was going for: "the folio we allocated is still
  compatible with the VMA we're about to install it into." Since
  alloc_folio() allocates from the VMA's backing file (inode), checking
  that vm_file is still the same after re-acquiring locks ensures the
  folio matches the inode. The vm_flags comparison was a secondary guard
  against permission/type changes during the window.

  That said, I can see the vma_snapshot abstraction is doing too much for
  what's really needed. Would a simpler approach work better — just
  saving vm_file (with get_file/fput) before the drop and comparing it
  directly after re-acquiring? That makes the invariant explicit: "same
  backing file means the folio is valid for this VMA."

  Happy to rework along those lines, or if you have a different approach
  in mind I'm open to suggestions.

  > > I've fumbled the ball on your [2/2] unlikely() fix ;).  Please resend that
  > > after -rc1.
  >
  > This one should go the same route IMO.

  Agreed, I'll resend both after -rc1.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
  2026-04-01  8:06     ` David CARLIER
@ 2026-04-01 15:23       ` Peter Xu
  2026-04-01 18:34         ` David CARLIER
  2026-04-02  3:58       ` Mike Rapoport
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Xu @ 2026-04-01 15:23 UTC (permalink / raw)
  To: David CARLIER
  Cc: Mike Rapoport, Andrew Morton, linux-mm, linux-kernel,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

On Wed, Apr 01, 2026 at 09:06:36AM +0100, David CARLIER wrote:
>   The invariant I was going for: "the folio we allocated is still
>   compatible with the VMA we're about to install it into." Since
>   alloc_folio() allocates from the VMA's backing file (inode), checking
>   that vm_file is still the same after re-acquiring locks ensures the
>   folio matches the inode. The vm_flags comparison was a secondary guard
>   against permission/type changes during the window.
> 
>   That said, I can see the vma_snapshot abstraction is doing too much for
>   what's really needed. Would a simpler approach work better — just
>   saving vm_file (with get_file/fput) before the drop and comparing it
>   directly after re-acquiring? That makes the invariant explicit: "same
>   backing file means the folio is valid for this VMA."

IMHO the flags is needed, consider a shared shmem vma remapped to a private
shmem vma.  That needs to be covered in the fix.

Actually instead of reducing checks, maybe we also need to check the offset
of the mapping too, that is: vma->vm_pgoff can't change otherwise it may
also affect how the back store would behave on this UFFDIO_COPY request.

For that, see the example of shmem_get_pgoff_policy() where it seems we can
apply different policies to different ranges of the back store.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
  2026-04-01 15:23       ` Peter Xu
@ 2026-04-01 18:34         ` David CARLIER
  2026-04-01 19:22           ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: David CARLIER @ 2026-04-01 18:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Rapoport, Andrew Morton, linux-mm, linux-kernel,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

Hi Peter,

  On Tue, Apr 01, 2026 at 04:23:00PM +0300, Peter Xu wrote:
  > IMHO the flags is needed, consider a shared shmem vma remapped to a private
  > shmem vma.  That needs to be covered in the fix.

  Right, I hadn't considered that case. Shared->private changes how
the
  folio gets handled even with the same backing file. I'll keep the
flags
  check.

  > Actually instead of reducing checks, maybe we also need to check
the offset
  > of the mapping too, that is: vma->vm_pgoff can't change otherwise it may
  > also affect how the back store would behave on this UFFDIO_COPY
request.
  >
  > For that, see the example of shmem_get_pgoff_policy() where it
seems we can
  > apply different policies to different ranges of the back store.

  Good point. If vm_pgoff changes, linear_page_index() derives a
  different page cache offset for the same virtual address, and
  shmem_get_pgoff_policy() could apply a different NUMA policy to that
  range. So the folio could end up at the wrong offset or with the wrong
  placement.

  I'll add vm_pgoff to the snapshot. So the full set of checks after
  re-acquiring locks would be: vm_file, vm_flags, and vm_pgoff — ensuring
  the folio was allocated for the right backing file, at the right
offset,
  with the right VMA type.

  Thanks

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
  2026-04-01 18:34         ` David CARLIER
@ 2026-04-01 19:22           ` Peter Xu
  2026-04-01 20:05             ` David CARLIER
  2026-04-02  4:02             ` Mike Rapoport
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Xu @ 2026-04-01 19:22 UTC (permalink / raw)
  To: David CARLIER
  Cc: Mike Rapoport, Andrew Morton, linux-mm, linux-kernel,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

On Wed, Apr 01, 2026 at 07:34:47PM +0100, David CARLIER wrote:
> Hi Peter,
> 
>   On Tue, Apr 01, 2026 at 04:23:00PM +0300, Peter Xu wrote:
>   > IMHO the flags is needed, consider a shared shmem vma remapped to a private
>   > shmem vma.  That needs to be covered in the fix.
> 
>   Right, I hadn't considered that case. Shared->private changes how
> the
>   folio gets handled even with the same backing file. I'll keep the
> flags
>   check.
> 
>   > Actually instead of reducing checks, maybe we also need to check
> the offset
>   > of the mapping too, that is: vma->vm_pgoff can't change otherwise it may
>   > also affect how the back store would behave on this UFFDIO_COPY
> request.
>   >
>   > For that, see the example of shmem_get_pgoff_policy() where it
> seems we can
>   > apply different policies to different ranges of the back store.
> 
>   Good point. If vm_pgoff changes, linear_page_index() derives a
>   different page cache offset for the same virtual address, and
>   shmem_get_pgoff_policy() could apply a different NUMA policy to that
>   range. So the folio could end up at the wrong offset or with the wrong
>   placement.
> 
>   I'll add vm_pgoff to the snapshot. So the full set of checks after
>   re-acquiring locks would be: vm_file, vm_flags, and vm_pgoff — ensuring
>   the folio was allocated for the right backing file, at the right
> offset,
>   with the right VMA type.

When caching the offset, we should likely use linear_page_index() with the
address provided rather than caching vma->vm_pgoff directly, then it'll
avoid same vm_pgoff while VMA mapping shifted like this:

  VMA1:  vm_pgoff=0x10000, vm_start=0x10000
  VMA2:  vm_pgoff=0x10000, vm_start=0xf000

So VMA1 unmapped, then the app mappped VMA2 at different VA but still cover
the address we're requesting for UFFDIO_COPY, even if the VMA will still
have the same vm_pgoff, the VA to access the same offset might change.
Using linear_page_index() will be accurate, IIUC.

The other thing is I just noticed the err code was changed to -EINVAL for
snapshot changed cases, sorry I didn't follow previously as closely on the
discussion.  I think it should be -EAGAIN.  It's because the userapp can't
resolve -EINVAL failures and app will crash.  In a VMA change use case, we
should return -EAGAIN to imply the app to retry, rather than crashing.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
  2026-04-01 19:22           ` Peter Xu
@ 2026-04-01 20:05             ` David CARLIER
  2026-04-02  4:02             ` Mike Rapoport
  1 sibling, 0 replies; 13+ messages in thread
From: David CARLIER @ 2026-04-01 20:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Rapoport, Andrew Morton, linux-mm, linux-kernel,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

On Tue, Apr 01, 2026 at 08:22:00PM +0300, Peter Xu wrote:
  > When caching the offset, we should likely use linear_page_index() with the
  > address provided rather than caching vma->vm_pgoff directly, then
it'll
  > avoid same vm_pgoff while VMA mapping shifted like this:
  >
  >   VMA1:  vm_pgoff=0x10000, vm_start=0x10000
  >   VMA2:  vm_pgoff=0x10000, vm_start=0xf000

  Makes sense. linear_page_index() folds in vm_start so it catches the
  shifted mapping case where raw vm_pgoff alone wouldn't. I'll snapshot
  the computed pgoff instead.

  > The other thing is I just noticed the err code was changed to -EINVAL for
  > snapshot changed cases, sorry I didn't follow previously as closely on the
  > discussion.  I think it should be -EAGAIN.  It's because the userapp can't
  > resolve -EINVAL failures and app will crash.  In a VMA change use
case, we
  > should return -EAGAIN to imply the app to retry, rather than crashing.

  Right, -EAGAIN is the correct choice here. The VMA changed underneath
  us, but that's a transient condition the app can recover from by
  retrying. Will fix.

  So v5 will snapshot: vm_file (with get_file/fput), vm_flags, and
  linear_page_index(vma, dst_addr). Return -EAGAIN if any changed.

  Thanks for the detailed review


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
  2026-04-01 19:22           ` Peter Xu
  2026-04-01 20:05             ` David CARLIER
@ 2026-04-02  4:02             ` Mike Rapoport
  2026-04-02  5:59               ` David CARLIER
  2026-04-02 13:29               ` Peter Xu
  1 sibling, 2 replies; 13+ messages in thread
From: Mike Rapoport @ 2026-04-02  4:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: David CARLIER, Andrew Morton, linux-mm, linux-kernel,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

On Wed, Apr 01, 2026 at 03:22:03PM -0400, Peter Xu wrote:
>
> The other thing is I just noticed the err code was changed to -EINVAL for
> snapshot changed cases, sorry I didn't follow previously as closely on the
> discussion.  I think it should be -EAGAIN.  It's because the userapp can't
> resolve -EINVAL failures and app will crash.  In a VMA change use case, we
> should return -EAGAIN to imply the app to retry, rather than crashing.

No. The return value should express that the VMA is invalid. -EINVAL could
work, but looking now at the manual -ENOENT would be even better:

       ENOENT (since Linux 4.11)
              The faulting process has changed its virtual memory layout
              simultaneously with an outstanding UFFDIO_COPY operation.

> Thanks,
> -- 
> Peter Xu

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
  2026-04-02  4:02             ` Mike Rapoport
@ 2026-04-02  5:59               ` David CARLIER
  2026-04-02 13:29               ` Peter Xu
  1 sibling, 0 replies; 13+ messages in thread
From: David CARLIER @ 2026-04-02  5:59 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Peter Xu, Andrew Morton, linux-mm, linux-kernel, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka

 Understood on all points. Will rework patch 1 to a simple ops
comparison with -ENOENT, drop vma_snapshot entirely. Will cc MEMORY
MAPPING folks.

  Holding off until after -rc1, will resend both then.

Cheers.

On Thu, 2 Apr 2026 at 05:02, Mike Rapoport <rppt@kernel.org> wrote:
>
> On Wed, Apr 01, 2026 at 03:22:03PM -0400, Peter Xu wrote:
> >
> > The other thing is I just noticed the err code was changed to -EINVAL for
> > snapshot changed cases, sorry I didn't follow previously as closely on the
> > discussion.  I think it should be -EAGAIN.  It's because the userapp can't
> > resolve -EINVAL failures and app will crash.  In a VMA change use case, we
> > should return -EAGAIN to imply the app to retry, rather than crashing.
>
> No. The return value should express that the VMA is invalid. -EINVAL could
> work, but looking now at the manual -ENOENT would be even better:
>
>        ENOENT (since Linux 4.11)
>               The faulting process has changed its virtual memory layout
>               simultaneously with an outstanding UFFDIO_COPY operation.
>
> > Thanks,
> > --
> > Peter Xu
>
> --
> Sincerely yours,
> Mike.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
  2026-04-02  4:02             ` Mike Rapoport
  2026-04-02  5:59               ` David CARLIER
@ 2026-04-02 13:29               ` Peter Xu
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Xu @ 2026-04-02 13:29 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: David CARLIER, Andrew Morton, linux-mm, linux-kernel,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

Hi, Mike,

On Thu, Apr 02, 2026 at 07:02:40AM +0300, Mike Rapoport wrote:
> On Wed, Apr 01, 2026 at 03:22:03PM -0400, Peter Xu wrote:
> >
> > The other thing is I just noticed the err code was changed to -EINVAL for
> > snapshot changed cases, sorry I didn't follow previously as closely on the
> > discussion.  I think it should be -EAGAIN.  It's because the userapp can't
> > resolve -EINVAL failures and app will crash.  In a VMA change use case, we
> > should return -EAGAIN to imply the app to retry, rather than crashing.
> 
> No. The return value should express that the VMA is invalid. -EINVAL could
> work, but looking now at the manual -ENOENT would be even better:
> 
>        ENOENT (since Linux 4.11)
>               The faulting process has changed its virtual memory layout
>               simultaneously with an outstanding UFFDIO_COPY operation.

The VMA changed, but it doesn't mean the UFFDIO_COPY becomes illegal, am I
right?

For example, I wonder if it's possible someone runs soft-dirty concurrently
with userfaultfd, we shouldn't fail the userapp if there's a concurrent
thread collecting dirty information, which IIUC can cause VMA flag changes,
and should be benign, and I think there can be other things causing the
interruption too.

-EAGAIN essentially requested the user to retry.  If the VMA is not legal
to be operated anymore, it will fail later.  However we should allow legal
users to pass.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
  2026-04-01  8:06     ` David CARLIER
  2026-04-01 15:23       ` Peter Xu
@ 2026-04-02  3:58       ` Mike Rapoport
  2026-04-02 13:42         ` Peter Xu
  1 sibling, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2026-04-02  3:58 UTC (permalink / raw)
  To: David CARLIER
  Cc: Andrew Morton, Peter Xu, linux-mm, linux-kernel, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka

Hi David,

It feels that you use an LLM for correspondence. Please tune it down to
produce more laconic and to the point responses.

On Wed, Apr 01, 2026 at 09:06:36AM +0100, David CARLIER wrote:
>   On Tue, Apr 01, 2026 at 08:49:00AM +0300, Mike Rapoport wrote:
>   > What does "folio allocated from the original VMA's backing store" exactly
>   > mean? Why is this a problem?
> 
>   Fair point, the commit message was vague here. What I meant is:
> 
>   mfill_atomic_pte_copy() captures ops = vma_uffd_ops(state->vma) and
>   passes it to __mfill_atomic_pte(). There, ops->alloc_folio() allocates
>   a folio for the original VMA's inode (e.g. a shmem folio for that
>   specific shmem inode). 

I wouldn't say ->alloc_folio() allocates a folio _for_ the inode, it
allocates it with inode's memory policy. Worst can happen without any
changes is that the allocated folio will end up in a wrong node.

This is still a footgun, but I don't see it as a big deal.
Let's revisit it after -rc1 and please make sure to cc "MEMORY MAPPING"
folks for insights about how to better track VMA changes or their absence.

>   Then mfill_copy_folio_retry() drops all locks for
>   the copy_from_user retry. After mfill_get_vma() re-acquires them,
>   state->vma may now point to a replacement VMA, but ops is still the
>   stale pointer from before the drop.

And this is a bug in my uffd refactoring, and it needs to be fixed ASAP
with a simple comparison of old ops and new ops.

>   > Second, I have reservations about vma_snapshot implementation. What
>   > invariant does it exactly enforce?
> 
>   The invariant I was going for: "the folio we allocated is still
>   compatible with the VMA we're about to install it into." Since
>   alloc_folio() allocates from the VMA's backing file (inode), checking
>   that vm_file is still the same after re-acquiring locks ensures the
>   folio matches the inode.

Again, it's not that folio matches the inode, but folio is allocated using
the correct mempolicy.

>   The vm_flags comparison was a secondary guard against permission/type
>   changes during the window.

Permissions should be fine, they are checked in userfaultfd_register.
Some other flags that don't matter to uffd operation may change during the
window, though and then a comparison of vm_flags will give a false
positive.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
  2026-04-02  3:58       ` Mike Rapoport
@ 2026-04-02 13:42         ` Peter Xu
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Xu @ 2026-04-02 13:42 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: David CARLIER, Andrew Morton, linux-mm, linux-kernel,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

Hi, Mike,

Let me also leave my comments inline just for you to consider.

On Thu, Apr 02, 2026 at 06:58:33AM +0300, Mike Rapoport wrote:
> Hi David,
> 
> It feels that you use an LLM for correspondence. Please tune it down to
> produce more laconic and to the point responses.
> 
> On Wed, Apr 01, 2026 at 09:06:36AM +0100, David CARLIER wrote:
> >   On Tue, Apr 01, 2026 at 08:49:00AM +0300, Mike Rapoport wrote:
> >   > What does "folio allocated from the original VMA's backing store" exactly
> >   > mean? Why is this a problem?
> > 
> >   Fair point, the commit message was vague here. What I meant is:
> > 
> >   mfill_atomic_pte_copy() captures ops = vma_uffd_ops(state->vma) and
> >   passes it to __mfill_atomic_pte(). There, ops->alloc_folio() allocates
> >   a folio for the original VMA's inode (e.g. a shmem folio for that
> >   specific shmem inode). 
> 
> I wouldn't say ->alloc_folio() allocates a folio _for_ the inode, it
> allocates it with inode's memory policy. Worst can happen without any
> changes is that the allocated folio will end up in a wrong node.

For shmem it's only about mempolicy indeed, but since we're trying to
export it as an API in the series, IMHO it would be nice to be generic.  So
we shouldn't assume it's only about mempolicy, we should rely on detecting
any context change and bail out with -EAGAIN, relying all rest checks to
the next UFFDIO_COPY ioctl done on top of the new mapping topology.

> 
> This is still a footgun, but I don't see it as a big deal.

IIUC this is a real bug reported.  Actually, if my understanding is
correct, we should be able to easily write a reproducer by registering the
src addr of UFFDIO_COPY to userfaultfd too, then the ioctl(UFFDIO_COPY)
thread will get blocked faulting in the src_addr.  During that, we can
change the VMA layout in another thread to test whatever setup we want.

> Let's revisit it after -rc1 and please make sure to cc "MEMORY MAPPING"
> folks for insights about how to better track VMA changes or their absence.

No strong feeling here if we want to slightly postpone this fix.  It looks
like not easy to happen as it looks to be a bug present for a while, indeed.

It's just that if my understanding is correct, with above reproducer we can
crash the kernel easily without a proper fix.

> 
> >   Then mfill_copy_folio_retry() drops all locks for
> >   the copy_from_user retry. After mfill_get_vma() re-acquires them,
> >   state->vma may now point to a replacement VMA, but ops is still the
> >   stale pointer from before the drop.
> 
> And this is a bug in my uffd refactoring, and it needs to be fixed ASAP
> with a simple comparison of old ops and new ops.
>  
> >   > Second, I have reservations about vma_snapshot implementation. What
> >   > invariant does it exactly enforce?
> > 
> >   The invariant I was going for: "the folio we allocated is still
> >   compatible with the VMA we're about to install it into." Since
> >   alloc_folio() allocates from the VMA's backing file (inode), checking
> >   that vm_file is still the same after re-acquiring locks ensures the
> >   folio matches the inode.
> 
> Again, it's not that folio matches the inode, but folio is allocated using
> the correct mempolicy.
> 
> >   The vm_flags comparison was a secondary guard against permission/type
> >   changes during the window.
> 
> Permissions should be fine, they are checked in userfaultfd_register.
> Some other flags that don't matter to uffd operation may change during the
> window, though and then a comparison of vm_flags will give a false
> positive.

IMHO false positive is fine in this case when -EAGAIN will be used (which I
still think we should), if it only causes a retry.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-04-02 13:42 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-31 13:41 [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry() David Carlier
2026-04-01  3:01 ` Andrew Morton
2026-04-01  7:49   ` Mike Rapoport
2026-04-01  8:06     ` David CARLIER
2026-04-01 15:23       ` Peter Xu
2026-04-01 18:34         ` David CARLIER
2026-04-01 19:22           ` Peter Xu
2026-04-01 20:05             ` David CARLIER
2026-04-02  4:02             ` Mike Rapoport
2026-04-02  5:59               ` David CARLIER
2026-04-02 13:29               ` Peter Xu
2026-04-02  3:58       ` Mike Rapoport
2026-04-02 13:42         ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox