* [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs
@ 2025-07-11 11:38 Lorenzo Stoakes
2025-07-11 11:38 ` [PATCH v3 01/10] mm/mremap: perform some simple cleanups Lorenzo Stoakes
` (10 more replies)
0 siblings, 11 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 11:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case for anonymous mappings, where a moved mapping may
or may not be mergeable with its neighbours depending on whether the VMAs
involved have been faulted (and thus had an anon_vma assigned) and whether
folio indices align with vma->vm_pgoff.
Often this results in surprising behaviour where a region is faulted, moved
and then moved back, and the user fails to observe a merge of otherwise
compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations have
resulted in VMA fragmentation.
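To make the scenario concrete, here is a minimal userspace sketch (purely
illustrative, assuming 4 KiB pages and glibc's mremap() exposed via
_GNU_SOURCE) of the sort of fragmentation described above:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t pg = 4096;
	char *p, *dst;

	p = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	dst = mmap(NULL, pg, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED || dst == MAP_FAILED)
		return 1;

	memset(p, 1, 2 * pg);	/* fault both pages, assigning anon_vmas */

	/* Move the second page away, then back to where it was. */
	if (mremap(p + pg, pg, pg, MREMAP_MAYMOVE | MREMAP_FIXED, dst) == MAP_FAILED)
		return 1;
	if (mremap(dst, pg, pg, MREMAP_MAYMOVE | MREMAP_FIXED, p + pg) == MAP_FAILED)
		return 1;

	/*
	 * The two halves may now remain separate, unmerged VMAs; prior to
	 * this series an mremap() spanning both would fail with -EFAULT
	 * even though the whole range is mapped.
	 */
	printf("inspect /proc/self/maps around %p\n", (void *)p);
	return 0;
}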
In order to do this, this series performs a large amount of refactoring, most
pertinently grouping sanity checks together, separating those that check
input parameters from those relating to VMAs.
We also simplify the post-mmap lock drop processing for uffd and mlock()'d
VMAs.
With this done, we can then fairly straightforwardly implement this
functionality.
This works exclusively for mremap() invocations which specify
MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
notification of the userland fault handler would require us to drop the
mmap lock.
It is also not compatible with file-backed mappings with customised
get_unmapped_area() handlers as these may not honour MREMAP_FIXED.
The input and output address ranges must not overlap. We carefully
account for moves which would result in VMA iterator invalidation.
While there can be gaps between VMAs in the input range, there can be no
gap before the first VMA in the range.
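As a rough illustration of the new uAPI behaviour (a hedged sketch assuming
4 KiB pages; this is not one of the series' selftests), a single
MREMAP_FIXED call can now move a range spanning several VMAs separated by
gaps, where previously the call would fail with -EFAULT:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t pg = 4096, len = 3 * pg;
	char *src, *dst;
	void *res;

	src = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	dst = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED || dst == MAP_FAILED)
		return 1;

	/* Punch a hole so [src, src + len) spans two VMAs with a gap between. */
	if (munmap(src + pg, pg))
		return 1;

	/* old_len == new_len and MREMAP_FIXED: eligible for a multi-VMA move. */
	res = mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	printf("multi-VMA move %s\n", res == MAP_FAILED ? "failed" : "succeeded");
	return res == MAP_FAILED;
}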
v3:
* Disallowed move operation except for MREMAP_FIXED.
* Disallow gap at start of aggregate range to avoid confusion.
* Disallow any file-backed VMAs with custom get_unmapped_area.
* Renamed multi_vma to seen_vma to be clearer. Stop reusing new_addr, use
separate target_addr var to track next target address.
* Check if first VMA fails multi VMA check, if so we'll allow one VMA but
not multiple.
* Updated the commit message for patch 9 to be clearer about gap behaviour.
* Removed accidentally included debug goto statement in test (doh!). Test
was and is passing regardless.
* Unmap target range in test, previously we ended up moving additional VMAs
unintentionally. This still all passed :) but was not what was intended.
* Removed self-merge check - there is absolutely no way this can happen
across multiple VMAs, as there is no means of moving VMAs such that a VMA
merges with itself.
v2:
* Squashed uffd stub fix into series.
* Propagated tags, thanks!
* Fixed param naming in patch 4 as per Vlastimil.
* Renamed vma_reset to vmi_needs_reset + dropped reset on unmap as per
Liam.
* Correctly return -EFAULT if no VMAs in input range.
* Account for get_unmapped_area() disregarding MAP_FIXED and returning an
altered address.
* Added additional explanatory comment to the remap_move() function.
https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/
v1:
https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (10):
mm/mremap: perform some simple cleanups
mm/mremap: refactor initial parameter sanity checks
mm/mremap: put VMA check and prep logic into helper function
mm/mremap: cleanup post-processing stage of mremap
mm/mremap: use an explicit uffd failure path for mremap
mm/mremap: check remap conditions earlier
mm/mremap: move remap_is_valid() into check_prep_vma()
mm/mremap: clean up mlock populate behaviour
mm/mremap: permit mremap() move of multiple VMAs
tools/testing/selftests: extend mremap_test to test multi-VMA mremap
fs/userfaultfd.c | 15 +-
include/linux/userfaultfd_k.h | 5 +
mm/mremap.c | 553 +++++++++++++++--------
tools/testing/selftests/mm/mremap_test.c | 146 +++++-
4 files changed, 518 insertions(+), 201 deletions(-)
--
2.50.0
* [PATCH v3 01/10] mm/mremap: perform some simple cleanups
2025-07-11 11:38 [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
@ 2025-07-11 11:38 ` Lorenzo Stoakes
2025-07-11 11:38 ` [PATCH v3 02/10] mm/mremap: refactor initial parameter sanity checks Lorenzo Stoakes
` (9 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 11:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest
We const-ify the vrm flags parameter to indicate this will never change.
We rename resize_is_valid() to remap_is_valid(), as this function does not
only apply to cases where we resize, so it's simply confusing to refer to
that here.
We remove the BUG() from mremap_at(), as we should not BUG() unless we are
certain that continuing would result in system instability.
We rename vrm_charge() to vrm_calc_charge() to make it clear this simply
calculates the charged number of pages rather than actually adjusting any
state.
We update the comment for vrm_implies_new_addr() to explain that
MREMAP_DONTUNMAP does not require the user to specify an address, but the
mapping will always be moved.
Additionally, we consistently use 'res' rather than 'ret' for result values.
No functional change intended.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/mremap.c | 55 +++++++++++++++++++++++++++++++----------------------
1 file changed, 32 insertions(+), 23 deletions(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index 36585041c760..1815095c4bca 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -52,7 +52,7 @@ struct vma_remap_struct {
unsigned long addr; /* User-specified address from which we remap. */
unsigned long old_len; /* Length of range being remapped. */
unsigned long new_len; /* Desired new length of mapping. */
- unsigned long flags; /* user-specified MREMAP_* flags. */
+ const unsigned long flags; /* user-specified MREMAP_* flags. */
unsigned long new_addr; /* Optionally, desired new address. */
/* uffd state. */
@@ -911,7 +911,11 @@ static bool vrm_overlaps(struct vma_remap_struct *vrm)
return false;
}
-/* Do the mremap() flags require that the new_addr parameter be specified? */
+/*
+ * Will a new address definitely be assigned? This is the case either if the
+ * user specifies it via MREMAP_FIXED, or if MREMAP_DONTUNMAP is used,
+ * indicating we will always determine a target address.
+ */
static bool vrm_implies_new_addr(struct vma_remap_struct *vrm)
{
return vrm->flags & (MREMAP_FIXED | MREMAP_DONTUNMAP);
@@ -957,7 +961,7 @@ static unsigned long vrm_set_new_addr(struct vma_remap_struct *vrm)
*
* Returns true on success, false if insufficient memory to charge.
*/
-static bool vrm_charge(struct vma_remap_struct *vrm)
+static bool vrm_calc_charge(struct vma_remap_struct *vrm)
{
unsigned long charged;
@@ -1262,8 +1266,11 @@ static unsigned long move_vma(struct vma_remap_struct *vrm)
if (err)
return err;
- /* If accounted, charge the number of bytes the operation will use. */
- if (!vrm_charge(vrm))
+ /*
+ * If accounted, determine the number of bytes the operation will
+ * charge.
+ */
+ if (!vrm_calc_charge(vrm))
return -ENOMEM;
/* We don't want racing faults. */
@@ -1302,12 +1309,12 @@ static unsigned long move_vma(struct vma_remap_struct *vrm)
}
/*
- * resize_is_valid() - Ensure the vma can be resized to the new length at the give
- * address.
+ * remap_is_valid() - Ensure the VMA can be moved or resized to the new length,
+ * at the given address.
*
* Return 0 on success, error otherwise.
*/
-static int resize_is_valid(struct vma_remap_struct *vrm)
+static int remap_is_valid(struct vma_remap_struct *vrm)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = vrm->vma;
@@ -1446,7 +1453,7 @@ static unsigned long mremap_to(struct vma_remap_struct *vrm)
vrm->old_len = vrm->new_len;
}
- err = resize_is_valid(vrm);
+ err = remap_is_valid(vrm);
if (err)
return err;
@@ -1571,7 +1578,7 @@ static unsigned long expand_vma_in_place(struct vma_remap_struct *vrm)
struct vm_area_struct *vma = vrm->vma;
VMA_ITERATOR(vmi, mm, vma->vm_end);
- if (!vrm_charge(vrm))
+ if (!vrm_calc_charge(vrm))
return -ENOMEM;
/*
@@ -1632,7 +1639,7 @@ static unsigned long expand_vma(struct vma_remap_struct *vrm)
unsigned long err;
unsigned long addr = vrm->addr;
- err = resize_is_valid(vrm);
+ err = remap_is_valid(vrm);
if (err)
return err;
@@ -1705,18 +1712,20 @@ static unsigned long mremap_at(struct vma_remap_struct *vrm)
return expand_vma(vrm);
}
- BUG();
+ /* Should not be possible. */
+ WARN_ON_ONCE(1);
+ return -EINVAL;
}
static unsigned long do_mremap(struct vma_remap_struct *vrm)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
- unsigned long ret;
+ unsigned long res;
- ret = check_mremap_params(vrm);
- if (ret)
- return ret;
+ res = check_mremap_params(vrm);
+ if (res)
+ return res;
vrm->old_len = PAGE_ALIGN(vrm->old_len);
vrm->new_len = PAGE_ALIGN(vrm->new_len);
@@ -1728,41 +1737,41 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
vma = vrm->vma = vma_lookup(mm, vrm->addr);
if (!vma) {
- ret = -EFAULT;
+ res = -EFAULT;
goto out;
}
/* If mseal()'d, mremap() is prohibited. */
if (!can_modify_vma(vma)) {
- ret = -EPERM;
+ res = -EPERM;
goto out;
}
/* Align to hugetlb page size, if required. */
if (is_vm_hugetlb_page(vma) && !align_hugetlb(vrm)) {
- ret = -EINVAL;
+ res = -EINVAL;
goto out;
}
vrm->remap_type = vrm_remap_type(vrm);
/* Actually execute mremap. */
- ret = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
+ res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
out:
if (vrm->mmap_locked) {
mmap_write_unlock(mm);
vrm->mmap_locked = false;
- if (!offset_in_page(ret) && vrm->mlocked && vrm->new_len > vrm->old_len)
+ if (!offset_in_page(res) && vrm->mlocked && vrm->new_len > vrm->old_len)
mm_populate(vrm->new_addr + vrm->old_len, vrm->delta);
}
userfaultfd_unmap_complete(mm, vrm->uf_unmap_early);
- mremap_userfaultfd_complete(vrm->uf, vrm->addr, ret, vrm->old_len);
+ mremap_userfaultfd_complete(vrm->uf, vrm->addr, res, vrm->old_len);
userfaultfd_unmap_complete(mm, vrm->uf_unmap);
- return ret;
+ return res;
}
/*
--
2.50.0
* [PATCH v3 02/10] mm/mremap: refactor initial parameter sanity checks
2025-07-11 11:38 [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
2025-07-11 11:38 ` [PATCH v3 01/10] mm/mremap: perform some simple cleanups Lorenzo Stoakes
@ 2025-07-11 11:38 ` Lorenzo Stoakes
2025-07-11 11:38 ` [PATCH v3 03/10] mm/mremap: put VMA check and prep logic into helper function Lorenzo Stoakes
` (8 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 11:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest
We currently check some things immediately and others later. Aggregate the
checks and avoid those that need not be made.
Simplify things by aligning lengths immediately. Defer setting the delta
parameter until later, which removes some duplicate code in the hugetlb
case.
We can safely perform the checks moved from mremap_to() to
check_mremap_params() because:
* If we set a new address via vrm_set_new_addr(), then this is guaranteed
to not overlap nor to position the new VMA past TASK_SIZE, so there's no
need to check these later.
* We can simply page align lengths immediately. We do not need to check for
overlap nor TASK_SIZE sanity after hugetlb alignment as this asserts
addresses are huge-aligned, then huge-aligns lengths, rounding down. This
means any existing overlap would have already been caught.
Moving things around like this lays the groundwork for subsequent changes
to permit operations on batches of VMAs.
No functional change intended.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/mremap.c | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index 1815095c4bca..a00da0288c37 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1415,14 +1415,6 @@ static unsigned long mremap_to(struct vma_remap_struct *vrm)
struct mm_struct *mm = current->mm;
unsigned long err;
- /* Is the new length or address silly? */
- if (vrm->new_len > TASK_SIZE ||
- vrm->new_addr > TASK_SIZE - vrm->new_len)
- return -EINVAL;
-
- if (vrm_overlaps(vrm))
- return -EINVAL;
-
if (vrm->flags & MREMAP_FIXED) {
/*
* In mremap_to().
@@ -1527,7 +1519,12 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
* for DOS-emu "duplicate shm area" thing. But
* a zero new-len is nonsensical.
*/
- if (!PAGE_ALIGN(vrm->new_len))
+ if (!vrm->new_len)
+ return -EINVAL;
+
+ /* Is the new length or address silly? */
+ if (vrm->new_len > TASK_SIZE ||
+ vrm->new_addr > TASK_SIZE - vrm->new_len)
return -EINVAL;
/* Remainder of checks are for cases with specific new_addr. */
@@ -1546,6 +1543,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
if (flags & MREMAP_DONTUNMAP && vrm->old_len != vrm->new_len)
return -EINVAL;
+ /* Target VMA must not overlap source VMA. */
+ if (vrm_overlaps(vrm))
+ return -EINVAL;
+
/*
* move_vma() need us to stay 4 maps below the threshold, otherwise
* it will bail out at the very beginning.
@@ -1622,8 +1623,6 @@ static bool align_hugetlb(struct vma_remap_struct *vrm)
if (vrm->new_len > vrm->old_len)
return false;
- vrm_set_delta(vrm);
-
return true;
}
@@ -1723,14 +1722,13 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
struct vm_area_struct *vma;
unsigned long res;
+ vrm->old_len = PAGE_ALIGN(vrm->old_len);
+ vrm->new_len = PAGE_ALIGN(vrm->new_len);
+
res = check_mremap_params(vrm);
if (res)
return res;
- vrm->old_len = PAGE_ALIGN(vrm->old_len);
- vrm->new_len = PAGE_ALIGN(vrm->new_len);
- vrm_set_delta(vrm);
-
if (mmap_write_lock_killable(mm))
return -EINTR;
vrm->mmap_locked = true;
@@ -1753,6 +1751,7 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
goto out;
}
+ vrm_set_delta(vrm);
vrm->remap_type = vrm_remap_type(vrm);
/* Actually execute mremap. */
--
2.50.0
* [PATCH v3 03/10] mm/mremap: put VMA check and prep logic into helper function
2025-07-11 11:38 [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
2025-07-11 11:38 ` [PATCH v3 01/10] mm/mremap: perform some simple cleanups Lorenzo Stoakes
2025-07-11 11:38 ` [PATCH v3 02/10] mm/mremap: refactor initial parameter sanity checks Lorenzo Stoakes
@ 2025-07-11 11:38 ` Lorenzo Stoakes
2025-07-11 11:38 ` [PATCH v3 04/10] mm/mremap: cleanup post-processing stage of mremap Lorenzo Stoakes
` (7 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 11:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest
Rather than lumping everything together in do_mremap(), add a new helper
function, check_prep_vma(), to do the work relating to each VMA.
This further lays groundwork for subsequent patches which will allow for
batched VMA mremap().
Additionally, if we set vrm->new_addr == vrm->addr when prepping the VMA,
this avoids us needing to do so in the expand VMA mlocked case.
No functional change intended.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/mremap.c | 58 ++++++++++++++++++++++++++---------------------------
1 file changed, 28 insertions(+), 30 deletions(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index a00da0288c37..d57645573e0d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1636,7 +1636,6 @@ static bool align_hugetlb(struct vma_remap_struct *vrm)
static unsigned long expand_vma(struct vma_remap_struct *vrm)
{
unsigned long err;
- unsigned long addr = vrm->addr;
err = remap_is_valid(vrm);
if (err)
@@ -1651,16 +1650,8 @@ static unsigned long expand_vma(struct vma_remap_struct *vrm)
if (err)
return err;
- /*
- * We want to populate the newly expanded portion of the VMA to
- * satisfy the expectation that mlock()'ing a VMA maintains all
- * of its pages in memory.
- */
- if (vrm->mlocked)
- vrm->new_addr = addr;
-
/* OK we're done! */
- return addr;
+ return vrm->addr;
}
/*
@@ -1716,10 +1707,33 @@ static unsigned long mremap_at(struct vma_remap_struct *vrm)
return -EINVAL;
}
+static int check_prep_vma(struct vma_remap_struct *vrm)
+{
+ struct vm_area_struct *vma = vrm->vma;
+
+ if (!vma)
+ return -EFAULT;
+
+ /* If mseal()'d, mremap() is prohibited. */
+ if (!can_modify_vma(vma))
+ return -EPERM;
+
+ /* Align to hugetlb page size, if required. */
+ if (is_vm_hugetlb_page(vma) && !align_hugetlb(vrm))
+ return -EINVAL;
+
+ vrm_set_delta(vrm);
+ vrm->remap_type = vrm_remap_type(vrm);
+ /* For convenience, we set new_addr even if VMA won't move. */
+ if (!vrm_implies_new_addr(vrm))
+ vrm->new_addr = vrm->addr;
+
+ return 0;
+}
+
static unsigned long do_mremap(struct vma_remap_struct *vrm)
{
struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
unsigned long res;
vrm->old_len = PAGE_ALIGN(vrm->old_len);
@@ -1733,26 +1747,10 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
return -EINTR;
vrm->mmap_locked = true;
- vma = vrm->vma = vma_lookup(mm, vrm->addr);
- if (!vma) {
- res = -EFAULT;
- goto out;
- }
-
- /* If mseal()'d, mremap() is prohibited. */
- if (!can_modify_vma(vma)) {
- res = -EPERM;
- goto out;
- }
-
- /* Align to hugetlb page size, if required. */
- if (is_vm_hugetlb_page(vma) && !align_hugetlb(vrm)) {
- res = -EINVAL;
+ vrm->vma = vma_lookup(current->mm, vrm->addr);
+ res = check_prep_vma(vrm);
+ if (res)
goto out;
- }
-
- vrm_set_delta(vrm);
- vrm->remap_type = vrm_remap_type(vrm);
/* Actually execute mremap. */
res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
--
2.50.0
* [PATCH v3 04/10] mm/mremap: cleanup post-processing stage of mremap
2025-07-11 11:38 [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
` (2 preceding siblings ...)
2025-07-11 11:38 ` [PATCH v3 03/10] mm/mremap: put VMA check and prep logic into helper function Lorenzo Stoakes
@ 2025-07-11 11:38 ` Lorenzo Stoakes
2025-07-11 11:38 ` [PATCH v3 05/10] mm/mremap: use an explicit uffd failure path for mremap Lorenzo Stoakes
` (6 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 11:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest
Separate out the uffd bits so it's clear what's happening.
Don't bother setting vrm->mmap_locked after unlocking, because after this
we are done anyway.
The only time we drop the mmap lock is on VMA shrink, at which point
vrm->new_len will be < vrm->old_len and the operation will not be performed
anyway, so move this code out of the if (vrm->mmap_locked) block.
All addresses returned by mremap() are page-aligned, so the
offset_in_page() check on ret seems only to be incorrectly trying to detect
whether an error occurred - explicitly check for this.
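For reference, a small standalone sketch (not kernel code; the macros are
simplified copies of the kernel's definitions) showing why offset_in_page()
only incidentally distinguishes error values, whereas IS_ERR_VALUE()
expresses the intent directly:

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define MAX_ERRNO	4095UL
#define offset_in_page(x)	((unsigned long)(x) & ~PAGE_MASK)
#define IS_ERR_VALUE(x)	((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

int main(void)
{
	unsigned long ok = 0x7f0000000000UL;	/* page-aligned success value */
	unsigned long err = (unsigned long)-12;	/* -ENOMEM */

	/* Both checks flag the error, but only one states what it means. */
	printf("success: offset_in_page=%lu IS_ERR_VALUE=%d\n",
	       offset_in_page(ok), (int)IS_ERR_VALUE(ok));
	printf("error:   offset_in_page=%lu IS_ERR_VALUE=%d\n",
	       offset_in_page(err), (int)IS_ERR_VALUE(err));
	return 0;
}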
No functional change intended.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/mremap.c | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index d57645573e0d..87cab223f2bb 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1731,6 +1731,15 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
return 0;
}
+static void notify_uffd(struct vma_remap_struct *vrm, unsigned long to)
+{
+ struct mm_struct *mm = current->mm;
+
+ userfaultfd_unmap_complete(mm, vrm->uf_unmap_early);
+ mremap_userfaultfd_complete(vrm->uf, vrm->addr, to, vrm->old_len);
+ userfaultfd_unmap_complete(mm, vrm->uf_unmap);
+}
+
static unsigned long do_mremap(struct vma_remap_struct *vrm)
{
struct mm_struct *mm = current->mm;
@@ -1756,18 +1765,13 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
out:
- if (vrm->mmap_locked) {
+ if (vrm->mmap_locked)
mmap_write_unlock(mm);
- vrm->mmap_locked = false;
-
- if (!offset_in_page(res) && vrm->mlocked && vrm->new_len > vrm->old_len)
- mm_populate(vrm->new_addr + vrm->old_len, vrm->delta);
- }
- userfaultfd_unmap_complete(mm, vrm->uf_unmap_early);
- mremap_userfaultfd_complete(vrm->uf, vrm->addr, res, vrm->old_len);
- userfaultfd_unmap_complete(mm, vrm->uf_unmap);
+ if (!IS_ERR_VALUE(res) && vrm->mlocked && vrm->new_len > vrm->old_len)
+ mm_populate(vrm->new_addr + vrm->old_len, vrm->delta);
+ notify_uffd(vrm, res);
return res;
}
--
2.50.0
* [PATCH v3 05/10] mm/mremap: use an explicit uffd failure path for mremap
2025-07-11 11:38 [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
` (3 preceding siblings ...)
2025-07-11 11:38 ` [PATCH v3 04/10] mm/mremap: cleanup post-processing stage of mremap Lorenzo Stoakes
@ 2025-07-11 11:38 ` Lorenzo Stoakes
2025-07-11 11:38 ` [PATCH v3 06/10] mm/mremap: check remap conditions earlier Lorenzo Stoakes
` (5 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 11:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest
Right now it appears that the code is relying upon the returned destination
address having bits outside PAGE_MASK to indicate whether an error value is
specified, and decrementing the increased refcount on the uffd ctx if so.
This is not a safe means of determining an error value, so instead, be
specific. It makes far more sense to do so in a dedicated error path, so
add mremap_userfaultfd_fail() for this purpose and use this when an error
arises.
A vm_userfaultfd_ctx is not established until we are at the point where
mremap_userfaultfd_prep() is invoked in copy_vma_and_data(), so this is a
no-op until this happens.
That is - uffd remap notification only occurs if the VMA is actually moved
- at which point a UFFD_EVENT_REMAP event is raised.
No errors can occur after this point currently, though it's certainly not
guaranteed this will always remain the case, and we mustn't rely on this.
However, the reason for needing to handle this case is that, when an error
arises on a VMA move at the point of adjusting page tables, we revert this
operation, and propagate the error.
At this point, it is not correct to raise a uffd remap event, and we must
handle it.
This refactoring makes it abundantly clear what we are doing.
We assume vrm->new_addr is always valid, which a prior change made the case
even for mremap() invocations which don't move the VMA, however given no
uffd context would be set up in this case it's immaterial to this change
anyway.
No functional change intended.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
fs/userfaultfd.c | 15 ++++++++++-----
include/linux/userfaultfd_k.h | 5 +++++
mm/mremap.c | 16 ++++++++++++----
3 files changed, 27 insertions(+), 9 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 2a644aa1a510..54c6cc7fe9c6 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -750,11 +750,6 @@ void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *vm_ctx,
if (!ctx)
return;
- if (to & ~PAGE_MASK) {
- userfaultfd_ctx_put(ctx);
- return;
- }
-
msg_init(&ewq.msg);
ewq.msg.event = UFFD_EVENT_REMAP;
@@ -765,6 +760,16 @@ void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *vm_ctx,
userfaultfd_event_wait_completion(ctx, &ewq);
}
+void mremap_userfaultfd_fail(struct vm_userfaultfd_ctx *vm_ctx)
+{
+ struct userfaultfd_ctx *ctx = vm_ctx->ctx;
+
+ if (!ctx)
+ return;
+
+ userfaultfd_ctx_put(ctx);
+}
+
bool userfaultfd_remove(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index df85330bcfa6..c0e716aec26a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -259,6 +259,7 @@ extern void mremap_userfaultfd_prep(struct vm_area_struct *,
extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *,
unsigned long from, unsigned long to,
unsigned long len);
+void mremap_userfaultfd_fail(struct vm_userfaultfd_ctx *);
extern bool userfaultfd_remove(struct vm_area_struct *vma,
unsigned long start,
@@ -371,6 +372,10 @@ static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *ctx,
{
}
+static inline void mremap_userfaultfd_fail(struct vm_userfaultfd_ctx *ctx)
+{
+}
+
static inline bool userfaultfd_remove(struct vm_area_struct *vma,
unsigned long start,
unsigned long end)
diff --git a/mm/mremap.c b/mm/mremap.c
index 87cab223f2bb..b2ee95182b36 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1731,12 +1731,17 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
return 0;
}
-static void notify_uffd(struct vma_remap_struct *vrm, unsigned long to)
+static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
{
struct mm_struct *mm = current->mm;
+ /* Regardless of success/failure, we always notify of any unmaps. */
userfaultfd_unmap_complete(mm, vrm->uf_unmap_early);
- mremap_userfaultfd_complete(vrm->uf, vrm->addr, to, vrm->old_len);
+ if (failed)
+ mremap_userfaultfd_fail(vrm->uf);
+ else
+ mremap_userfaultfd_complete(vrm->uf, vrm->addr,
+ vrm->new_addr, vrm->old_len);
userfaultfd_unmap_complete(mm, vrm->uf_unmap);
}
@@ -1744,6 +1749,7 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
{
struct mm_struct *mm = current->mm;
unsigned long res;
+ bool failed;
vrm->old_len = PAGE_ALIGN(vrm->old_len);
vrm->new_len = PAGE_ALIGN(vrm->new_len);
@@ -1765,13 +1771,15 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
out:
+ failed = IS_ERR_VALUE(res);
+
if (vrm->mmap_locked)
mmap_write_unlock(mm);
- if (!IS_ERR_VALUE(res) && vrm->mlocked && vrm->new_len > vrm->old_len)
+ if (!failed && vrm->mlocked && vrm->new_len > vrm->old_len)
mm_populate(vrm->new_addr + vrm->old_len, vrm->delta);
- notify_uffd(vrm, res);
+ notify_uffd(vrm, failed);
return res;
}
--
2.50.0
* [PATCH v3 06/10] mm/mremap: check remap conditions earlier
2025-07-11 11:38 [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
` (4 preceding siblings ...)
2025-07-11 11:38 ` [PATCH v3 05/10] mm/mremap: use an explicit uffd failure path for mremap Lorenzo Stoakes
@ 2025-07-11 11:38 ` Lorenzo Stoakes
2025-07-11 11:38 ` [PATCH v3 07/10] mm/mremap: move remap_is_valid() into check_prep_vma() Lorenzo Stoakes
` (4 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 11:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest
When we expand or move a VMA, this requires a number of additional checks
to be performed.
Make it really obvious under what circumstances these checks must be
performed and aggregate all the checks in one place by invoking this in
check_prep_vma().
We have to adjust the checks to account for shrink + move operations by
checking new_len <= old_len rather than new_len == old_len.
No functional change intended.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/mremap.c | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index b2ee95182b36..116203766ce0 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1345,7 +1345,7 @@ static int remap_is_valid(struct vma_remap_struct *vrm)
if (old_len > vma->vm_end - addr)
return -EFAULT;
- if (new_len == old_len)
+ if (new_len <= old_len)
return 0;
/* Need to be careful about a growing mapping */
@@ -1445,10 +1445,6 @@ static unsigned long mremap_to(struct vma_remap_struct *vrm)
vrm->old_len = vrm->new_len;
}
- err = remap_is_valid(vrm);
- if (err)
- return err;
-
/* MREMAP_DONTUNMAP expands by old_len since old_len == new_len */
if (vrm->flags & MREMAP_DONTUNMAP) {
vm_flags_t vm_flags = vrm->vma->vm_flags;
@@ -1637,10 +1633,6 @@ static unsigned long expand_vma(struct vma_remap_struct *vrm)
{
unsigned long err;
- err = remap_is_valid(vrm);
- if (err)
- return err;
-
/*
* [addr, old_len) spans precisely to the end of the VMA, so try to
* expand it in-place.
@@ -1707,6 +1699,21 @@ static unsigned long mremap_at(struct vma_remap_struct *vrm)
return -EINVAL;
}
+/*
+ * Will this operation result in the VMA being expanded or moved and thus need
+ * to map a new portion of virtual address space?
+ */
+static bool vrm_will_map_new(struct vma_remap_struct *vrm)
+{
+ if (vrm->remap_type == MREMAP_EXPAND)
+ return true;
+
+ if (vrm_implies_new_addr(vrm))
+ return true;
+
+ return false;
+}
+
static int check_prep_vma(struct vma_remap_struct *vrm)
{
struct vm_area_struct *vma = vrm->vma;
@@ -1728,6 +1735,9 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
if (!vrm_implies_new_addr(vrm))
vrm->new_addr = vrm->addr;
+ if (vrm_will_map_new(vrm))
+ return remap_is_valid(vrm);
+
return 0;
}
--
2.50.0
* [PATCH v3 07/10] mm/mremap: move remap_is_valid() into check_prep_vma()
2025-07-11 11:38 [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
` (5 preceding siblings ...)
2025-07-11 11:38 ` [PATCH v3 06/10] mm/mremap: check remap conditions earlier Lorenzo Stoakes
@ 2025-07-11 11:38 ` Lorenzo Stoakes
2025-07-11 11:38 ` [PATCH v3 08/10] mm/mremap: clean up mlock populate behaviour Lorenzo Stoakes
` (3 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 11:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest
Group parameter check logic together, moving check_mremap_params() next to
it.
This puts all such checks into a single place, and invokes them early so we
can simply bail out as soon as we are aware that a condition is not met.
No functional change intended.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/mremap.c | 273 +++++++++++++++++++++++++---------------------------
1 file changed, 131 insertions(+), 142 deletions(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index 116203766ce0..3f8daa3314f0 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1308,64 +1308,6 @@ static unsigned long move_vma(struct vma_remap_struct *vrm)
return err ? (unsigned long)err : vrm->new_addr;
}
-/*
- * remap_is_valid() - Ensure the VMA can be moved or resized to the new length,
- * at the given address.
- *
- * Return 0 on success, error otherwise.
- */
-static int remap_is_valid(struct vma_remap_struct *vrm)
-{
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma = vrm->vma;
- unsigned long addr = vrm->addr;
- unsigned long old_len = vrm->old_len;
- unsigned long new_len = vrm->new_len;
- unsigned long pgoff;
-
- /*
- * !old_len is a special case where an attempt is made to 'duplicate'
- * a mapping. This makes no sense for private mappings as it will
- * instead create a fresh/new mapping unrelated to the original. This
- * is contrary to the basic idea of mremap which creates new mappings
- * based on the original. There are no known use cases for this
- * behavior. As a result, fail such attempts.
- */
- if (!old_len && !(vma->vm_flags & (VM_SHARED | VM_MAYSHARE))) {
- pr_warn_once("%s (%d): attempted to duplicate a private mapping with mremap. This is not supported.\n",
- current->comm, current->pid);
- return -EINVAL;
- }
-
- if ((vrm->flags & MREMAP_DONTUNMAP) &&
- (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
- return -EINVAL;
-
- /* We can't remap across vm area boundaries */
- if (old_len > vma->vm_end - addr)
- return -EFAULT;
-
- if (new_len <= old_len)
- return 0;
-
- /* Need to be careful about a growing mapping */
- pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
- pgoff += vma->vm_pgoff;
- if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
- return -EINVAL;
-
- if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
- return -EFAULT;
-
- if (!mlock_future_ok(mm, vma->vm_flags, vrm->delta))
- return -EAGAIN;
-
- if (!may_expand_vm(mm, vma->vm_flags, vrm->delta >> PAGE_SHIFT))
- return -ENOMEM;
-
- return 0;
-}
-
/*
* The user has requested that the VMA be shrunk (i.e., old_len > new_len), so
* execute this, optionally dropping the mmap lock when we do so.
@@ -1492,77 +1434,6 @@ static bool vrm_can_expand_in_place(struct vma_remap_struct *vrm)
return true;
}
-/*
- * Are the parameters passed to mremap() valid? If so return 0, otherwise return
- * error.
- */
-static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
-
-{
- unsigned long addr = vrm->addr;
- unsigned long flags = vrm->flags;
-
- /* Ensure no unexpected flag values. */
- if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP))
- return -EINVAL;
-
- /* Start address must be page-aligned. */
- if (offset_in_page(addr))
- return -EINVAL;
-
- /*
- * We allow a zero old-len as a special case
- * for DOS-emu "duplicate shm area" thing. But
- * a zero new-len is nonsensical.
- */
- if (!vrm->new_len)
- return -EINVAL;
-
- /* Is the new length or address silly? */
- if (vrm->new_len > TASK_SIZE ||
- vrm->new_addr > TASK_SIZE - vrm->new_len)
- return -EINVAL;
-
- /* Remainder of checks are for cases with specific new_addr. */
- if (!vrm_implies_new_addr(vrm))
- return 0;
-
- /* The new address must be page-aligned. */
- if (offset_in_page(vrm->new_addr))
- return -EINVAL;
-
- /* A fixed address implies a move. */
- if (!(flags & MREMAP_MAYMOVE))
- return -EINVAL;
-
- /* MREMAP_DONTUNMAP does not allow resizing in the process. */
- if (flags & MREMAP_DONTUNMAP && vrm->old_len != vrm->new_len)
- return -EINVAL;
-
- /* Target VMA must not overlap source VMA. */
- if (vrm_overlaps(vrm))
- return -EINVAL;
-
- /*
- * move_vma() need us to stay 4 maps below the threshold, otherwise
- * it will bail out at the very beginning.
- * That is a problem if we have already unmaped the regions here
- * (new_addr, and old_addr), because userspace will not know the
- * state of the vma's after it gets -ENOMEM.
- * So, to avoid such scenario we can pre-compute if the whole
- * operation has high chances to success map-wise.
- * Worst-scenario case is when both vma's (new_addr and old_addr) get
- * split in 3 before unmapping it.
- * That means 2 more maps (1 for each) to the ones we already hold.
- * Check whether current map count plus 2 still leads us to 4 maps below
- * the threshold, otherwise return -ENOMEM here to be more safe.
- */
- if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3)
- return -ENOMEM;
-
- return 0;
-}
-
/*
* We know we can expand the VMA in-place by delta pages, so do so.
*
@@ -1714,9 +1585,26 @@ static bool vrm_will_map_new(struct vma_remap_struct *vrm)
return false;
}
+static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
+{
+ struct mm_struct *mm = current->mm;
+
+ /* Regardless of success/failure, we always notify of any unmaps. */
+ userfaultfd_unmap_complete(mm, vrm->uf_unmap_early);
+ if (failed)
+ mremap_userfaultfd_fail(vrm->uf);
+ else
+ mremap_userfaultfd_complete(vrm->uf, vrm->addr,
+ vrm->new_addr, vrm->old_len);
+ userfaultfd_unmap_complete(mm, vrm->uf_unmap);
+}
+
static int check_prep_vma(struct vma_remap_struct *vrm)
{
struct vm_area_struct *vma = vrm->vma;
+ struct mm_struct *mm = current->mm;
+ unsigned long addr = vrm->addr;
+ unsigned long old_len, new_len, pgoff;
if (!vma)
return -EFAULT;
@@ -1733,26 +1621,127 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
vrm->remap_type = vrm_remap_type(vrm);
/* For convenience, we set new_addr even if VMA won't move. */
if (!vrm_implies_new_addr(vrm))
- vrm->new_addr = vrm->addr;
+ vrm->new_addr = addr;
+
+ /* Below only meaningful if we expand or move a VMA. */
+ if (!vrm_will_map_new(vrm))
+ return 0;
- if (vrm_will_map_new(vrm))
- return remap_is_valid(vrm);
+ old_len = vrm->old_len;
+ new_len = vrm->new_len;
+
+ /*
+ * !old_len is a special case where an attempt is made to 'duplicate'
+ * a mapping. This makes no sense for private mappings as it will
+ * instead create a fresh/new mapping unrelated to the original. This
+ * is contrary to the basic idea of mremap which creates new mappings
+ * based on the original. There are no known use cases for this
+ * behavior. As a result, fail such attempts.
+ */
+ if (!old_len && !(vma->vm_flags & (VM_SHARED | VM_MAYSHARE))) {
+ pr_warn_once("%s (%d): attempted to duplicate a private mapping with mremap. This is not supported.\n",
+ current->comm, current->pid);
+ return -EINVAL;
+ }
+
+ if ((vrm->flags & MREMAP_DONTUNMAP) &&
+ (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
+ return -EINVAL;
+
+ /* We can't remap across vm area boundaries */
+ if (old_len > vma->vm_end - addr)
+ return -EFAULT;
+
+ if (new_len <= old_len)
+ return 0;
+
+ /* Need to be careful about a growing mapping */
+ pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
+ pgoff += vma->vm_pgoff;
+ if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
+ return -EINVAL;
+
+ if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
+ return -EFAULT;
+
+ if (!mlock_future_ok(mm, vma->vm_flags, vrm->delta))
+ return -EAGAIN;
+
+ if (!may_expand_vm(mm, vma->vm_flags, vrm->delta >> PAGE_SHIFT))
+ return -ENOMEM;
return 0;
}
-static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
+/*
+ * Are the parameters passed to mremap() valid? If so return 0, otherwise return
+ * error.
+ */
+static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
+
{
- struct mm_struct *mm = current->mm;
+ unsigned long addr = vrm->addr;
+ unsigned long flags = vrm->flags;
- /* Regardless of success/failure, we always notify of any unmaps. */
- userfaultfd_unmap_complete(mm, vrm->uf_unmap_early);
- if (failed)
- mremap_userfaultfd_fail(vrm->uf);
- else
- mremap_userfaultfd_complete(vrm->uf, vrm->addr,
- vrm->new_addr, vrm->old_len);
- userfaultfd_unmap_complete(mm, vrm->uf_unmap);
+ /* Ensure no unexpected flag values. */
+ if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP))
+ return -EINVAL;
+
+ /* Start address must be page-aligned. */
+ if (offset_in_page(addr))
+ return -EINVAL;
+
+ /*
+ * We allow a zero old-len as a special case
+ * for DOS-emu "duplicate shm area" thing. But
+ * a zero new-len is nonsensical.
+ */
+ if (!vrm->new_len)
+ return -EINVAL;
+
+ /* Is the new length or address silly? */
+ if (vrm->new_len > TASK_SIZE ||
+ vrm->new_addr > TASK_SIZE - vrm->new_len)
+ return -EINVAL;
+
+ /* Remainder of checks are for cases with specific new_addr. */
+ if (!vrm_implies_new_addr(vrm))
+ return 0;
+
+ /* The new address must be page-aligned. */
+ if (offset_in_page(vrm->new_addr))
+ return -EINVAL;
+
+ /* A fixed address implies a move. */
+ if (!(flags & MREMAP_MAYMOVE))
+ return -EINVAL;
+
+ /* MREMAP_DONTUNMAP does not allow resizing in the process. */
+ if (flags & MREMAP_DONTUNMAP && vrm->old_len != vrm->new_len)
+ return -EINVAL;
+
+ /* Target VMA must not overlap source VMA. */
+ if (vrm_overlaps(vrm))
+ return -EINVAL;
+
+ /*
+ * move_vma() need us to stay 4 maps below the threshold, otherwise
+ * it will bail out at the very beginning.
+ * That is a problem if we have already unmaped the regions here
+ * (new_addr, and old_addr), because userspace will not know the
+ * state of the vma's after it gets -ENOMEM.
+ * So, to avoid such scenario we can pre-compute if the whole
+ * operation has high chances to success map-wise.
+ * Worst-scenario case is when both vma's (new_addr and old_addr) get
+ * split in 3 before unmapping it.
+ * That means 2 more maps (1 for each) to the ones we already hold.
+ * Check whether current map count plus 2 still leads us to 4 maps below
+ * the threshold, otherwise return -ENOMEM here to be more safe.
+ */
+ if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3)
+ return -ENOMEM;
+
+ return 0;
}
static unsigned long do_mremap(struct vma_remap_struct *vrm)
--
2.50.0
* [PATCH v3 08/10] mm/mremap: clean up mlock populate behaviour
2025-07-11 11:38 [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
` (6 preceding siblings ...)
2025-07-11 11:38 ` [PATCH v3 07/10] mm/mremap: move remap_is_valid() into check_prep_vma() Lorenzo Stoakes
@ 2025-07-11 11:38 ` Lorenzo Stoakes
2025-07-11 11:38 ` [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
` (2 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 11:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest
When an mlock()'d VMA is expanded, we need to populate the expanded region
to maintain the contract that all mlock()'d memory is present (albeit with
some period after mmap unlock where the expanded part of the mapping
remains unfaulted).
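As a rough userspace illustration of that contract (a hedged sketch assuming
4 KiB pages and an RLIMIT_MEMLOCK allowance of at least two pages), mincore()
should report the newly added page resident shortly after the expanding
mremap() returns:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t pg = 4096;
	unsigned char vec[2] = { 0, 0 };
	char *p;

	p = mmap(NULL, pg, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED || mlock(p, pg))
		return 1;

	/* Grow the mlock()'d mapping; the kernel populates the new tail. */
	p = mremap(p, pg, 2 * pg, MREMAP_MAYMOVE);
	if (p == MAP_FAILED || mincore(p, 2 * pg, vec))
		return 1;

	printf("old page resident: %d, new page resident: %d\n",
	       vec[0] & 1, vec[1] & 1);
	return 0;
}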
The current implementation is very unclear, so make it absolutely explicit
under what circumstances we do this.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/mremap.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index 3f8daa3314f0..8cb08ccea6ad 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -65,7 +65,7 @@ struct vma_remap_struct {
/* Internal state, determined in do_mremap(). */
unsigned long delta; /* Absolute delta of old_len,new_len. */
- bool mlocked; /* Was the VMA mlock()'d? */
+ bool populate_expand; /* mlock()'d expanded, must populate. */
enum mremap_type remap_type; /* expand, shrink, etc. */
bool mmap_locked; /* Is mm currently write-locked? */
unsigned long charged; /* If VM_ACCOUNT, # pages to account. */
@@ -1012,10 +1012,8 @@ static void vrm_stat_account(struct vma_remap_struct *vrm,
struct vm_area_struct *vma = vrm->vma;
vm_stat_account(mm, vma->vm_flags, pages);
- if (vma->vm_flags & VM_LOCKED) {
+ if (vma->vm_flags & VM_LOCKED)
mm->locked_vm += pages;
- vrm->mlocked = true;
- }
}
/*
@@ -1655,6 +1653,10 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
if (new_len <= old_len)
return 0;
+ /* We are expanding and the VMA is mlock()'d so we need to populate. */
+ if (vma->vm_flags & VM_LOCKED)
+ vrm->populate_expand = true;
+
/* Need to be careful about a growing mapping */
pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
pgoff += vma->vm_pgoff;
@@ -1775,7 +1777,8 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
if (vrm->mmap_locked)
mmap_write_unlock(mm);
- if (!failed && vrm->mlocked && vrm->new_len > vrm->old_len)
+ /* VMA mlock'd + was expanded, so populate expanded region. */
+ if (!failed && vrm->populate_expand)
mm_populate(vrm->new_addr + vrm->old_len, vrm->delta);
notify_uffd(vrm, failed);
--
2.50.0
* [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-11 11:38 [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
` (7 preceding siblings ...)
2025-07-11 11:38 ` [PATCH v3 08/10] mm/mremap: clean up mlock populate behaviour Lorenzo Stoakes
@ 2025-07-11 11:38 ` Lorenzo Stoakes
2025-07-11 13:34 ` Vlastimil Babka
` (3 more replies)
2025-07-11 11:38 ` [PATCH v3 10/10] tools/testing/selftests: extend mremap_test to test multi-VMA mremap Lorenzo Stoakes
2025-07-11 13:45 ` [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
10 siblings, 4 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 11:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case for anonymous mappings, where a moved mapping may
or may not be mergeable with its neighbours depending on whether the VMAs
involved have been faulted (and thus had an anon_vma assigned) and whether
folio indices align with vma->vm_pgoff.
Often this results in surprising behaviour where a region is faulted, moved
and then moved back, and the user fails to observe a merge of otherwise
compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations have
resulted in VMA fragmentation.
We only permit this for mremap() operations that do NOT change the size of
the VMA and DO specify MREMAP_MAYMOVE | MREMAP_FIXED.
Should no VMA exist in the range, -EFAULT is returned as usual.
If a VMA move spans only a single VMA, then there is no functional change.
Otherwise, we place additional requirements upon VMAs:
* They must not have a userfaultfd context associated with them - this
requires dropping the lock to notify users, and we want to perform the
operation with the mmap write lock held throughout.
* If file-backed, they cannot have a custom get_unmapped_area handler -
this might result in MREMAP_FIXED not being honoured, which could result
in unexpected positioning of VMAs in the moved region.
There may be gaps in the range of VMAs that are moved:
           X          Y                         X          Y
         <--->       <->                      <--->       <->
|-------|     |-----|   |-----|      |-------|     |-----|   |-----|
|   A   |     |  B  |   |  C  | ---> |  A'   |     |  B' |   |  C' |
|-------|     |-----|   |-----|      |-------|     |-----|   |-----|
addr                                 new_addr
The move will preserve the gaps between each VMA.
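As a hedged userspace sketch of the gap-preserving behaviour (assuming 4 KiB
pages; illustrative only, not taken from the selftests), data written at a
given offset within the source range reappears at the same offset from
new_addr:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t pg = 4096, len = 5 * pg;
	char *src, *dst;

	src = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	dst = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED || dst == MAP_FAILED)
		return 1;

	memset(src, 'A', pg);			/* VMA A: page 0 */
	memset(src + 2 * pg, 'B', 3 * pg);	/* VMA B: pages 2-4 */
	if (munmap(src + pg, pg))		/* gap: page 1 */
		return 1;

	/* Move both VMAs, and the gap between them, in a single call. */
	if (mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst) == MAP_FAILED)
		return 1;

	/* Offsets within the moved range are preserved, gap included. */
	printf("A' at +0: %c, B' at +2 pages: %c\n", dst[0], dst[2 * pg]);
	return 0;
}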
Note that any failures encountered will result in a partial move. Since an
mremap() can fail at any time, this might result in only some of the VMAs
being moved.
Note that failures are very rare and typically require an out-of-memory
condition or a mapping limit condition to be hit, assuming the VMAs being
moved are valid.
We don't try to assess ahead of time whether VMAs are valid according to
the multi VMA rules, as it would be rather unusual for a user to mix
uffd-enabled VMAs and/or VMAs which map unusual driver mappings that
specify custom get_unmapped_area() handlers in an aggregate operation.
So we optimise for the far, far more likely case of the operation being
entirely permissible.
In the case of the move of a single VMA, the above conditions are
permitted. This makes the behaviour identical for a single VMA as before.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mremap.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 150 insertions(+), 7 deletions(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index 8cb08ccea6ad..59f49de0f84e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -69,6 +69,8 @@ struct vma_remap_struct {
enum mremap_type remap_type; /* expand, shrink, etc. */
bool mmap_locked; /* Is mm currently write-locked? */
unsigned long charged; /* If VM_ACCOUNT, # pages to account. */
+ bool seen_vma; /* Is >1 VMA being moved? */
+ bool vmi_needs_reset; /* Was the VMA iterator invalidated? */
};
static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
@@ -1188,6 +1190,9 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
*new_vma_ptr = NULL;
return -ENOMEM;
}
+ if (vma != vrm->vma)
+ vrm->vmi_needs_reset = true;
+
vrm->vma = vma;
pmc.old = vma;
pmc.new = new_vma;
@@ -1583,6 +1588,18 @@ static bool vrm_will_map_new(struct vma_remap_struct *vrm)
return false;
}
+/* Does this remap ONLY move mappings? */
+static bool vrm_move_only(struct vma_remap_struct *vrm)
+{
+ if (!(vrm->flags & MREMAP_FIXED))
+ return false;
+
+ if (vrm->old_len != vrm->new_len)
+ return false;
+
+ return true;
+}
+
static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
{
struct mm_struct *mm = current->mm;
@@ -1597,6 +1614,32 @@ static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
userfaultfd_unmap_complete(mm, vrm->uf_unmap);
}
+static bool vma_multi_allowed(struct vm_area_struct *vma)
+{
+ struct file *file;
+
+ /*
+ * We can't support moving multiple uffd VMAs as notify requires
+ * mmap lock to be dropped.
+ */
+ if (userfaultfd_armed(vma))
+ return false;
+
+ /*
+ * Custom get unmapped area might result in MREMAP_FIXED not
+ * being obeyed.
+ */
+ file = vma->vm_file;
+ if (file && !vma_is_shmem(vma) && !is_vm_hugetlb_page(vma)) {
+ const struct file_operations *fop = file->f_op;
+
+ if (fop->get_unmapped_area)
+ return false;
+ }
+
+ return true;
+}
+
static int check_prep_vma(struct vma_remap_struct *vrm)
{
struct vm_area_struct *vma = vrm->vma;
@@ -1646,7 +1689,19 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
(vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
return -EINVAL;
- /* We can't remap across vm area boundaries */
+ /*
+ * We can't remap across the end of VMAs, as another VMA may be
+ * adjacent:
+ *
+ *       addr     vma->vm_end
+ *   |-----.----------|
+ *   |     .          |
+ *   |-----.----------|
+ *         .<--------->xxx>
+ *             old_len
+ *
+ * We also require that vma->vm_start <= addr < vma->vm_end.
+ */
if (old_len > vma->vm_end - addr)
return -EFAULT;
@@ -1746,6 +1801,90 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
return 0;
}
+static unsigned long remap_move(struct vma_remap_struct *vrm)
+{
+ struct vm_area_struct *vma;
+ unsigned long start = vrm->addr;
+ unsigned long end = vrm->addr + vrm->old_len;
+ unsigned long new_addr = vrm->new_addr;
+ unsigned long target_addr = new_addr;
+ unsigned long res = -EFAULT;
+ unsigned long last_end;
+ bool allowed = true;
+ VMA_ITERATOR(vmi, current->mm, start);
+
+ /*
+ * When moving VMAs we allow for batched moves across multiple VMAs,
+ * with all VMAs in the input range [addr, addr + old_len) being moved
+ * (and split as necessary).
+ */
+ for_each_vma_range(vmi, vma, end) {
+ /* Account for start, end not aligned with VMA start, end. */
+ unsigned long addr = max(vma->vm_start, start);
+ unsigned long len = min(end, vma->vm_end) - addr;
+ unsigned long offset, res_vma;
+
+ if (!allowed)
+ return -EFAULT;
+
+ /* No gap permitted at the start of the range. */
+ if (!vrm->seen_vma && start < vma->vm_start)
+ return -EFAULT;
+
+ /*
+ * To sensibly move multiple VMAs, accounting for the fact that
+ * get_unmapped_area() may align even MAP_FIXED moves, we simply
+ * attempt to move such that the gaps between source VMAs remain
+ * consistent in destination VMAs, e.g.:
+ *
+ *            X          Y                         X          Y
+ *          <--->       <->                      <--->       <->
+ * |-------|     |-----|   |-----|      |-------|     |-----|   |-----|
+ * |   A   |     |  B  |   |  C  | ---> |  A'   |     |  B' |   |  C' |
+ * |-------|     |-----|   |-----|      |-------|     |-----|   |-----|
+ *                                      new_addr
+ *
+ * So we map B' at A'->vm_end + X, and C' at B'->vm_end + Y.
+ */
+ offset = vrm->seen_vma ? vma->vm_start - last_end : 0;
+ last_end = vma->vm_end;
+
+ vrm->vma = vma;
+ vrm->addr = addr;
+ vrm->new_addr = target_addr + offset;
+ vrm->old_len = vrm->new_len = len;
+
+ allowed = vma_multi_allowed(vma);
+ if (vrm->seen_vma && !allowed)
+ return -EFAULT;
+
+ res_vma = check_prep_vma(vrm);
+ if (!res_vma)
+ res_vma = mremap_to(vrm);
+ if (IS_ERR_VALUE(res_vma))
+ return res_vma;
+
+ if (!vrm->seen_vma) {
+ VM_WARN_ON_ONCE(allowed && res_vma != new_addr);
+ res = res_vma;
+ }
+
+ /* mmap lock is only dropped on shrink. */
+ VM_WARN_ON_ONCE(!vrm->mmap_locked);
+ /* This is a move, no expand should occur. */
+ VM_WARN_ON_ONCE(vrm->populate_expand);
+
+ if (vrm->vmi_needs_reset) {
+ vma_iter_reset(&vmi);
+ vrm->vmi_needs_reset = false;
+ }
+ vrm->seen_vma = true;
+ target_addr = res_vma + vrm->new_len;
+ }
+
+ return res;
+}
+
static unsigned long do_mremap(struct vma_remap_struct *vrm)
{
struct mm_struct *mm = current->mm;
@@ -1763,13 +1902,17 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
return -EINTR;
vrm->mmap_locked = true;
- vrm->vma = vma_lookup(current->mm, vrm->addr);
- res = check_prep_vma(vrm);
- if (res)
- goto out;
+ if (vrm_move_only(vrm)) {
+ res = remap_move(vrm);
+ } else {
+ vrm->vma = vma_lookup(current->mm, vrm->addr);
+ res = check_prep_vma(vrm);
+ if (res)
+ goto out;
- /* Actually execute mremap. */
- res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
+ /* Actually execute mremap. */
+ res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
+ }
out:
failed = IS_ERR_VALUE(res);
--
2.50.0
* [PATCH v3 10/10] tools/testing/selftests: extend mremap_test to test multi-VMA mremap
2025-07-11 11:38 [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
` (8 preceding siblings ...)
2025-07-11 11:38 ` [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
@ 2025-07-11 11:38 ` Lorenzo Stoakes
2025-07-11 13:45 ` [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
10 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 11:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest
Now that we have added the ability to move multiple VMAs at once, assert
that this functions correctly, both overwriting VMAs and moving backwards
and forwards with merge and VMA invalidation.
Additionally assert that page tables are correctly propagated by setting
random data and reading it back.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
tools/testing/selftests/mm/mremap_test.c | 146 ++++++++++++++++++++++-
1 file changed, 145 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
index bb84476a177f..0a49be11e614 100644
--- a/tools/testing/selftests/mm/mremap_test.c
+++ b/tools/testing/selftests/mm/mremap_test.c
@@ -380,6 +380,149 @@ static void mremap_move_within_range(unsigned int pattern_seed, char *rand_addr)
ksft_test_result_fail("%s\n", test_name);
}
+static bool is_multiple_vma_range_ok(unsigned int pattern_seed,
+ char *ptr, unsigned long page_size)
+{
+ int i;
+
+ srand(pattern_seed);
+ for (i = 0; i <= 10; i += 2) {
+ int j;
+ char *buf = &ptr[i * page_size];
+ size_t size = i == 4 ? 2 * page_size : page_size;
+
+ for (j = 0; j < size; j++) {
+ char chr = rand();
+
+ if (chr != buf[j]) {
+ ksft_print_msg("page %d offset %d corrupted, expected %d got %d\n",
+ i, j, chr, buf[j]);
+ return false;
+ }
+ }
+ }
+
+ return true;
+}
+
+static void mremap_move_multiple_vmas(unsigned int pattern_seed,
+ unsigned long page_size)
+{
+ char *test_name = "mremap move multiple vmas";
+ const size_t size = 11 * page_size;
+ bool success = true;
+ char *ptr, *tgt_ptr;
+ int i;
+
+ ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (ptr == MAP_FAILED) {
+ perror("mmap");
+ success = false;
+ goto out;
+ }
+
+ tgt_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (tgt_ptr == MAP_FAILED) {
+ perror("mmap");
+ success = false;
+ goto out;
+ }
+ if (munmap(tgt_ptr, 2 * size)) {
+ perror("munmap");
+ success = false;
+ goto out_unmap;
+ }
+
+ /*
+ * Unmap so we end up with:
+ *
+ * 0 2 4 5 6 8 10 offset in buffer
+ * |*| |*| |*****| |*| |*|
+ * |*| |*| |*****| |*| |*|
+ * 0 1 2 3 4 5 6 pattern offset
+ */
+ for (i = 1; i < 10; i += 2) {
+ if (i == 5)
+ continue;
+
+ if (munmap(&ptr[i * page_size], page_size)) {
+ perror("munmap");
+ success = false;
+ goto out_unmap;
+ }
+ }
+
+ srand(pattern_seed);
+
+ /* Set up random patterns. */
+ for (i = 0; i <= 10; i += 2) {
+ int j;
+ size_t size = i == 4 ? 2 * page_size : page_size;
+ char *buf = &ptr[i * page_size];
+
+ for (j = 0; j < size; j++)
+ buf[j] = rand();
+ }
+
+ /* First, just move the whole thing. */
+ if (mremap(ptr, size, size,
+ MREMAP_MAYMOVE | MREMAP_FIXED, tgt_ptr) == MAP_FAILED) {
+ perror("mremap");
+ success = false;
+ goto out_unmap;
+ }
+ /* Check move was ok. */
+ if (!is_multiple_vma_range_ok(pattern_seed, tgt_ptr, page_size)) {
+ success = false;
+ goto out_unmap;
+ }
+
+ /* Move next to itself. */
+ if (mremap(tgt_ptr, size, size,
+ MREMAP_MAYMOVE | MREMAP_FIXED, &tgt_ptr[size]) == MAP_FAILED) {
+ perror("mremap");
+ goto out_unmap;
+ }
+ /* Check that the move is ok. */
+ if (!is_multiple_vma_range_ok(pattern_seed, &tgt_ptr[size], page_size)) {
+ success = false;
+ goto out_unmap;
+ }
+
+ /* Map a range to overwrite. */
+ if (mmap(tgt_ptr, size, PROT_NONE,
+ MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0) == MAP_FAILED) {
+ perror("mmap tgt");
+ success = false;
+ goto out_unmap;
+ }
+ /* Move and overwrite. */
+ if (mremap(&tgt_ptr[size], size, size,
+ MREMAP_MAYMOVE | MREMAP_FIXED, tgt_ptr) == MAP_FAILED) {
+ perror("mremap");
+ goto out_unmap;
+ }
+ /* Check that the move is ok. */
+ if (!is_multiple_vma_range_ok(pattern_seed, tgt_ptr, page_size)) {
+ success = false;
+ goto out_unmap;
+ }
+
+out_unmap:
+ if (munmap(tgt_ptr, 2 * size))
+ perror("munmap tgt");
+ if (munmap(ptr, size))
+ perror("munmap src");
+
+out:
+ if (success)
+ ksft_test_result_pass("%s\n", test_name);
+ else
+ ksft_test_result_fail("%s\n", test_name);
+}
+
/* Returns the time taken for the remap on success else returns -1. */
static long long remap_region(struct config c, unsigned int threshold_mb,
char *rand_addr)
@@ -721,7 +864,7 @@ int main(int argc, char **argv)
char *rand_addr;
size_t rand_size;
int num_expand_tests = 2;
- int num_misc_tests = 2;
+ int num_misc_tests = 3;
struct test test_cases[MAX_TEST] = {};
struct test perf_test_cases[MAX_PERF_TEST];
int page_size;
@@ -848,6 +991,7 @@ int main(int argc, char **argv)
mremap_move_within_range(pattern_seed, rand_addr);
mremap_move_1mb_from_start(pattern_seed, rand_addr);
+ mremap_move_multiple_vmas(pattern_seed, page_size);
if (run_perf_tests) {
ksft_print_msg("\n%s\n",
--
2.50.0
* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-11 11:38 ` [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
@ 2025-07-11 13:34 ` Vlastimil Babka
2025-07-11 13:49 ` Lorenzo Stoakes
2025-07-11 14:14 ` Lorenzo Stoakes
` (2 subsequent siblings)
3 siblings, 1 reply; 20+ messages in thread
From: Vlastimil Babka @ 2025-07-11 13:34 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Jann Horn, Pedro Falcato, Rik van Riel,
linux-mm, linux-fsdevel, linux-kernel, linux-kselftest, Linux API
+cc linux-api - see the description of the new behavior below
On 7/11/25 13:38, Lorenzo Stoakes wrote:
> Historically we've made it a uAPI requirement that mremap() may only
> operate on a single VMA at a time.
>
> For instances where VMAs need to be resized, this makes sense, as it
> becomes very difficult to determine what a user actually wants should they
> indicate a desire to expand or shrink the size of multiple VMAs (truncate?
> Adjust sizes individually? Some other strategy?).
>
> However, in instances where a user is moving VMAs, it is restrictive to
> disallow this.
>
> This is especially the case when anonymous mapping remap may or may not be
> mergeable depending on whether VMAs have or have not been faulted due to
> anon_vma assignment and folio index alignment with vma->vm_pgoff.
>
> Often this can result in surprising impact where a moved region is faulted,
> then moved back and a user fails to observe a merge from otherwise
> compatible, adjacent VMAs.
>
> This change allows such cases to work without the user having to be
> cognizant of whether a prior mremap() move or other VMA operations has
> resulted in VMA fragmentation.
>
> We only permit this for mremap() operations that do NOT change the size of
> the VMA and DO specify MREMAP_MAYMOVE | MREMAP_FIXED.
>
> Should no VMA exist in the range, -EFAULT is returned as usual.
>
> If a VMA move spans a single VMA - then there is no functional change.
>
> Otherwise, we place additional requirements upon VMAs:
>
> * They must not have a userfaultfd context associated with them - this
> requires dropping the lock to notify users, and we want to perform the
> operation with the mmap write lock held throughout.
>
> * If file-backed, they cannot have a custom get_unmapped_area handler -
> this might result in MREMAP_FIXED not being honoured, which could result
> in unexpected positioning of VMAs in the moved region.
>
> There may be gaps in the range of VMAs that are moved:
>
> X Y X Y
> <---> <-> <---> <->
> |-------| |-----| |-----| |-------| |-----| |-----|
> | A | | B | | C | ---> | A' | | B' | | C' |
> |-------| |-----| |-----| |-------| |-----| |-----|
> addr new_addr
>
> The move will preserve the gaps between each VMA.
AFAIU "moving a gap" doesn't mean we unmap anything pre-existing where the
moved gap's range falls to, right? Worth pointing out explicitly.
> Note that any failures encountered will result in a partial move. Since an
> mremap() can fail at any time, this might result in only some of the VMAs
> being moved.
>
> Note that failures are very rare and typically require an out of a memory
> condition or a mapping limit condition to be hit, assuming the VMAs being
> moved are valid.
>
> We don't try to assess ahead of time whether VMAs are valid according to
> the multi VMA rules, as it would be rather unusual for a user to mix
> uffd-enabled VMAs and/or VMAs which map unusual driver mappings that
> specify custom get_unmapped_area() handlers in an aggregate operation.
>
> So we optimise for the far, far more likely case of the operation being
> entirely permissible.
Guess it's the sanest thing to do given all the circumstances.
> In the case of the move of a single VMA, the above conditions are
> permitted. This makes the behaviour identical for a single VMA as before.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Some nits:
> ---
> mm/mremap.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 150 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 8cb08ccea6ad..59f49de0f84e 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -69,6 +69,8 @@ struct vma_remap_struct {
> enum mremap_type remap_type; /* expand, shrink, etc. */
> bool mmap_locked; /* Is mm currently write-locked? */
> unsigned long charged; /* If VM_ACCOUNT, # pages to account. */
> + bool seen_vma; /* Is >1 VMA being moved? */
Seems this could be a local variable of remap_move().
> + bool vmi_needs_reset; /* Was the VMA iterator invalidated? */
> };
>
> static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> @@ -1188,6 +1190,9 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
> *new_vma_ptr = NULL;
> return -ENOMEM;
> }
A newline here?
> + if (vma != vrm->vma)
> + vrm->vmi_needs_reset = true;
A comment on what this condition means wouldn't hurt? Is it when "Source vma
may have been merged into new_vma" in copy_vma(), or when not?
(no comments below, remaining quoted part left for linux-api reference)
> +
> vrm->vma = vma;
> pmc.old = vma;
> pmc.new = new_vma;
> @@ -1583,6 +1588,18 @@ static bool vrm_will_map_new(struct vma_remap_struct *vrm)
> return false;
> }
>
> +/* Does this remap ONLY move mappings? */
> +static bool vrm_move_only(struct vma_remap_struct *vrm)
> +{
> + if (!(vrm->flags & MREMAP_FIXED))
> + return false;
> +
> + if (vrm->old_len != vrm->new_len)
> + return false;
> +
> + return true;
> +}
> +
> static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
> {
> struct mm_struct *mm = current->mm;
> @@ -1597,6 +1614,32 @@ static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
> userfaultfd_unmap_complete(mm, vrm->uf_unmap);
> }
>
> +static bool vma_multi_allowed(struct vm_area_struct *vma)
> +{
> + struct file *file;
> +
> + /*
> + * We can't support moving multiple uffd VMAs as notify requires
> + * mmap lock to be dropped.
> + */
> + if (userfaultfd_armed(vma))
> + return false;
> +
> + /*
> + * Custom get unmapped area might result in MREMAP_FIXED not
> + * being obeyed.
> + */
> + file = vma->vm_file;
> + if (file && !vma_is_shmem(vma) && !is_vm_hugetlb_page(vma)) {
> + const struct file_operations *fop = file->f_op;
> +
> + if (fop->get_unmapped_area)
> + return false;
> + }
> +
> + return true;
> +}
> +
> static int check_prep_vma(struct vma_remap_struct *vrm)
> {
> struct vm_area_struct *vma = vrm->vma;
> @@ -1646,7 +1689,19 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
> (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
> return -EINVAL;
>
> - /* We can't remap across vm area boundaries */
> + /*
> + * We can't remap across the end of VMAs, as another VMA may be
> + * adjacent:
> + *
> + * addr vma->vm_end
> + * |-----.----------|
> + * | . |
> + * |-----.----------|
> + * .<--------->xxx>
> + * old_len
> + *
> + * We also require that vma->vm_start <= addr < vma->vm_end.
> + */
> if (old_len > vma->vm_end - addr)
> return -EFAULT;
>
> @@ -1746,6 +1801,90 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
> return 0;
> }
>
> +static unsigned long remap_move(struct vma_remap_struct *vrm)
> +{
> + struct vm_area_struct *vma;
> + unsigned long start = vrm->addr;
> + unsigned long end = vrm->addr + vrm->old_len;
> + unsigned long new_addr = vrm->new_addr;
> + unsigned long target_addr = new_addr;
> + unsigned long res = -EFAULT;
> + unsigned long last_end;
> + bool allowed = true;
> + VMA_ITERATOR(vmi, current->mm, start);
> +
> + /*
> + * When moving VMAs we allow for batched moves across multiple VMAs,
> + * with all VMAs in the input range [addr, addr + old_len) being moved
> + * (and split as necessary).
> + */
> + for_each_vma_range(vmi, vma, end) {
> + /* Account for start, end not aligned with VMA start, end. */
> + unsigned long addr = max(vma->vm_start, start);
> + unsigned long len = min(end, vma->vm_end) - addr;
> + unsigned long offset, res_vma;
> +
> + if (!allowed)
> + return -EFAULT;
> +
> + /* No gap permitted at the start of the range. */
> + if (!vrm->seen_vma && start < vma->vm_start)
> + return -EFAULT;
> +
> + /*
> + * To sensibly move multiple VMAs, accounting for the fact that
> + * get_unmapped_area() may align even MAP_FIXED moves, we simply
> + * attempt to move such that the gaps between source VMAs remain
> + * consistent in destination VMAs, e.g.:
> + *
> + * X Y X Y
> + * <---> <-> <---> <->
> + * |-------| |-----| |-----| |-------| |-----| |-----|
> + * | A | | B | | C | ---> | A' | | B' | | C' |
> + * |-------| |-----| |-----| |-------| |-----| |-----|
> + * new_addr
> + *
> + * So we map B' at A'->vm_end + X, and C' at B'->vm_end + Y.
> + */
> + offset = vrm->seen_vma ? vma->vm_start - last_end : 0;
> + last_end = vma->vm_end;
> +
> + vrm->vma = vma;
> + vrm->addr = addr;
> + vrm->new_addr = target_addr + offset;
> + vrm->old_len = vrm->new_len = len;
> +
> + allowed = vma_multi_allowed(vma);
> + if (vrm->seen_vma && !allowed)
> + return -EFAULT;
> +
> + res_vma = check_prep_vma(vrm);
> + if (!res_vma)
> + res_vma = mremap_to(vrm);
> + if (IS_ERR_VALUE(res_vma))
> + return res_vma;
> +
> + if (!vrm->seen_vma) {
> + VM_WARN_ON_ONCE(allowed && res_vma != new_addr);
> + res = res_vma;
> + }
> +
> + /* mmap lock is only dropped on shrink. */
> + VM_WARN_ON_ONCE(!vrm->mmap_locked);
> + /* This is a move, no expand should occur. */
> + VM_WARN_ON_ONCE(vrm->populate_expand);
> +
> + if (vrm->vmi_needs_reset) {
> + vma_iter_reset(&vmi);
> + vrm->vmi_needs_reset = false;
> + }
> + vrm->seen_vma = true;
> + target_addr = res_vma + vrm->new_len;
> + }
> +
> + return res;
> +}
> +
> static unsigned long do_mremap(struct vma_remap_struct *vrm)
> {
> struct mm_struct *mm = current->mm;
> @@ -1763,13 +1902,17 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
> return -EINTR;
> vrm->mmap_locked = true;
>
> - vrm->vma = vma_lookup(current->mm, vrm->addr);
> - res = check_prep_vma(vrm);
> - if (res)
> - goto out;
> + if (vrm_move_only(vrm)) {
> + res = remap_move(vrm);
> + } else {
> + vrm->vma = vma_lookup(current->mm, vrm->addr);
> + res = check_prep_vma(vrm);
> + if (res)
> + goto out;
>
> - /* Actually execute mremap. */
> - res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
> + /* Actually execute mremap. */
> + res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
> + }
>
> out:
> failed = IS_ERR_VALUE(res);
* Re: [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-11 11:38 [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
` (9 preceding siblings ...)
2025-07-11 11:38 ` [PATCH v3 10/10] tools/testing/selftests: extend mremap_test to test multi-VMA mremap Lorenzo Stoakes
@ 2025-07-11 13:45 ` Lorenzo Stoakes
10 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 13:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest, linux-api
Apologies, I've done it again... :) Friday eh?
+cc linux-api - FYI guys.
I will of course also update the manpage to reflect this change.
On Fri, Jul 11, 2025 at 12:38:14PM +0100, Lorenzo Stoakes wrote:
> Historically we've made it a uAPI requirement that mremap() may only
> operate on a single VMA at a time.
>
> For instances where VMAs need to be resized, this makes sense, as it
> becomes very difficult to determine what a user actually wants should they
> indicate a desire to expand or shrink the size of multiple VMAs (truncate?
> Adjust sizes individually? Some other strategy?).
>
> However, in instances where a user is moving VMAs, it is restrictive to
> disallow this.
>
> This is especially the case when anonymous mapping remap may or may not be
> mergeable depending on whether VMAs have or have not been faulted due to
> anon_vma assignment and folio index alignment with vma->vm_pgoff.
>
> Often this can result in surprising impact where a moved region is faulted,
> then moved back and a user fails to observe a merge from otherwise
> compatible, adjacent VMAs.
>
> This change allows such cases to work without the user having to be
> cognizant of whether a prior mremap() move or other VMA operations has
> resulted in VMA fragmentation.
>
> In order to do this, this series performs a large amount of refactoring,
> most pertinently - grouping sanity checks together, separately those that
> check input parameters and those relating to VMAs.
>
> we also simplify the post-mmap lock drop processing for uffd and mlock()'d
> VMAs.
>
> With this done, we can then fairly straightforwardly implement this
> functionality.
>
> This works exclusively for mremap() invocations which specify
> MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
> notification of the userland fault handler would require us to drop the
> mmap lock.
>
> It is also not compatible with file-backed mappings with customised
> get_unmapped_area() handlers as these may not honour MREMAP_FIXED.
>
> The input and output addresses ranges must not overlap. We carefully
> account for moves which would result in VMA iterator invalidation.
>
> While there can be gaps between VMAs in the input range, there can be no
> gap before the first VMA in the range.
>
>
> v3:
> * Disallowed move operation except for MREMAP_FIXED.
> * Disallow gap at start of aggregate range to avoid confusion.
> * Disallow any file-baked VMAs with custom get_unmapped_area.
> * Renamed multi_vma to seen_vma to be clearer. Stop reusing new_addr, use
> separate target_addr var to track next target address.
> * Check if first VMA fails multi VMA check, if so we'll allow one VMA but
> not multiple.
> * Updated the commit message for patch 9 to be clearer about gap behaviour.
> * Removed accidentally included debug goto statement in test (doh!). Test
> was and is passing regardless.
> * Unmap target range in test, previously we ended up moving additional VMAs
> unintentionally. This still all passed :) but was not what was intended.
> * Removed self-merge check - there is absolutely no way this can happen
> across multiple VMAs, as there is no means of moving VMAs such that a VMA
> merges with itself.
>
> v2:
> * Squashed uffd stub fix into series.
> * Propagated tags, thanks!
> * Fixed param naming in patch 4 as per Vlastimil.
> * Renamed vma_reset to vmi_needs_reset + dropped reset on unmap as per
> Liam.
> * Correctly return -EFAULT if no VMAs in input range.
> * Account for get_unmapped_area() disregarding MAP_FIXED and returning an
> altered address.
> * Added additional explanatatory comment to the remap_move() function.
> https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/
>
> v1:
> https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/
>
>
> Lorenzo Stoakes (10):
> mm/mremap: perform some simple cleanups
> mm/mremap: refactor initial parameter sanity checks
> mm/mremap: put VMA check and prep logic into helper function
> mm/mremap: cleanup post-processing stage of mremap
> mm/mremap: use an explicit uffd failure path for mremap
> mm/mremap: check remap conditions earlier
> mm/mremap: move remap_is_valid() into check_prep_vma()
> mm/mremap: clean up mlock populate behaviour
> mm/mremap: permit mremap() move of multiple VMAs
> tools/testing/selftests: extend mremap_test to test multi-VMA mremap
>
> fs/userfaultfd.c | 15 +-
> include/linux/userfaultfd_k.h | 5 +
> mm/mremap.c | 553 +++++++++++++++--------
> tools/testing/selftests/mm/mremap_test.c | 146 +++++-
> 4 files changed, 518 insertions(+), 201 deletions(-)
>
> --
> 2.50.0
* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-11 13:34 ` Vlastimil Babka
@ 2025-07-11 13:49 ` Lorenzo Stoakes
0 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 13:49 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
Jan Kara, Liam R . Howlett, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest, Linux API
On Fri, Jul 11, 2025 at 03:34:23PM +0200, Vlastimil Babka wrote:
> +cc linux-api - see the description of the new behavior below
Ah yeah :) I sent on 0/10 also. Friday...
>
> On 7/11/25 13:38, Lorenzo Stoakes wrote:
> > Historically we've made it a uAPI requirement that mremap() may only
> > operate on a single VMA at a time.
> >
> > For instances where VMAs need to be resized, this makes sense, as it
> > becomes very difficult to determine what a user actually wants should they
> > indicate a desire to expand or shrink the size of multiple VMAs (truncate?
> > Adjust sizes individually? Some other strategy?).
> >
> > However, in instances where a user is moving VMAs, it is restrictive to
> > disallow this.
> >
> > This is especially the case when anonymous mapping remap may or may not be
> > mergeable depending on whether VMAs have or have not been faulted due to
> > anon_vma assignment and folio index alignment with vma->vm_pgoff.
> >
> > Often this can result in surprising impact where a moved region is faulted,
> > then moved back and a user fails to observe a merge from otherwise
> > compatible, adjacent VMAs.
> >
> > This change allows such cases to work without the user having to be
> > cognizant of whether a prior mremap() move or other VMA operations has
> > resulted in VMA fragmentation.
> >
> > We only permit this for mremap() operations that do NOT change the size of
> > the VMA and DO specify MREMAP_MAYMOVE | MREMAP_FIXED.
> >
> > Should no VMA exist in the range, -EFAULT is returned as usual.
> >
> > If a VMA move spans a single VMA - then there is no functional change.
> >
> > Otherwise, we place additional requirements upon VMAs:
> >
> > * They must not have a userfaultfd context associated with them - this
> > requires dropping the lock to notify users, and we want to perform the
> > operation with the mmap write lock held throughout.
> >
> > * If file-backed, they cannot have a custom get_unmapped_area handler -
> > this might result in MREMAP_FIXED not being honoured, which could result
> > in unexpected positioning of VMAs in the moved region.
> >
> > There may be gaps in the range of VMAs that are moved:
> >
> > X Y X Y
> > <---> <-> <---> <->
> > |-------| |-----| |-----| |-------| |-----| |-----|
> > | A | | B | | C | ---> | A' | | B' | | C' |
> > |-------| |-----| |-----| |-------| |-----| |-----|
> > addr new_addr
> >
> > The move will preserve the gaps between each VMA.
>
> AFAIU "moving a gap" doesn't mean we unmap anything pre-existing where the
> moved gap's range falls to, right? Worth pointing out explicitly.
>
> > Note that any failures encountered will result in a partial move. Since an
> > mremap() can fail at any time, this might result in only some of the VMAs
> > being moved.
> >
> > Note that failures are very rare and typically require an out of a memory
> > condition or a mapping limit condition to be hit, assuming the VMAs being
> > moved are valid.
> >
> > We don't try to assess ahead of time whether VMAs are valid according to
> > the multi VMA rules, as it would be rather unusual for a user to mix
> > uffd-enabled VMAs and/or VMAs which map unusual driver mappings that
> > specify custom get_unmapped_area() handlers in an aggregate operation.
> >
> > So we optimise for the far, far more likely case of the operation being
> > entirely permissible.
>
> Guess it's the sanest thing to do given all the cirumstances.
>
> > In the case of the move of a single VMA, the above conditions are
> > permitted. This makes the behaviour identical for a single VMA as before.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> Some nits:
>
> > ---
> > mm/mremap.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 150 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 8cb08ccea6ad..59f49de0f84e 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -69,6 +69,8 @@ struct vma_remap_struct {
> > enum mremap_type remap_type; /* expand, shrink, etc. */
> > bool mmap_locked; /* Is mm currently write-locked? */
> > unsigned long charged; /* If VM_ACCOUNT, # pages to account. */
> > + bool seen_vma; /* Is >1 VMA being moved? */
>
> Seems this could be local variable of remap_move().
Yes, this is because there _was_ some external use before, but after the
rework there no longer is. Will fix up in a fix-patch.
>
> > + bool vmi_needs_reset; /* Was the VMA iterator invalidated? */
> > };
> >
> > static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> > @@ -1188,6 +1190,9 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
> > *new_vma_ptr = NULL;
> > return -ENOMEM;
> > }
>
> A newline here?
I kinda thought it made sense to 'group' it with the logic above, so this was
on purpose.
>
> > + if (vma != vrm->vma)
> > + vrm->vmi_needs_reset = true;
>
> A comment on what this condition means wouldn't hurt? Is it when "Source vma
> may have been merged into new_vma" in copy_vma(), or when not?
>
Sure will add in a fix-patch.
* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-11 11:38 ` [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
2025-07-11 13:34 ` Vlastimil Babka
@ 2025-07-11 14:14 ` Lorenzo Stoakes
2025-07-16 19:36 ` Lorenzo Stoakes
2025-07-25 17:11 ` Jann Horn
3 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 14:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest
Hi Andrew,
Just a quick fix-patch here to address some nits! :)
Thanks, Lorenzo
----8<----
From 3656941531c0fb796f5a8ca7664832e58d468607 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Fri, 11 Jul 2025 15:06:06 +0100
Subject: [PATCH] mm/mremap: address review comments
Make seen_vma a local variable, and add a comment when vmi_needs_reset is
set to explain what it's for.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mremap.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index 59f49de0f84e..142dc5287d8a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -69,7 +69,6 @@ struct vma_remap_struct {
enum mremap_type remap_type; /* expand, shrink, etc. */
bool mmap_locked; /* Is mm currently write-locked? */
unsigned long charged; /* If VM_ACCOUNT, # pages to account. */
- bool seen_vma; /* Is >1 VMA being moved? */
bool vmi_needs_reset; /* Was the VMA iterator invalidated? */
};
@@ -1190,6 +1189,7 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
*new_vma_ptr = NULL;
return -ENOMEM;
}
+ /* By merging, we may have invalidated any iterator in use. */
if (vma != vrm->vma)
vrm->vmi_needs_reset = true;
@@ -1807,10 +1807,10 @@ static unsigned long remap_move(struct vma_remap_struct *vrm)
unsigned long start = vrm->addr;
unsigned long end = vrm->addr + vrm->old_len;
unsigned long new_addr = vrm->new_addr;
+ bool allowed = true, seen_vma = false;
unsigned long target_addr = new_addr;
unsigned long res = -EFAULT;
unsigned long last_end;
- bool allowed = true;
VMA_ITERATOR(vmi, current->mm, start);
/*
@@ -1828,7 +1828,7 @@ static unsigned long remap_move(struct vma_remap_struct *vrm)
return -EFAULT;
/* No gap permitted at the start of the range. */
- if (!vrm->seen_vma && start < vma->vm_start)
+ if (!seen_vma && start < vma->vm_start)
return -EFAULT;
/*
@@ -1846,7 +1846,7 @@ static unsigned long remap_move(struct vma_remap_struct *vrm)
*
* So we map B' at A'->vm_end + X, and C' at B'->vm_end + Y.
*/
- offset = vrm->seen_vma ? vma->vm_start - last_end : 0;
+ offset = seen_vma ? vma->vm_start - last_end : 0;
last_end = vma->vm_end;
vrm->vma = vma;
@@ -1855,7 +1855,7 @@ static unsigned long remap_move(struct vma_remap_struct *vrm)
vrm->old_len = vrm->new_len = len;
allowed = vma_multi_allowed(vma);
- if (vrm->seen_vma && !allowed)
+ if (seen_vma && !allowed)
return -EFAULT;
res_vma = check_prep_vma(vrm);
@@ -1864,7 +1864,7 @@ static unsigned long remap_move(struct vma_remap_struct *vrm)
if (IS_ERR_VALUE(res_vma))
return res_vma;
- if (!vrm->seen_vma) {
+ if (!seen_vma) {
VM_WARN_ON_ONCE(allowed && res_vma != new_addr);
res = res_vma;
}
@@ -1878,7 +1878,7 @@ static unsigned long remap_move(struct vma_remap_struct *vrm)
vma_iter_reset(&vmi);
vrm->vmi_needs_reset = false;
}
- vrm->seen_vma = true;
+ seen_vma = true;
target_addr = res_vma + vrm->new_len;
}
--
2.50.0
* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-11 11:38 ` [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
2025-07-11 13:34 ` Vlastimil Babka
2025-07-11 14:14 ` Lorenzo Stoakes
@ 2025-07-16 19:36 ` Lorenzo Stoakes
2025-07-25 17:11 ` Jann Horn
3 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-16 19:36 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest
Hi Andrew,
Just a quick fix to address issues raised by syzkaller. I removed this code
previously based on misinterpreting review feedback as indicating that I
could do so...
In any case I have tested this against the repro (it turns out a well-placed
RCU barrier makes it reproduce reliably) and confirmed it fixes the issue.
Thanks, Lorenzo
----8<----
From 4e07d53c6627af21847752ec71f5ecd00afab03b Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Wed, 16 Jul 2025 20:29:54 +0100
Subject: [PATCH] mm/mremap: reset VMI on unmap
Any separate VMA iterator may become invalidated when VMAs are unmapped at
nodes in proximity to the current position of the iterator.
Therefore, reset the iterator at each point where this occurs on a mremap
move.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mremap.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/mremap.c b/mm/mremap.c
index 7a2e7022139a..15cbd41515ed 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1113,6 +1113,7 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
err = do_vmi_munmap(&vmi, mm, addr, len, vrm->uf_unmap, /* unlock= */false);
vrm->vma = NULL; /* Invalidated. */
+ vrm->vmi_needs_reset = true;
if (err) {
/* OOM: unable to split vma, just get accounts right */
vm_acct_memory(len >> PAGE_SHIFT);
@@ -1367,6 +1368,7 @@ static unsigned long mremap_to(struct vma_remap_struct *vrm)
err = do_munmap(mm, vrm->new_addr, vrm->new_len,
vrm->uf_unmap_early);
vrm->vma = NULL; /* Invalidated. */
+ vrm->vmi_needs_reset = true;
if (err)
return err;
--
2.50.1
* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-11 11:38 ` [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs Lorenzo Stoakes
` (2 preceding siblings ...)
2025-07-16 19:36 ` Lorenzo Stoakes
@ 2025-07-25 17:11 ` Jann Horn
2025-07-25 17:27 ` Lorenzo Stoakes
3 siblings, 1 reply; 20+ messages in thread
From: Jann Horn @ 2025-07-25 17:11 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
Jan Kara, Liam R . Howlett, Vlastimil Babka, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest, Linux API
On Fri, Jul 11, 2025 at 1:38 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> Note that any failures encountered will result in a partial move. Since an
> mremap() can fail at any time, this might result in only some of the VMAs
> being moved.
>
> Note that failures are very rare and typically require an out of a memory
> condition or a mapping limit condition to be hit, assuming the VMAs being
> moved are valid.
Hrm. So if userspace tries to move a series of VMAs with mremap(), and
the operation fails, and userspace assumes the old syscall semantics,
userspace could assume that its memory is still at the old address,
when that's actually not true; and if userspace tries to access it
there, userspace UAF happens?
If we were explicitly killing the userspace process on this error
path, that'd be fine; but since we're just returning an error, we're
kind of making userspace believe that the move hasn't happened? (You
might notice that I'm generally in favor of killing userspace
processes when userspace does sufficiently weird things.)
I guess it's not going to happen particularly often since mremap()
with MREMAP_FIXED is a weirdly specific operation in the first place;
normal users of mremap() (like libc's realloc()) wouldn't have a
reason to use it...
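To make that concrete, here's a minimal hypothetical sketch (mine, not
anything from the series) of the pattern I mean - a caller written against
the old all-or-nothing semantics:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical caller assuming the old single-VMA semantics: if mremap()
 * fails it believes nothing moved and keeps using the old address. With a
 * partial multi-VMA move, part of [old, old + len) may already have been
 * unmapped, so the access below can fault or hit whatever got mapped there
 * since.
 */
static void move_or_keep(char *old, char *target, size_t len)
{
	void *res = mremap(old, len, len,
			   MREMAP_MAYMOVE | MREMAP_FIXED, target);

	if (res == MAP_FAILED)
		memset(old, 0, len);	/* unsafe if the move was partial */
}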
* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-25 17:11 ` Jann Horn
@ 2025-07-25 17:27 ` Lorenzo Stoakes
2025-07-25 19:10 ` Jann Horn
0 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25 17:27 UTC (permalink / raw)
To: Jann Horn
Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
Jan Kara, Liam R . Howlett, Vlastimil Babka, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest, Linux API
On Fri, Jul 25, 2025 at 07:11:49PM +0200, Jann Horn wrote:
> On Fri, Jul 11, 2025 at 1:38 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > Note that any failures encountered will result in a partial move. Since an
> > mremap() can fail at any time, this might result in only some of the VMAs
> > being moved.
> >
> > Note that failures are very rare and typically require an out of a memory
> > condition or a mapping limit condition to be hit, assuming the VMAs being
> > moved are valid.
>
> Hrm. So if userspace tries to move a series of VMAs with mremap(), and
> the operation fails, and userspace assumes the old syscall semantics,
> userspace could assume that its memory is still at the old address,
> when that's actually not true; and if userspace tries to access it
> there, userspace UAF happens?
At 6pm on the last day of the cycle? :) dude :) this long week gets ever
longer...
I doubt any logic like this really exists, since mremap() is actually quite hard
to make fail, and a failure will likely be a mapping limit or OOM issue (the
latter being 'too small to fail', so your system is about to die anyway).
So it'd be very strange to then rely on that.
And the _usual_ sensible reasons why this might fail would likely cause a
failure on the first mapping (a mapping limit, say, which you'd probably hit
only on that one, etc.).
The _unusual_ errors are typically user errors - you try to use this with uffd
mappings, etc. - well, then that's not a functional program and we don't need
to worry.
And I sent a manpage change which very explicitly describes this behaviour.
In any case there is simply _no way_ to not do this.
If the failure's because of OOM, we're screwed anyway and a user segfault is
inevitable; we'd get into a pickle trying to move things back.
Otherwise for mapping limit we likely hit it right away. I moved all the
checks up front for standard VMA/param errors.
The other kinds of errors would require you to try to move normal VMAs
right next to driver VMAs, or a mix of uffd and non-uffd VMAs, or something
similar - and that'll be your fault.
So basically it'll nearly never happen, and it doesn't make much sense for
code to rely on this failing.
>
> If we were explicitly killing the userspace process on this error
> path, that'd be fine; but since we're just returning an error, we're
> kind of making userspace believe that the move hasn't happened? (You
> might notice that I'm generally in favor of killing userspace
> processes when userspace does sufficiently weird things.)
Well if we get them to segfault... :P
I think it's userspace's fault if they try to 'recover' based on shaky
assumptions.
>
> I guess it's not going to happen particularly often since mremap()
> with MREMAP_FIXED is a weirdly specific operation in the first place;
> normal users of mremap() (like libc's realloc()) wouldn't have a
> reason to use it...
Yes, and you would have to be using it in such a way that things would have
broken before, for very specific reasons.
I think the only truly viable 'if this fails, assume it's still in place' case
might very well be if the VMAs happened to be fragmented, which is obviously
exactly what this now changes :)
So I don't think there's an issue here. Additionally it's very much 'the
kernel way' that partially failed aggregate operations don't unwind.
The key thing about mremap is that each individual move will be unwound
such that page tables remain valid...
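For reference, a minimal userspace sketch (purely illustrative - it loosely
mirrors the selftest in patch 10 rather than prescribing anything new) of the
fragmented-VMA move this series permits:

#define _GNU_SOURCE
#include <sys/mman.h>

/*
 * Sketch only: two anonymous VMAs separated by a one-page gap are moved in a
 * single mremap() call; the gap between them is preserved at the destination.
 * Returns the new address, or MAP_FAILED on error.
 */
static void *move_fragmented(size_t page)
{
	char *src = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* Reserve, then free, a destination range so it is known-unmapped. */
	char *dst = mmap(NULL, 3 * page, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (src == MAP_FAILED || dst == MAP_FAILED)
		return MAP_FAILED;
	if (munmap(dst, 3 * page))
		return MAP_FAILED;

	/* Punch a hole so the source range spans two VMAs with a gap. */
	if (munmap(src + page, page))
		return MAP_FAILED;

	/* Before this series: -EFAULT. With it: both VMAs move, gap intact. */
	return mremap(src, 3 * page, 3 * page,
		      MREMAP_MAYMOVE | MREMAP_FIXED, dst);
}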
* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-25 17:27 ` Lorenzo Stoakes
@ 2025-07-25 19:10 ` Jann Horn
2025-07-25 19:59 ` Lorenzo Stoakes
0 siblings, 1 reply; 20+ messages in thread
From: Jann Horn @ 2025-07-25 19:10 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
Jan Kara, Liam R . Howlett, Vlastimil Babka, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest, Linux API
On Fri, Jul 25, 2025 at 7:28 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Fri, Jul 25, 2025 at 07:11:49PM +0200, Jann Horn wrote:
> > On Fri, Jul 11, 2025 at 1:38 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > Note that any failures encountered will result in a partial move. Since an
> > > mremap() can fail at any time, this might result in only some of the VMAs
> > > being moved.
> > >
> > > Note that failures are very rare and typically require an out of a memory
> > > condition or a mapping limit condition to be hit, assuming the VMAs being
> > > moved are valid.
> >
> > Hrm. So if userspace tries to move a series of VMAs with mremap(), and
> > the operation fails, and userspace assumes the old syscall semantics,
> > userspace could assume that its memory is still at the old address,
> > when that's actually not true; and if userspace tries to access it
> > there, userspace UAF happens?
>
> At 6pm on the last day of the cycle? :) dude :) this long week gets ever
> longer...
To be clear, I very much do not expect you to instantly reply to
random patch review mail I send you late on a Friday evening. :P
> Otherwise for mapping limit we likely hit it right away. I moved all the
> checks up front for standard VMA/param errors.
Ah, I missed that part.
* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-25 19:10 ` Jann Horn
@ 2025-07-25 19:59 ` Lorenzo Stoakes
0 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25 19:59 UTC (permalink / raw)
To: Jann Horn
Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
Jan Kara, Liam R . Howlett, Vlastimil Babka, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest, Linux API
On Fri, Jul 25, 2025 at 09:10:25PM +0200, Jann Horn wrote:
> On Fri, Jul 25, 2025 at 7:28 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Fri, Jul 25, 2025 at 07:11:49PM +0200, Jann Horn wrote:
> > > On Fri, Jul 11, 2025 at 1:38 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > Note that any failures encountered will result in a partial move. Since an
> > > > mremap() can fail at any time, this might result in only some of the VMAs
> > > > being moved.
> > > >
> > > > Note that failures are very rare and typically require an out of a memory
> > > > condition or a mapping limit condition to be hit, assuming the VMAs being
> > > > moved are valid.
> > >
> > > Hrm. So if userspace tries to move a series of VMAs with mremap(), and
> > > the operation fails, and userspace assumes the old syscall semantics,
> > > userspace could assume that its memory is still at the old address,
> > > when that's actually not true; and if userspace tries to access it
> > > there, userspace UAF happens?
> >
> > At 6pm on the last day of the cycle? :) dude :) this long week gets ever
> > longer...
>
> To be clear, I very much do not expect you to instantly reply to
> random patch review mail I send you late on a Friday evening. :P
Haha sure, just keen to clarify things on this!
>
> > Otherwise for mapping limit we likely hit it right away. I moved all the
> > checks up front for standard VMA/param errors.
>
> Ah, I missed that part.
Yeah, the refactoring part of the series prior to this patch goes to great
lengths to prepare for this (including moving tests earlier - all of which I
confirmed were safe to do earlier).
:)