* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
[not found] ` <8f41e72b0543953d277e96d5e67a52f287cdbac3.1752232673.git.lorenzo.stoakes@oracle.com>
@ 2025-07-11 13:34 ` Vlastimil Babka
2025-07-11 13:49 ` Lorenzo Stoakes
2025-07-25 17:11 ` Jann Horn
1 sibling, 1 reply; 7+ messages in thread
From: Vlastimil Babka @ 2025-07-11 13:34 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Jann Horn, Pedro Falcato, Rik van Riel,
linux-mm, linux-fsdevel, linux-kernel, linux-kselftest, Linux API
+cc linux-api - see the description of the new behavior below
On 7/11/25 13:38, Lorenzo Stoakes wrote:
> Historically we've made it a uAPI requirement that mremap() may only
> operate on a single VMA at a time.
>
> For instances where VMAs need to be resized, this makes sense, as it
> becomes very difficult to determine what a user actually wants should they
> indicate a desire to expand or shrink the size of multiple VMAs (truncate?
> Adjust sizes individually? Some other strategy?).
>
> However, in instances where a user is moving VMAs, it is restrictive to
> disallow this.
>
> This is especially the case for anonymous mappings, where a moved region may
> or may not be mergeable depending on whether its VMAs have been faulted (and
> so assigned an anon_vma) and whether folio indices align with vma->vm_pgoff.
>
> Often this results in surprising behaviour: a moved region is faulted, then
> moved back, and the user fails to observe a merge of otherwise compatible,
> adjacent VMAs.
>
> This change allows such cases to work without the user having to be
> cognizant of whether a prior mremap() move or other VMA operation has
> resulted in VMA fragmentation.
>
> We only permit this for mremap() operations that do NOT change the size of
> the VMA and DO specify MREMAP_MAYMOVE | MREMAP_FIXED.
>
> Should no VMA exist in the range, -EFAULT is returned as usual.
>
> If a VMA move spans only a single VMA, there is no functional change.
>
> Otherwise, we place additional requirements upon VMAs:
>
> * They must not have a userfaultfd context associated with them - this
> requires dropping the lock to notify users, and we want to perform the
> operation with the mmap write lock held throughout.
>
> * If file-backed, they cannot have a custom get_unmapped_area handler -
> this might result in MREMAP_FIXED not being honoured, which could result
> in unexpected positioning of VMAs in the moved region.
>
> There may be gaps in the range of VMAs that are moved:
>
> X Y X Y
> <---> <-> <---> <->
> |-------| |-----| |-----| |-------| |-----| |-----|
> | A | | B | | C | ---> | A' | | B' | | C' |
> |-------| |-----| |-----| |-------| |-----| |-----|
> addr new_addr
>
> The move will preserve the gaps between each VMA.
AFAIU "moving a gap" doesn't mean we unmap anything pre-existing where the
moved gap's range falls to, right? Worth pointing out explicitly.
> Note that any failures encountered will result in a partial move. Since an
> mremap() can fail at any time, this might result in only some of the VMAs
> being moved.
>
> Note that failures are very rare and typically require an out of memory
> condition or a mapping limit condition to be hit, assuming the VMAs being
> moved are valid.
>
> We don't try to assess ahead of time whether VMAs are valid according to
> the multi VMA rules, as it would be rather unusual for a user to mix
> uffd-enabled VMAs and/or VMAs which map unusual driver mappings that
> specify custom get_unmapped_area() handlers in an aggregate operation.
>
> So we optimise for the far, far more likely case of the operation being
> entirely permissible.
Guess it's the sanest thing to do given all the circumstances.
> In the case of the move of a single VMA, the above conditions are
> permitted. This makes the behaviour identical for a single VMA as before.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Some nits:
> ---
> mm/mremap.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 150 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 8cb08ccea6ad..59f49de0f84e 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -69,6 +69,8 @@ struct vma_remap_struct {
> enum mremap_type remap_type; /* expand, shrink, etc. */
> bool mmap_locked; /* Is mm currently write-locked? */
> unsigned long charged; /* If VM_ACCOUNT, # pages to account. */
> + bool seen_vma; /* Is >1 VMA being moved? */
Seems this could be a local variable of remap_move().
> + bool vmi_needs_reset; /* Was the VMA iterator invalidated? */
> };
>
> static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> @@ -1188,6 +1190,9 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
> *new_vma_ptr = NULL;
> return -ENOMEM;
> }
A newline here?
> + if (vma != vrm->vma)
> + vrm->vmi_needs_reset = true;
A comment on what this condition means wouldn't hurt? Is it when "Source vma
may have been merged into new_vma" in copy_vma(), or when not?
(no comments below; remaining quoted part left for linux-api reference)
> +
> vrm->vma = vma;
> pmc.old = vma;
> pmc.new = new_vma;
> @@ -1583,6 +1588,18 @@ static bool vrm_will_map_new(struct vma_remap_struct *vrm)
> return false;
> }
>
> +/* Does this remap ONLY move mappings? */
> +static bool vrm_move_only(struct vma_remap_struct *vrm)
> +{
> + if (!(vrm->flags & MREMAP_FIXED))
> + return false;
> +
> + if (vrm->old_len != vrm->new_len)
> + return false;
> +
> + return true;
> +}
> +
> static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
> {
> struct mm_struct *mm = current->mm;
> @@ -1597,6 +1614,32 @@ static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
> userfaultfd_unmap_complete(mm, vrm->uf_unmap);
> }
>
> +static bool vma_multi_allowed(struct vm_area_struct *vma)
> +{
> + struct file *file;
> +
> + /*
> + * We can't support moving multiple uffd VMAs as notify requires
> + * mmap lock to be dropped.
> + */
> + if (userfaultfd_armed(vma))
> + return false;
> +
> + /*
> + * Custom get unmapped area might result in MREMAP_FIXED not
> + * being obeyed.
> + */
> + file = vma->vm_file;
> + if (file && !vma_is_shmem(vma) && !is_vm_hugetlb_page(vma)) {
> + const struct file_operations *fop = file->f_op;
> +
> + if (fop->get_unmapped_area)
> + return false;
> + }
> +
> + return true;
> +}
> +
> static int check_prep_vma(struct vma_remap_struct *vrm)
> {
> struct vm_area_struct *vma = vrm->vma;
> @@ -1646,7 +1689,19 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
> (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
> return -EINVAL;
>
> - /* We can't remap across vm area boundaries */
> + /*
> + * We can't remap across the end of VMAs, as another VMA may be
> + * adjacent:
> + *
> + * addr vma->vm_end
> + * |-----.----------|
> + * | . |
> + * |-----.----------|
> + * .<--------->xxx>
> + * old_len
> + *
> + * We also require that vma->vm_start <= addr < vma->vm_end.
> + */
> if (old_len > vma->vm_end - addr)
> return -EFAULT;
>
> @@ -1746,6 +1801,90 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
> return 0;
> }
>
> +static unsigned long remap_move(struct vma_remap_struct *vrm)
> +{
> + struct vm_area_struct *vma;
> + unsigned long start = vrm->addr;
> + unsigned long end = vrm->addr + vrm->old_len;
> + unsigned long new_addr = vrm->new_addr;
> + unsigned long target_addr = new_addr;
> + unsigned long res = -EFAULT;
> + unsigned long last_end;
> + bool allowed = true;
> + VMA_ITERATOR(vmi, current->mm, start);
> +
> + /*
> + * When moving VMAs we allow for batched moves across multiple VMAs,
> + * with all VMAs in the input range [addr, addr + old_len) being moved
> + * (and split as necessary).
> + */
> + for_each_vma_range(vmi, vma, end) {
> + /* Account for start, end not aligned with VMA start, end. */
> + unsigned long addr = max(vma->vm_start, start);
> + unsigned long len = min(end, vma->vm_end) - addr;
> + unsigned long offset, res_vma;
> +
> + if (!allowed)
> + return -EFAULT;
> +
> + /* No gap permitted at the start of the range. */
> + if (!vrm->seen_vma && start < vma->vm_start)
> + return -EFAULT;
> +
> + /*
> + * To sensibly move multiple VMAs, accounting for the fact that
> + * get_unmapped_area() may align even MAP_FIXED moves, we simply
> + * attempt to move such that the gaps between source VMAs remain
> + * consistent in destination VMAs, e.g.:
> + *
> + * X Y X Y
> + * <---> <-> <---> <->
> + * |-------| |-----| |-----| |-------| |-----| |-----|
> + * | A | | B | | C | ---> | A' | | B' | | C' |
> + * |-------| |-----| |-----| |-------| |-----| |-----|
> + * new_addr
> + *
> + * So we map B' at A'->vm_end + X, and C' at B'->vm_end + Y.
> + */
> + offset = vrm->seen_vma ? vma->vm_start - last_end : 0;
> + last_end = vma->vm_end;
> +
> + vrm->vma = vma;
> + vrm->addr = addr;
> + vrm->new_addr = target_addr + offset;
> + vrm->old_len = vrm->new_len = len;
> +
> + allowed = vma_multi_allowed(vma);
> + if (vrm->seen_vma && !allowed)
> + return -EFAULT;
> +
> + res_vma = check_prep_vma(vrm);
> + if (!res_vma)
> + res_vma = mremap_to(vrm);
> + if (IS_ERR_VALUE(res_vma))
> + return res_vma;
> +
> + if (!vrm->seen_vma) {
> + VM_WARN_ON_ONCE(allowed && res_vma != new_addr);
> + res = res_vma;
> + }
> +
> + /* mmap lock is only dropped on shrink. */
> + VM_WARN_ON_ONCE(!vrm->mmap_locked);
> + /* This is a move, no expand should occur. */
> + VM_WARN_ON_ONCE(vrm->populate_expand);
> +
> + if (vrm->vmi_needs_reset) {
> + vma_iter_reset(&vmi);
> + vrm->vmi_needs_reset = false;
> + }
> + vrm->seen_vma = true;
> + target_addr = res_vma + vrm->new_len;
> + }
> +
> + return res;
> +}
> +
> static unsigned long do_mremap(struct vma_remap_struct *vrm)
> {
> struct mm_struct *mm = current->mm;
> @@ -1763,13 +1902,17 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
> return -EINTR;
> vrm->mmap_locked = true;
>
> - vrm->vma = vma_lookup(current->mm, vrm->addr);
> - res = check_prep_vma(vrm);
> - if (res)
> - goto out;
> + if (vrm_move_only(vrm)) {
> + res = remap_move(vrm);
> + } else {
> + vrm->vma = vma_lookup(current->mm, vrm->addr);
> + res = check_prep_vma(vrm);
> + if (res)
> + goto out;
>
> - /* Actually execute mremap. */
> - res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
> + /* Actually execute mremap. */
> + res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
> + }
>
> out:
> failed = IS_ERR_VALUE(res);
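
For linux-api readers, here is a rough userspace sketch of what the new
behaviour enables (illustrative only - it assumes a kernel with this series
applied; error handling is omitted):

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);

        /* Build a 3-page source range, then split it into multiple VMAs
         * by changing the protection of the middle page. */
        char *src = mmap(NULL, 3 * pgsz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        mprotect(src + pgsz, pgsz, PROT_READ);

        /* Reserve a non-overlapping destination for MREMAP_FIXED. */
        char *dst = mmap(NULL, 3 * pgsz, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Size-preserving MREMAP_MAYMOVE | MREMAP_FIXED: with this series
         * the whole multi-VMA range moves in one call; older kernels fail
         * with EFAULT because the range crosses a VMA boundary. */
        void *res = mremap(src, 3 * pgsz, 3 * pgsz,
                           MREMAP_MAYMOVE | MREMAP_FIXED, dst);
        printf("mremap: %s\n",
               res == MAP_FAILED ? strerror(errno) : "moved");
        return 0;
}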
* Re: [PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs
[not found] <cover.1752232673.git.lorenzo.stoakes@oracle.com>
[not found] ` <8f41e72b0543953d277e96d5e67a52f287cdbac3.1752232673.git.lorenzo.stoakes@oracle.com>
@ 2025-07-11 13:45 ` Lorenzo Stoakes
1 sibling, 0 replies; 7+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 13:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Xu, Alexander Viro, Christian Brauner, Jan Kara,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest, linux-api
Apologies, I've done it again... :) Friday eh?
+cc linux-api - FYI guys.
I will of course also update the manpage to reflect this change.
On Fri, Jul 11, 2025 at 12:38:14PM +0100, Lorenzo Stoakes wrote:
> Historically we've made it a uAPI requirement that mremap() may only
> operate on a single VMA at a time.
>
> For instances where VMAs need to be resized, this makes sense, as it
> becomes very difficult to determine what a user actually wants should they
> indicate a desire to expand or shrink the size of multiple VMAs (truncate?
> Adjust sizes individually? Some other strategy?).
>
> However, in instances where a user is moving VMAs, it is restrictive to
> disallow this.
>
> This is especially the case for anonymous mappings, where a moved region may
> or may not be mergeable depending on whether its VMAs have been faulted (and
> so assigned an anon_vma) and whether folio indices align with vma->vm_pgoff.
>
> Often this results in surprising behaviour: a moved region is faulted, then
> moved back, and the user fails to observe a merge of otherwise compatible,
> adjacent VMAs.
>
> This change allows such cases to work without the user having to be
> cognizant of whether a prior mremap() move or other VMA operation has
> resulted in VMA fragmentation.
>
> In order to do this, this series performs a large amount of refactoring,
> most pertinently grouping sanity checks together and separating those that
> check input parameters from those relating to VMAs.
>
> We also simplify the post-mmap lock drop processing for uffd and mlock()'d
> VMAs.
>
> With this done, we can then fairly straightforwardly implement this
> functionality.
>
> This works exclusively for mremap() invocations which specify
> MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
> notification of the userland fault handler would require us to drop the
> mmap lock.
>
> It is also not compatible with file-backed mappings with customised
> get_unmapped_area() handlers as these may not honour MREMAP_FIXED.
>
> The input and output addresses ranges must not overlap. We carefully
> account for moves which would result in VMA iterator invalidation.
>
> While there can be gaps between VMAs in the input range, there can be no
> gap before the first VMA in the range.
>
>
> v3:
> * Disallowed move operation except for MREMAP_FIXED.
> * Disallow gap at start of aggregate range to avoid confusion.
> * Disallow any file-backed VMAs with custom get_unmapped_area.
> * Renamed multi_vma to seen_vma to be clearer. Stop reusing new_addr, use
> separate target_addr var to track next target address.
> * Check if first VMA fails multi VMA check, if so we'll allow one VMA but
> not multiple.
> * Updated the commit message for patch 9 to be clearer about gap behaviour.
> * Removed accidentally included debug goto statement in test (doh!). Test
> was and is passing regardless.
> * Unmap target range in test; previously we ended up moving additional VMAs
> unintentionally. This still all passed :) but was not what was intended.
> * Removed self-merge check - there is absolutely no way this can happen
> across multiple VMAs, as there is no means of moving VMAs such that a VMA
> merges with itself.
>
> v2:
> * Squashed uffd stub fix into series.
> * Propagated tags, thanks!
> * Fixed param naming in patch 4 as per Vlastimil.
> * Renamed vma_reset to vmi_needs_reset + dropped reset on unmap as per
> Liam.
> * Correctly return -EFAULT if no VMAs in input range.
> * Account for get_unmapped_area() disregarding MAP_FIXED and returning an
> altered address.
> * Added additional explanatory comment to the remap_move() function.
> https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/
>
> v1:
> https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/
>
>
> Lorenzo Stoakes (10):
> mm/mremap: perform some simple cleanups
> mm/mremap: refactor initial parameter sanity checks
> mm/mremap: put VMA check and prep logic into helper function
> mm/mremap: cleanup post-processing stage of mremap
> mm/mremap: use an explicit uffd failure path for mremap
> mm/mremap: check remap conditions earlier
> mm/mremap: move remap_is_valid() into check_prep_vma()
> mm/mremap: clean up mlock populate behaviour
> mm/mremap: permit mremap() move of multiple VMAs
> tools/testing/selftests: extend mremap_test to test multi-VMA mremap
>
> fs/userfaultfd.c | 15 +-
> include/linux/userfaultfd_k.h | 5 +
> mm/mremap.c | 553 +++++++++++++++--------
> tools/testing/selftests/mm/mremap_test.c | 146 +++++-
> 4 files changed, 518 insertions(+), 201 deletions(-)
>
> --
> 2.50.0
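
To make the merge/fragmentation motivation concrete for linux-api folks, a
hedged sketch (illustrative only - whether the VMAs re-merge depends on
anon_vma state, which is exactly the behaviour this series addresses; the
count_vmas() helper is just for demonstration):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count entries in /proc/self/maps overlapping [lo, hi). */
static int count_vmas(unsigned long lo, unsigned long hi)
{
        FILE *f = fopen("/proc/self/maps", "r");
        unsigned long start, end;
        char line[256];
        int n = 0;

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
                    start < hi && end > lo)
                        n++;
        fclose(f);
        return n;
}

int main(void)
{
        const size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, 4 * pgsz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *scratch = mmap(NULL, pgsz, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        p[0] = p[pgsz] = 1;     /* fault -> anon_vma gets assigned */

        /* Move the second page away, then straight back again. */
        mremap(p + pgsz, pgsz, pgsz, MREMAP_MAYMOVE | MREMAP_FIXED, scratch);
        mremap(scratch, pgsz, pgsz, MREMAP_MAYMOVE | MREMAP_FIXED, p + pgsz);

        /* The range may now be several VMAs rather than one - the
         * fragmentation described above, which a multi-VMA move no
         * longer trips over. */
        printf("VMAs covering range: %d\n",
               count_vmas((unsigned long)p, (unsigned long)(p + 4 * pgsz)));
        return 0;
}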
* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-11 13:34 ` [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs Vlastimil Babka
@ 2025-07-11 13:49 ` Lorenzo Stoakes
0 siblings, 0 replies; 7+ messages in thread
From: Lorenzo Stoakes @ 2025-07-11 13:49 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
Jan Kara, Liam R . Howlett, Jann Horn, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest, Linux API
On Fri, Jul 11, 2025 at 03:34:23PM +0200, Vlastimil Babka wrote:
> +cc linux-api - see the description of the new behavior below
Ah yeah :) I sent on 0/10 also. Friday...
>
> On 7/11/25 13:38, Lorenzo Stoakes wrote:
> > Historically we've made it a uAPI requirement that mremap() may only
> > operate on a single VMA at a time.
> >
> > For instances where VMAs need to be resized, this makes sense, as it
> > becomes very difficult to determine what a user actually wants should they
> > indicate a desire to expand or shrink the size of multiple VMAs (truncate?
> > Adjust sizes individually? Some other strategy?).
> >
> > However, in instances where a user is moving VMAs, it is restrictive to
> > disallow this.
> >
> > This is especially the case for anonymous mappings, where a moved region may
> > or may not be mergeable depending on whether its VMAs have been faulted (and
> > so assigned an anon_vma) and whether folio indices align with vma->vm_pgoff.
> >
> > Often this results in surprising behaviour: a moved region is faulted, then
> > moved back, and the user fails to observe a merge of otherwise compatible,
> > adjacent VMAs.
> >
> > This change allows such cases to work without the user having to be
> > cognizant of whether a prior mremap() move or other VMA operation has
> > resulted in VMA fragmentation.
> >
> > We only permit this for mremap() operations that do NOT change the size of
> > the VMA and DO specify MREMAP_MAYMOVE | MREMAP_FIXED.
> >
> > Should no VMA exist in the range, -EFAULT is returned as usual.
> >
> > If a VMA move spans only a single VMA, there is no functional change.
> >
> > Otherwise, we place additional requirements upon VMAs:
> >
> > * They must not have a userfaultfd context associated with them - this
> > requires dropping the lock to notify users, and we want to perform the
> > operation with the mmap write lock held throughout.
> >
> > * If file-backed, they cannot have a custom get_unmapped_area handler -
> > this might result in MREMAP_FIXED not being honoured, which could result
> > in unexpected positioning of VMAs in the moved region.
> >
> > There may be gaps in the range of VMAs that are moved:
> >
> > X Y X Y
> > <---> <-> <---> <->
> > |-------| |-----| |-----| |-------| |-----| |-----|
> > | A | | B | | C | ---> | A' | | B' | | C' |
> > |-------| |-----| |-----| |-------| |-----| |-----|
> > addr new_addr
> >
> > The move will preserve the gaps between each VMA.
>
> AFAIU "moving a gap" doesn't mean we unmap anything pre-existing where the
> moved gap's range falls to, right? Worth pointing out explicitly.
>
> > Note that any failures encountered will result in a partial move. Since an
> > mremap() can fail at any time, this might result in only some of the VMAs
> > being moved.
> >
> > Note that failures are very rare and typically require an out of memory
> > condition or a mapping limit condition to be hit, assuming the VMAs being
> > moved are valid.
> >
> > We don't try to assess ahead of time whether VMAs are valid according to
> > the multi VMA rules, as it would be rather unusual for a user to mix
> > uffd-enabled VMAs and/or VMAs which map unusual driver mappings that
> > specify custom get_unmapped_area() handlers in an aggregate operation.
> >
> > So we optimise for the far, far more likely case of the operation being
> > entirely permissible.
>
> Guess it's the sanest thing to do given all the circumstances.
>
> > In the case of the move of a single VMA, the above conditions are
> > permitted. This makes the behaviour identical for a single VMA as before.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> Some nits:
>
> > ---
> > mm/mremap.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 150 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 8cb08ccea6ad..59f49de0f84e 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -69,6 +69,8 @@ struct vma_remap_struct {
> > enum mremap_type remap_type; /* expand, shrink, etc. */
> > bool mmap_locked; /* Is mm currently write-locked? */
> > unsigned long charged; /* If VM_ACCOUNT, # pages to account. */
> > + bool seen_vma; /* Is >1 VMA being moved? */
>
> Seems this could be a local variable of remap_move().
Yes - this is because there _was_ some external use before, but after the
rework there no longer is. Will fix up in a fix-patch.
>
> > + bool vmi_needs_reset; /* Was the VMA iterator invalidated? */
> > };
> >
> > static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> > @@ -1188,6 +1190,9 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
> > *new_vma_ptr = NULL;
> > return -ENOMEM;
> > }
>
> A newline here?
I kinda thought it made sense to 'group' it with logic above, so this was
on purpose.
>
> > + if (vma != vrm->vma)
> > + vrm->vmi_needs_reset = true;
>
> A comment on what this condition means wouldn't hurt? Is it when "Source vma
> may have been merged into new_vma" in copy_vma(), or when not?
>
Sure will add in a fix-patch.
* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
[not found] ` <8f41e72b0543953d277e96d5e67a52f287cdbac3.1752232673.git.lorenzo.stoakes@oracle.com>
2025-07-11 13:34 ` [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs Vlastimil Babka
@ 2025-07-25 17:11 ` Jann Horn
2025-07-25 17:27 ` Lorenzo Stoakes
1 sibling, 1 reply; 7+ messages in thread
From: Jann Horn @ 2025-07-25 17:11 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
Jan Kara, Liam R . Howlett, Vlastimil Babka, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest, Linux API
On Fri, Jul 11, 2025 at 1:38 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> Note that any failures encountered will result in a partial move. Since an
> mremap() can fail at any time, this might result in only some of the VMAs
> being moved.
>
> Note that failures are very rare and typically require an out of memory
> condition or a mapping limit condition to be hit, assuming the VMAs being
> moved are valid.
Hrm. So if userspace tries to move a series of VMAs with mremap(), and
the operation fails, and userspace assumes the old syscall semantics,
userspace could assume that its memory is still at the old address,
when that's actually not true; and if userspace tries to access it
there, userspace UAF happens?
If we were explicitly killing the userspace process on this error
path, that'd be fine; but since we're just returning an error, we're
kind of making userspace believe that the move hasn't happened? (You
might notice that I'm generally in favor of killing userspace
processes when userspace does sufficiently weird things.)
I guess it's not going to happen particularly often since mremap()
with MREMAP_FIXED is a weirdly specific operation in the first place;
normal users of mremap() (like libc's realloc()) wouldn't have a
reason to use it...
* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-25 17:11 ` Jann Horn
@ 2025-07-25 17:27 ` Lorenzo Stoakes
2025-07-25 19:10 ` Jann Horn
0 siblings, 1 reply; 7+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25 17:27 UTC (permalink / raw)
To: Jann Horn
Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
Jan Kara, Liam R . Howlett, Vlastimil Babka, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest, Linux API
On Fri, Jul 25, 2025 at 07:11:49PM +0200, Jann Horn wrote:
> On Fri, Jul 11, 2025 at 1:38 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > Note that any failures encountered will result in a partial move. Since an
> > mremap() can fail at any time, this might result in only some of the VMAs
> > being moved.
> >
> > Note that failures are very rare and typically require an out of memory
> > condition or a mapping limit condition to be hit, assuming the VMAs being
> > moved are valid.
>
> Hrm. So if userspace tries to move a series of VMAs with mremap(), and
> the operation fails, and userspace assumes the old syscall semantics,
> userspace could assume that its memory is still at the old address,
> when that's actually not true; and if userspace tries to access it
> there, userspace UAF happens?
At 6pm on the last day of the cycle? :) dude :) this long week gets ever
longer...
I doubt any logic like this really exists, since mremap() is actually quite
hard to make fail, and it'll likely be a mapping limit or oom issue (the
latter being 'too small to fail', which would mean your system is about to
die anyway).
So it'd be very strange to then rely on that.
And the _usual_ sensible reasons why this might fail, would likely fail for
the first mapping (mapping limit, you'd probably hit it only on that one,
etc.)
The _unusual_ errors are typically user errors - you try to use it with uffd
mappings, etc. - well, then that's not a functional program and we don't
need to worry.
And I sent a manpage change which very explicitly describes this behaviour.
In any case there is simply _no way_ to not do this.
If the failure's because of oom, we're screwed anyway and a user segfault is
inevitable; we'd get into a pickle trying to move back.
Otherwise for mapping limit we likely hit it right away. I moved all the
checks up front for standard VMA/param errors.
The other kinds of errors would require you to try to move normal VMAs
right next to driver VMAs, or a mix of uffd and non-uffd VMAs, or something
similar - and that'll be your fault.
So basically it'll nearly never happen, and it doesn't make much sense for
code to rely on this failing.
>
> If we were explicitly killing the userspace process on this error
> path, that'd be fine; but since we're just returning an error, we're
> kind of making userspace believe that the move hasn't happened? (You
> might notice that I'm generally in favor of killing userspace
> processes when userspace does sufficiently weird things.)
Well if we get them to segfault... :P
I think it's userspace's fault if they try to 'recover' based on shaky
assumptions.
>
> I guess it's not going to happen particularly often since mremap()
> with MREMAP_FIXED is a weirdly specific operation in the first place;
> normal users of mremap() (like libc's realloc()) wouldn't have a
> reason to use it...
Yes, and you would have to be using it in such a way that things would have
broken before for very specific reasons.
I think the only truly viable 'if this fails, assume still in place' case
might very well be if the VMAs happened to be fragmented, which obviously
this now changes :)
So I don't think there's an issue here. Additionally it's very much 'the
kernel way' that partially failed aggregate operations don't unwind.
The key thing about mremap is that each individual move will be unwound
such that page tables remain valid...
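
(For completeness: a caller that really wanted to recover could re-derive
the state itself, e.g. via this hedged sketch of the classic msync() probe -
racy and illustrative only; rescanning /proc/self/maps is the robust option.)

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Returns 1 if every page in [addr, addr + len) is still mapped; msync()
 * fails with ENOMEM when part of the range is unmapped. addr must be
 * page-aligned or msync() fails with EINVAL instead. */
static int range_fully_mapped(void *addr, size_t len)
{
        return msync(addr, len, MS_ASYNC) == 0;
}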
* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-25 17:27 ` Lorenzo Stoakes
@ 2025-07-25 19:10 ` Jann Horn
2025-07-25 19:59 ` Lorenzo Stoakes
0 siblings, 1 reply; 7+ messages in thread
From: Jann Horn @ 2025-07-25 19:10 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
Jan Kara, Liam R . Howlett, Vlastimil Babka, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest, Linux API
On Fri, Jul 25, 2025 at 7:28 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Fri, Jul 25, 2025 at 07:11:49PM +0200, Jann Horn wrote:
> > On Fri, Jul 11, 2025 at 1:38 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > Note that any failures encountered will result in a partial move. Since an
> > > mremap() can fail at any time, this might result in only some of the VMAs
> > > being moved.
> > >
> > > Note that failures are very rare and typically require an out of memory
> > > condition or a mapping limit condition to be hit, assuming the VMAs being
> > > moved are valid.
> >
> > Hrm. So if userspace tries to move a series of VMAs with mremap(), and
> > the operation fails, and userspace assumes the old syscall semantics,
> > userspace could assume that its memory is still at the old address,
> > when that's actually not true; and if userspace tries to access it
> > there, userspace UAF happens?
>
> At 6pm on the last day of the cycle? :) dude :) this long week gets ever
> longer...
To be clear, I very much do not expect you to instantly reply to
random patch review mail I send you late on a Friday evening. :P
> Otherwise for mapping limit we likely hit it right away. I moved all the
> checks up front for standard VMA/param errors.
Ah, I missed that part.
* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
2025-07-25 19:10 ` Jann Horn
@ 2025-07-25 19:59 ` Lorenzo Stoakes
0 siblings, 0 replies; 7+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25 19:59 UTC (permalink / raw)
To: Jann Horn
Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
Jan Kara, Liam R . Howlett, Vlastimil Babka, Pedro Falcato,
Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
linux-kselftest, Linux API
On Fri, Jul 25, 2025 at 09:10:25PM +0200, Jann Horn wrote:
> On Fri, Jul 25, 2025 at 7:28 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Fri, Jul 25, 2025 at 07:11:49PM +0200, Jann Horn wrote:
> > > On Fri, Jul 11, 2025 at 1:38 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > Note that any failures encountered will result in a partial move. Since an
> > > > mremap() can fail at any time, this might result in only some of the VMAs
> > > > being moved.
> > > >
> > > > Note that failures are very rare and typically require an out of memory
> > > > condition or a mapping limit condition to be hit, assuming the VMAs being
> > > > moved are valid.
> > >
> > > Hrm. So if userspace tries to move a series of VMAs with mremap(), and
> > > the operation fails, and userspace assumes the old syscall semantics,
> > > userspace could assume that its memory is still at the old address,
> > > when that's actually not true; and if userspace tries to access it
> > > there, userspace UAF happens?
> >
> > At 6pm on the last day of the cycle? :) dude :) this long week gets ever
> > longer...
>
> To be clear, I very much do not expect you to instantly reply to
> random patch review mail I send you late on a Friday evening. :P
Haha sure, just keen to clarify things on this!
>
> > Otherwise for mapping limit we likely hit it right away. I moved all the
> > checks up front for standard VMA/param errors.
>
> Ah, I missed that part.
Yeah, the refactoring part of the series prior to this patch goes to great
lengths to prepare for this (including moving checks earlier - all of which
I confirmed were safe to perform earlier).
:)