linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
@ 2025-06-09 13:26 Lorenzo Stoakes
  2025-06-09 13:26 ` [PATCH 01/11] " Lorenzo Stoakes
                   ` (13 more replies)
  0 siblings, 14 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 13:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

A long-standing issue with the merging of anonymous VMAs is the requirement
to maintain both vma->vm_pgoff and anon_vma compatibility between merge
candidates.

For anonymous mappings, vma->vm_pgoff (and consequently, folio->index)
refer to virtual page offsets, that is, va >> PAGE_SHIFT.

However, upon mremap() of an anonymous mapping that has been faulted (that
is, where vma->anon_vma != NULL), we would need to walk page tables in order
to access, let alone manipulate, the folio->index and folio->mapping fields
to permit an update of this virtual page offset.

Therefore, in these instances, we do not do so, instead retaining the
virtual page offset at which the VMA was first faulted as its vma->vm_pgoff
field, and of course consequently folio->index.

On each occasion we need to determine the appropriate offset, we use
linear_page_index(), which cleverly offsets the vma->vm_pgoff field by the
difference between the virtual address and the actual VMA start.
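
For reference, a simplified sketch of what linear_page_index() does for
non-hugetlb mappings (hugetlb handling elided):

	static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
						unsigned long address)
	{
		pgoff_t pgoff;

		/* Page offset of the address within the VMA... */
		pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
		/* ...biased by the VMA's (possibly stale) virtual page offset. */
		pgoff += vma->vm_pgoff;
		return pgoff;
	}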

Doing so in effect fragments the virtual address space, meaning that we are
no longer able to merge these VMAs with adjacent ones that could, at least
theoretically, be merged.

This also creates a difference in behaviour, often surprising to users,
between mappings which are faulted and those which are not - as for the
latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.

This is problematic firstly because it proliferates kernel allocations that
are pure memory pressure - unreclaimable and unmovable -
i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.

Secondly, mremap() exhibits an implicit uAPI in that it does not permit
remaps which span multiple VMAs (though it does permit remaps that
constitute a part of a single VMA).

This means that a user must concern themselves with whether merges succeed
or not, should they wish to use mremap() in a way that results in multiple
mremap() calls being performed upon a mapping.
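
To make the failure mode concrete, here is a minimal sketch (assuming a
4 KiB page size, error handling omitted): a faulted anonymous page is moved
adjacent to an unfaulted VMA, the retained vm_pgoff prevents a merge, and a
subsequent mremap() spanning both fails:

	#define _GNU_SOURCE
	#include <errno.h>
	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t sz = 4096;
		char *base = mmap(NULL, 3 * sz, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		munmap(base + sz, sz);	/* leave a hole in the middle */
		base[0] = 'x';		/* fault page 0 so vma->anon_vma != NULL */

		/* Move the faulted page into the hole, adjacent to page 2. */
		mremap(base, sz, sz, MREMAP_MAYMOVE | MREMAP_FIXED, base + sz);

		/*
		 * The moved VMA retains its original vm_pgoff, so it cannot
		 * merge with the adjacent unfaulted VMA. Growing the two-page
		 * range therefore spans two VMAs and fails. (With
		 * MREMAP_RELOCATE_ANON specified on the move above, the VMAs
		 * could merge and this call would then succeed.)
		 */
		if (mremap(base + sz, 2 * sz, 3 * sz, MREMAP_MAYMOVE) == MAP_FAILED)
			printf("mremap: errno=%d (EFAULT=%d)\n", errno, EFAULT);
		return 0;
	}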

This series provides users with an option to accept the overhead of
actually updating the VMA and underlying folios via the
MREMAP_RELOCATE_ANON flag.

If MREMAP_RELOCATE_ANON is specified but an ordinary merge would already
result in the mremap() succeeding, then no attempt is made at relocating
folios, as this is not required.

Even if no merge is possible upon moving of the region, vma->vm_pgoff and
folio->index fields are appropriately updated in order that subsequent
mremap() or mprotect() calls will succeed in merging.

This flag falls back to the ordinary means of mremap() should the operation
not be feasible. It also transparently undoes the operation, carefully
holding rmap locks such that no racing rmap operation encounters incorrect
or missing VMAs.

In addition, the MREMAP_MUST_RELOCATE_ANON flag is supplied in case the
user needs to know whether or not the operation succeeded - this flag is
identical to MREMAP_RELOCATE_ANON, except that if the operation cannot
succeed, the mremap() fails with -EFAULT.

Note that no-op mremap() operations (such as an unpopulated range, or a
merge that would trivially succeed already) will succeed under
MREMAP_MUST_RELOCATE_ANON.
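
By way of illustration (not part of this series), usage might look as
follows - note that the flag value mirrors the uAPI addition in patch 1,
and a raw syscall is required as glibc otherwise filters the flags (see the
testing notes below):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef MREMAP_RELOCATE_ANON
	#define MREMAP_RELOCATE_ANON	8	/* uAPI value added by this series */
	#endif

	int main(void)
	{
		size_t len = 16 * 4096;
		char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		/* Reserve a destination region for a fixed move. */
		char *dst = mmap(NULL, len, PROT_NONE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		long ret;

		memset(src, 'x', len);	/* fault the mapping in */

		/* Raw syscall, as the glibc wrapper filters unknown flags. */
		ret = syscall(SYS_mremap, src, len, len,
			      MREMAP_MAYMOVE | MREMAP_FIXED |
			      MREMAP_RELOCATE_ANON, dst);
		if (ret == -1)
			perror("mremap");
		else
			printf("moved to %p, vm_pgoff/folio->index updated\n",
			       (void *)ret);
		return 0;
	}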

mremap() already walks page tables, so this isn't an order of magnitude
increase in workload, but it does add the need to walk to page table leaf
level and manipulate folios.

The operations all succeed under THP and in general are compatible with
underlying large folios of any size. In fact, the larger the folio, the
more efficient the operation is.

Performance testing indicates that the time taken using MREMAP_RELOCATE_ANON
is of the same order of magnitude as ordinary mremap() operations, with both
taking time in proportion to how much of the mapping is populated.

Of course, mremap() operations that are entirely aligned are significantly
faster as they need only move a VMA and a smaller number of higher order
page tables, but this is unavoidable.

Previous efforts in this area
=============================

An approach addressing this issue was previously suggested by Jakub Matena
in a series posted a few years ago in [0] (and discussed in a masters
thesis).

However, this was a more general effort which attempted to always make
anonymous mappings more mergeable, and therefore was not quite ready for
the upstream limelight. In addition, the large folio work which has
occurred since then requires us to carefully consider and account for it.

This series is more conservative and targeted (one must specify a flag to
get this behaviour) and additionally goes to great efforts to handle large
folios and account for all of the nitty-gritty locking concerns that might
arise in current kernel code.

Thanks goes out to Jakub for his efforts however, and hopefully this effort
to take a slightly different approach to the same problem is pleasing to
him regardless :)

[0]: https://lore.kernel.org/all/20220311174602.288010-1-matenajakub@gmail.com/

Use-cases
=========

* ZGC is a concurrent GC shipped with OpenJDK. A prototype is being worked
  upon which makes use of extensive mremap() operations to perform
  defragmentation of objects, taking advantage of the plentiful available
  virtual address space in a 64-bit system.

  In instances where one VMA is faulted in and another not, merging is not
  possible, which leads to significant, unreclaimable, kernel metadata
  overhead and contention on the vm.max_map_count limit.

  This series eliminates the issue entirely.
* It was indicated that Android similarly moves memory around and
  encounters the very same issues as ZGC.
* SUSE indicate they have encountered similar issues pertaining to an
  internal client.

Past approaches
===============

In discussions at LSF/MM/BPF it was suggested that we could make this an
madvise() operation; however, at that point it would be too late to
correctly perform the merge, requiring an unmap/remap, which would be
egregious.

It was further suggested that we simply defer the operation to the point at
which an mremap() is attempted on multiple immediately adjacent VMAs (that
is - to allow VMA fragmentation up until the point where it might cause
perceptible issues with uAPI).

This is problematic in that, in the first instance, you accrue
fragmentation, and you would only resolve it if you were to try to move the
fragmented objects again.

Additionally you would not be able to handle the mprotect() case, and you'd
have the same issue as the madvise() approach in that you'd need to
essentially re-map each VMA.

Additionally it would become non-trivial to correctly merge the VMAs - if
there were more than 3, we would need to invent a new merging mechanism
specifically for this, hold locks carefully over each to avoid them
disappearing from beneath us and introduce a great deal of non-optional
complexity.

While imperfect, the mremap() flag approach seems the least invasive, most
workable solution (until further rework of the anon_vma mechanism can be
achieved!)

Testing
=======

* Significantly expanded self-tests, all of which are passing.
* Explicit testing of forked cases including anon_vma reuse, all passing
  correctly.
* Ran all self tests with MREMAP_RELOCATE_ANON forced on for all anonymous
  mremap()'s.
* Ran heavy workloads with MREMAP_RELOCATE_ANON forced on on real hardware
  (kernel compilation, etc.)
* Ran stress-ng --mremap 32 for an hour with MREMAP_RELOCATE_ANON forced on
  on real hardware.

Series History
==============

Non-RFC:
* Rebased on mm-new and fixed merge conflicts, re-confirmed building and
  all tests passing.
* Seems to have settled down with all feedback previously raised addressed,
  so un-RFC'd to propose the series for mainline, timed for the start of
  the 6.16 rc cycle (thus targeting 6.17).

RFC v3:
* Rebased on and fixed conflicts against mm-new.
* Removed invalid use of folio_test_large_maybe_mapped_shared() in
  __relocate_large_folio() - this has since been removed and inlined (see
  [0]) anyway but we should be using folio_maybe_mapped_shared() here at
  any rate.
* Moved unnecessary folio large, ksm checks in __relocate_large_folio() to
  relocate_large_folio() - we already check this in relocate_anon_pte() so
  this is duplicated in that case.
* Added new tests explicitly checking that MREMAP_MUST_RELOCATE_ANON fails
  for forked processes, both forked children with parents as indicated by
  avc, and forked parents with children.
* Added anon_vma_assert_locked() helper.
* Removed vma_had_uncowed_children() as it was incorrectly implemented (it
  didn't account for grandchildren and descendants not being
  self-parented), and replaced it with a general
  vma_maybe_has_shared_anon_folios() function which checks both parent and
  child VMAs. Wei raised a concern in this area; this helps clarify and
  correct it.
* Converted the anon_vma vs. mmap lock check in
  vma_maybe_has_shared_anon_folios() to be more sensible and to assume the
  caller holds sufficient locks (checked with an assert).
* Added additional recipients based on recent MAINTAINERS changes.
* Added missing reference to Jakub's efforts in this area a few years ago
  to the cover letter. Thanks Jakub!
https://lore.kernel.org/all/cover.1746305604.git.lorenzo.stoakes@oracle.com/

RFC v2:
* Added folio_mapcount() check on relocate anon to assert exclusively
  mapped as per Jann.
* Added check for anon_vma->num_children > nr_pages in
  should_relocate_anon() as per Jann.
* Separated out vma_had_uncowed_parents() into shared helper function and
  added vma_had_uncowed_children() to implement the above.
* Add comment clarifying why we do not require an rmap lock on the old VMA
  due to fork requiring an mmap write lock which we hold.
* Corrected error path on __anon_vma_prepare() in copy_vma() as per Jann.
* Checked for folio pinning, aborting if a pin is in place. We do so because this
  implies the folio is being used by the kernel for a time longer than the
  time over which an mmap lock is held (which will not be held at the time
  of us manipulating the folio, as we hold the mmap write lock). We are
  manipulating mapping, index fields and being conservative (additionally
  mirroring what UFFDIO_MOVE does), we cannot assume that whoever holds the
  pin isn't somehow relying on these not being manipulated. As per David.
* Propagated mapcount, maybe DMA pinned checks to large folio logic.
* Added folio splitting - on second thoughts, it would be a bit silly to
  simply disallow the request because of large folio misalignment, work
  around this by splitting the folio in this instance.
* Added very careful handling around rmap lock, making use of
  folio_anon_vma(), to ensure we do not deadlock on anon_vma.
* Prefer vm_normal_folio() to vm_normal_page() & page_folio().
* Introduced has_shared_anon_vma() to de-duplicate shared anon_vma check.
* Provided sys_mremap() helper in vm_util.[ch] to be shared among test
  callers and de-duplicate. This must be a raw system call, as glibc will
  otherwise filter the flags.
* Expanded the mm CoW self-tests to explicitly test with
  MREMAP_RELOCATE_ANON for partial THP pages. This is useful as it
  exercises split_folio() code paths explicitly. Additionally some cases
  cannot succeed, so we also exercise undo paths.
* Added explicit lockdep handling to teach it that we are handling two
  distinct anon_vma locks so it doesn't spuriously report a deadlock.
* Updated anon_vma deadlock checks to check anon_vma->root. Shouldn't
  strictly be necessary as we explicitly limit ourselves to unforked
  anon_vma's, but it is more correct to do so, as this is where the lock is
  located.
* Expanded the split_huge_page_test.c test to also test using the
  MREMAP_RELOCATE_ANON flag, this is useful as it exercises the undo path.
https://lore.kernel.org/all/cover.1745307301.git.lorenzo.stoakes@oracle.com/

RFC v1:
https://lore.kernel.org/all/cover.1742478846.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (11):
  mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  mm/mremap: add MREMAP_MUST_RELOCATE_ANON
  mm/mremap: add MREMAP[_MUST]_RELOCATE_ANON support for large folios
  tools UAPI: Update copy of linux/mman.h from the kernel sources
  tools/testing/selftests: add sys_mremap() helper to vm_util.h
  tools/testing/selftests: add mremap() cases that merge normally
  tools/testing/selftests: add MREMAP_RELOCATE_ANON merge test cases
  tools/testing/selftests: expand mremap() tests for
    MREMAP_RELOCATE_ANON
  tools/testing/selftests: have CoW self test use MREMAP_RELOCATE_ANON
  tools/testing/selftests: test relocate anon in split huge page test
  tools/testing/selftests: add MREMAP_RELOCATE_ANON fork tests

 include/linux/rmap.h                          |    4 +
 include/uapi/linux/mman.h                     |    8 +-
 mm/internal.h                                 |    1 +
 mm/mremap.c                                   |  719 ++++++-
 mm/vma.c                                      |   77 +-
 mm/vma.h                                      |   36 +-
 tools/include/uapi/linux/mman.h               |    8 +-
 tools/testing/selftests/mm/cow.c              |   23 +-
 tools/testing/selftests/mm/merge.c            | 1690 ++++++++++++++++-
 tools/testing/selftests/mm/mremap_test.c      |  262 ++-
 .../selftests/mm/split_huge_page_test.c       |   25 +-
 tools/testing/selftests/mm/vm_util.c          |    8 +
 tools/testing/selftests/mm/vm_util.h          |    3 +
 tools/testing/vma/vma.c                       |    5 +-
 tools/testing/vma/vma_internal.h              |   38 +
 15 files changed, 2732 insertions(+), 175 deletions(-)

--
2.49.0


* [PATCH 01/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
@ 2025-06-09 13:26 ` Lorenzo Stoakes
  2025-06-16 20:58   ` David Hildenbrand
                     ` (2 more replies)
  2025-06-09 13:26 ` [PATCH 02/11] mm/mremap: add MREMAP_MUST_RELOCATE_ANON Lorenzo Stoakes
                   ` (12 subsequent siblings)
  13 siblings, 3 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 13:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

When mremap() moves a mapping around in memory, it goes to great lengths to
avoid having to walk page tables as this is expensive and
time-consuming.

Rather, if the VMA has been faulted (that is, vma->anon_vma != NULL), the
virtual page offset stored in the VMA at vma->vm_pgoff will remain the
same, as will all the folio indexes pointing at the associated anon_vma
object.

This means the VMA and page tables can simply be moved and this effects the
change (and if we can move page tables at a higher page table level, this
is even faster).

While this is efficient, it does lead to big problems with VMA merging - in
essence it causes faulted anonymous VMAs to not be mergeable under many
circumstances once moved.

This is limiting, and leads both to a proliferation of unreclaimable,
unmovable kernel metadata (VMAs, anon_vmas, anon_vma_chains) and to an
impact on further use of mremap(), which requires that the range moved
(which can also be a partial range within a VMA) span only a single VMA.

This makes the mergeability or not of VMAs in effect a uAPI concern.

In some use cases, users may wish to accept the overhead of actually going
to the trouble of updating VMAs and folios to effect mremap() moves. Let's
provide them with the choice.

This patch adds a new MREMAP_RELOCATE_ANON flag to do just that, which
attempts to perform such an operation. If it is unable to do so, it cleanly
falls back to the usual method.

It carefully takes the rmap locks such that at no time will a racing rmap
user encounter incorrect or missing VMAs.

It is also designed to interact cleanly with the existing mremap() error
fallback mechanism (inverting the remap should the page table move fail).

Also, if we can merge cleanly without such a change, we do so, avoiding the
overhead of the operation when it is not required.

In the instance that no merge may occur when the move is performed, we
still perform the folio and VMA updates to ensure that future mremap() or
mprotect() calls will result in merges.

In this implementation, we simply give up if we encounter large folios. A
subsequent commit will extend the functionality to allow for these cases.

We restrict this flag to purely anonymous memory only.

We separate out the vma_had_uncowed_parents() helper function for checking
in should_relocate_anon(), and introduce a new function,
vma_maybe_has_shared_anon_folios(), which combines a check against this
with a check against any forked child anon_vmas.

We carefully check for pinned folios in case a caller who holds a pin might
make assumptions about the index and mapping fields which we are about to
manipulate.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/rmap.h             |   4 +
 include/uapi/linux/mman.h        |   1 +
 mm/internal.h                    |   1 +
 mm/mremap.c                      | 403 +++++++++++++++++++++++++++++--
 mm/vma.c                         |  77 ++++--
 mm/vma.h                         |  36 ++-
 tools/testing/vma/vma.c          |   5 +-
 tools/testing/vma/vma_internal.h |  38 +++
 8 files changed, 520 insertions(+), 45 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c4f4903b1088..6d2b3fbe2df0 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -147,6 +147,10 @@ static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
 	up_read(&anon_vma->root->rwsem);
 }
 
+static inline void anon_vma_assert_locked(const struct anon_vma *anon_vma)
+{
+	rwsem_assert_held(&anon_vma->root->rwsem);
+}
 
 /*
  * anon_vma helper functions.
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index e89d00528f2f..d0542f872e0c 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -9,6 +9,7 @@
 #define MREMAP_MAYMOVE		1
 #define MREMAP_FIXED		2
 #define MREMAP_DONTUNMAP	4
+#define MREMAP_RELOCATE_ANON	8
 
 #define OVERCOMMIT_GUESS		0
 #define OVERCOMMIT_ALWAYS		1
diff --git a/mm/internal.h b/mm/internal.h
index 71eaea2db9b0..e18f8dcd9794 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -46,6 +46,7 @@ struct folio_batch;
 struct pagetable_move_control {
 	struct vm_area_struct *old; /* Source VMA. */
 	struct vm_area_struct *new; /* Destination VMA. */
+	struct vm_area_struct *relocate_locked; /* VMA which is rmap locked. */
 	unsigned long old_addr; /* Address from which the move begins. */
 	unsigned long old_end; /* Exclusive address at which old range ends. */
 	unsigned long new_addr; /* Address to move page tables to. */
diff --git a/mm/mremap.c b/mm/mremap.c
index 60f6b8d0d5f0..2da064f8c898 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -71,6 +71,15 @@ struct vma_remap_struct {
 	unsigned long charged;		/* If VM_ACCOUNT, # pages to account. */
 };
 
+/* Represents local PTE state. */
+struct pte_state {
+	unsigned long old_addr;
+	unsigned long new_addr;
+	unsigned long old_end;
+	pte_t *ptep;
+	spinlock_t *ptl;
+};
+
 static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
@@ -139,18 +148,50 @@ static pmd_t *alloc_new_pmd(struct mm_struct *mm, unsigned long addr)
 	return pmd;
 }
 
-static void take_rmap_locks(struct vm_area_struct *vma)
+/*
+ * Determine whether the old and new VMAs share the same anon_vma. If so, this
+ * has implications around locking and to avoid deadlock we need to tread
+ * carefully.
+ */
+static bool has_shared_anon_vma(struct pagetable_move_control *pmc)
+{
+	struct vm_area_struct *vma = pmc->old;
+	struct vm_area_struct *locked = pmc->relocate_locked;
+
+	if (!locked)
+		return false;
+
+	return vma->anon_vma->root == locked->anon_vma->root;
+}
+
+static void maybe_take_rmap_locks(struct pagetable_move_control *pmc)
 {
+	struct vm_area_struct *vma;
+	struct anon_vma *anon_vma;
+
+	if (!pmc->need_rmap_locks)
+		return;
+
+	vma = pmc->old;
+	anon_vma = vma->anon_vma;
 	if (vma->vm_file)
 		i_mmap_lock_write(vma->vm_file->f_mapping);
-	if (vma->anon_vma)
-		anon_vma_lock_write(vma->anon_vma);
+	if (anon_vma && !has_shared_anon_vma(pmc))
+		anon_vma_lock_write(anon_vma);
 }
 
-static void drop_rmap_locks(struct vm_area_struct *vma)
+static void maybe_drop_rmap_locks(struct pagetable_move_control *pmc)
 {
-	if (vma->anon_vma)
-		anon_vma_unlock_write(vma->anon_vma);
+	struct vm_area_struct *vma;
+	struct anon_vma *anon_vma;
+
+	if (!pmc->need_rmap_locks)
+		return;
+
+	vma = pmc->old;
+	anon_vma = vma->anon_vma;
+	if (anon_vma && !has_shared_anon_vma(pmc))
+		anon_vma_unlock_write(anon_vma);
 	if (vma->vm_file)
 		i_mmap_unlock_write(vma->vm_file->f_mapping);
 }
@@ -204,8 +245,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	 *   serialize access to individual ptes, but only rmap traversal
 	 *   order guarantees that we won't miss both the old and new ptes).
 	 */
-	if (pmc->need_rmap_locks)
-		take_rmap_locks(vma);
+	maybe_take_rmap_locks(pmc);
 
 	/*
 	 * We don't have to worry about the ordering of src and dst
@@ -280,8 +320,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	pte_unmap(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 out:
-	if (pmc->need_rmap_locks)
-		drop_rmap_locks(vma);
+	maybe_drop_rmap_locks(pmc);
 	return err;
 }
 
@@ -539,15 +578,14 @@ static __always_inline unsigned long get_extent(enum pgt_entry entry,
  * Should move_pgt_entry() acquire the rmap locks? This is either expressed in
  * the PMC, or overridden in the case of normal, larger page tables.
  */
-static bool should_take_rmap_locks(struct pagetable_move_control *pmc,
-				   enum pgt_entry entry)
+static bool should_take_rmap_locks(enum pgt_entry entry)
 {
 	switch (entry) {
 	case NORMAL_PMD:
 	case NORMAL_PUD:
 		return true;
 	default:
-		return pmc->need_rmap_locks;
+		return false;
 	}
 }
 
@@ -559,11 +597,15 @@ static bool move_pgt_entry(struct pagetable_move_control *pmc,
 			   enum pgt_entry entry, void *old_entry, void *new_entry)
 {
 	bool moved = false;
-	bool need_rmap_locks = should_take_rmap_locks(pmc, entry);
+	bool override_locks = false;
 
-	/* See comment in move_ptes() */
-	if (need_rmap_locks)
-		take_rmap_locks(pmc->old);
+	if (!pmc->need_rmap_locks && should_take_rmap_locks(entry)) {
+		override_locks = true;
+
+		pmc->need_rmap_locks = true;
+		/* See comment in move_ptes() */
+		maybe_take_rmap_locks(pmc);
+	}
 
 	switch (entry) {
 	case NORMAL_PMD:
@@ -587,8 +629,9 @@ static bool move_pgt_entry(struct pagetable_move_control *pmc,
 		break;
 	}
 
-	if (need_rmap_locks)
-		drop_rmap_locks(pmc->old);
+	maybe_drop_rmap_locks(pmc);
+	if (override_locks)
+		pmc->need_rmap_locks = false;
 
 	return moved;
 }
@@ -754,6 +797,209 @@ static unsigned long pmc_progress(struct pagetable_move_control *pmc)
 	return old_addr < orig_old_addr ? 0 : old_addr - orig_old_addr;
 }
 
+/*
+ * If the folio mapped at the specified pte entry can have its index and mapping
+ * relocated, then do so.
+ *
+ * Returns the number of pages we have traversed, or 0 if the operation failed.
+ */
+static unsigned long relocate_anon_pte(struct pagetable_move_control *pmc,
+		struct pte_state *state, bool undo)
+{
+	struct folio *folio;
+	struct vm_area_struct *old, *new;
+	pgoff_t new_index;
+	pte_t pte;
+	unsigned long ret = 1;
+	unsigned long old_addr = state->old_addr;
+	unsigned long new_addr = state->new_addr;
+
+	old = pmc->old;
+	new = pmc->new;
+
+	pte = ptep_get(state->ptep);
+
+	/* Ensure we have truly got an anon folio. */
+	folio = vm_normal_folio(old, old_addr, pte);
+	if (!folio)
+		return ret;
+
+	folio_lock(folio);
+
+	/* No-op. */
+	if (!folio_test_anon(folio) || folio_test_ksm(folio))
+		goto out;
+
+	/*
+	 * This should never be the case as we have already checked to ensure
+	 * that the anon_vma is not forked, and we have just asserted that it is
+	 * anonymous.
+	 */
+	if (WARN_ON_ONCE(folio_maybe_mapped_shared(folio)))
+		goto out;
+	/* The above check should imply these. */
+	VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
+	VM_WARN_ON_ONCE(!PageAnonExclusive(folio_page(folio, 0)));
+
+	/*
+	 * A pinned folio implies that it will be used for a duration longer
+	 * than that over which the mmap_lock is held, meaning that another part
+	 * of the kernel may be making use of this folio.
+	 *
+	 * Since we are about to manipulate index & mapping fields, we cannot
+	 * safely proceed because whatever has pinned this folio may then
+	 * incorrectly assume these do not change.
+	 */
+	if (folio_maybe_dma_pinned(folio))
+		goto out;
+
+	/*
+	 * This should not happen as we explicitly disallow this, but check
+	 * anyway.
+	 */
+	if (folio_test_large(folio)) {
+		ret = 0;
+		goto out;
+	}
+
+	if (!undo)
+		new_index = linear_page_index(new, new_addr);
+	else
+		new_index = linear_page_index(old, old_addr);
+
+	/*
+	 * The PTL should keep us safe from unmapping, and the fact the folio is
+	 * a PTE keeps the folio referenced.
+	 *
+	 * The mmap/VMA locks should keep us safe from fork and other processes.
+	 *
+	 * The rmap locks should keep us safe from anything happening to the
+	 * VMA/anon_vma.
+	 *
+	 * The folio lock should keep us safe from reclaim, migration, etc.
+	 */
+	folio_move_anon_rmap(folio, undo ? old : new);
+	WRITE_ONCE(folio->index, new_index);
+
+out:
+	folio_unlock(folio);
+	return ret;
+}
+
+static bool pte_done(struct pte_state *state)
+{
+	return state->old_addr >= state->old_end;
+}
+
+static void pte_next(struct pte_state *state, unsigned long nr_pages)
+{
+	state->old_addr += nr_pages * PAGE_SIZE;
+	state->new_addr += nr_pages * PAGE_SIZE;
+	state->ptep += nr_pages;
+}
+
+static bool relocate_anon_ptes(struct pagetable_move_control *pmc,
+		unsigned long extent, pmd_t *pmdp, bool undo)
+{
+	struct mm_struct *mm = current->mm;
+	struct pte_state state = {
+		.old_addr = pmc->old_addr,
+		.new_addr = pmc->new_addr,
+		.old_end = pmc->old_addr + extent,
+	};
+	pte_t *ptep_start;
+	bool ret;
+	unsigned long nr_pages;
+
+	ptep_start = pte_offset_map_lock(mm, pmdp, pmc->old_addr, &state.ptl);
+	/*
+	 * We prevent faults with mmap write lock, hold the rmap lock and should
+	 * not fail to obtain this lock. Just give up if we can't.
+	 */
+	if (!ptep_start)
+		return false;
+
+	state.ptep = ptep_start;
+	for (; !pte_done(&state); pte_next(&state, nr_pages)) {
+		pte_t pte = ptep_get(state.ptep);
+
+		if (pte_none(pte) || !pte_present(pte)) {
+			nr_pages = 1;
+			continue;
+		}
+
+		nr_pages = relocate_anon_pte(pmc, &state, undo);
+		if (!nr_pages) {
+			ret = false;
+			goto out;
+		}
+	}
+
+	ret = true;
+out:
+	pte_unmap_unlock(ptep_start, state.ptl);
+	return ret;
+}
+
+static bool __relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
+{
+	pud_t *pudp;
+	pmd_t *pmdp;
+	unsigned long extent;
+	struct mm_struct *mm = current->mm;
+
+	if (!pmc->len_in)
+		return true;
+
+	for (; !pmc_done(pmc); pmc_next(pmc, extent)) {
+		pmd_t pmd;
+		pud_t pud;
+
+		extent = get_extent(NORMAL_PUD, pmc);
+
+		pudp = get_old_pud(mm, pmc->old_addr);
+		if (!pudp)
+			continue;
+		pud = pudp_get(pudp);
+
+		if (pud_trans_huge(pud) || pud_devmap(pud))
+			return false;
+
+		extent = get_extent(NORMAL_PMD, pmc);
+		pmdp = get_old_pmd(mm, pmc->old_addr);
+		if (!pmdp)
+			continue;
+		pmd = pmdp_get(pmdp);
+
+		if (is_swap_pmd(pmd) || pmd_trans_huge(pmd) ||
+		    pmd_devmap(pmd))
+			return false;
+
+		if (pmd_none(pmd))
+			continue;
+
+		if (!relocate_anon_ptes(pmc, extent, pmdp, undo))
+			return false;
+	}
+
+	return true;
+}
+
+static bool relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
+{
+	unsigned long old_addr = pmc->old_addr;
+	unsigned long new_addr = pmc->new_addr;
+	bool ret;
+
+	ret = __relocate_anon_folios(pmc, undo);
+
+	/* Reset state ready for retry. */
+	pmc->old_addr = old_addr;
+	pmc->new_addr = new_addr;
+
+	return ret;
+}
+
 unsigned long move_page_tables(struct pagetable_move_control *pmc)
 {
 	unsigned long extent;
@@ -1134,6 +1380,67 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
 	}
 }
 
+/*
+ * Should we attempt to relocate anonymous folios to the location that the VMA
+ * is being moved to by updating index and mapping fields accordingly?
+ */
+static bool should_relocate_anon(struct vma_remap_struct *vrm,
+	struct pagetable_move_control *pmc)
+{
+	struct vm_area_struct *old = vrm->vma;
+
+	/* Currently we only do this if requested. */
+	if (!(vrm->flags & MREMAP_RELOCATE_ANON))
+		return false;
+
+	/* We can't deal with special or hugetlb mappings. */
+	if (old->vm_flags & (VM_SPECIAL | VM_HUGETLB))
+		return false;
+
+	/* We only support anonymous mappings. */
+	if (!vma_is_anonymous(old))
+		return false;
+
+	/* If no folios are mapped, then no need to attempt this. */
+	if (!old->anon_vma)
+		return false;
+
+	/* We don't allow relocation of non-exclusive folios. */
+	if (vma_maybe_has_shared_anon_folios(old))
+		return false;
+
+	/* Otherwise, we're good to go! */
+	return true;
+}
+
+static void lock_new_anon_vma(struct vm_area_struct *new_vma)
+{
+	/*
+	 * We have a new VMA to reassign folios to. We take a lock on
+	 * its anon_vma so reclaim doesn't fail to unmap mappings.
+	 *
+	 * We have acquired a VMA write lock by now (in vma_link()), so
+	 * we do not have to worry about racing faults.
+	 *
+	 * NOTE: we do NOT need to acquire an rmap lock on the old VMA,
+	 * as forks require an mmap write lock, which we hold.
+	 */
+	anon_vma_lock_write(new_vma->anon_vma);
+
+	/*
+	 * lockdep is unable to differentiate between the anon_vma lock we take
+	 * in the old VMA and the one we are taking here in the new VMA.
+	 *
+	 * In each instance where the old VMA might have its anon_vma
+	 * lock taken, we explicitly check to ensure they are not one
+	 * and the same, avoiding deadlock.
+	 *
+	 * Express this to lockdep through a subclass.
+	 */
+	lock_set_subclass(&new_vma->anon_vma->root->rwsem.dep_map, 1,
+			  _THIS_IP_);
+}
+
 /*
  * Copy vrm->vma over to vrm->new_addr possibly adjusting size as part of the
  * process. Additionally handle an error occurring on moving of page tables,
@@ -1153,9 +1460,11 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
 	struct vm_area_struct *new_vma;
 	int err = 0;
 	PAGETABLE_MOVE(pmc, NULL, NULL, vrm->addr, vrm->new_addr, vrm->old_len);
+	bool relocate_anon = should_relocate_anon(vrm, &pmc);
 
+again:
 	new_vma = copy_vma(&vma, vrm->new_addr, vrm->new_len, new_pgoff,
-			   &pmc.need_rmap_locks);
+			   &pmc.need_rmap_locks, &relocate_anon);
 	if (!new_vma) {
 		vrm_uncharge(vrm);
 		*new_vma_ptr = NULL;
@@ -1165,12 +1474,59 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
 	pmc.old = vma;
 	pmc.new = new_vma;
 
+	if (relocate_anon) {
+		lock_new_anon_vma(new_vma);
+		pmc.relocate_locked = new_vma;
+
+		if (!relocate_anon_folios(&pmc, /* undo= */false)) {
+			unsigned long start = new_vma->vm_start;
+			unsigned long size = new_vma->vm_end - start;
+
+			/* Undo if fails. */
+			relocate_anon_folios(&pmc, /* undo= */true);
+			vrm_stat_account(vrm, vrm->new_len);
+
+			anon_vma_unlock_write(new_vma->anon_vma);
+			pmc.relocate_locked = NULL;
+
+			do_munmap(current->mm, start, size, NULL);
+			relocate_anon = false;
+			goto again;
+		}
+	}
+
 	moved_len = move_page_tables(&pmc);
 	if (moved_len < vrm->old_len)
 		err = -ENOMEM;
 	else if (vma->vm_ops && vma->vm_ops->mremap)
 		err = vma->vm_ops->mremap(new_vma);
 
+	if (unlikely(err && relocate_anon)) {
+		relocate_anon_folios(&pmc, /* undo= */true);
+		anon_vma_unlock_write(new_vma->anon_vma);
+		pmc.relocate_locked = NULL;
+	} else if (relocate_anon /* && !err */) {
+		unsigned long addr = vrm->new_addr;
+		unsigned long end = addr + vrm->new_len;
+		VMA_ITERATOR(vmi, vma->vm_mm, addr);
+		VMG_VMA_STATE(vmg, &vmi, NULL, new_vma, addr, end);
+		struct vm_area_struct *merged;
+
+		/*
+		 * Now we have successfully copied page tables and set up
+		 * folios, we can safely drop the anon_vma lock.
+		 */
+		anon_vma_unlock_write(new_vma->anon_vma);
+		pmc.relocate_locked = NULL;
+
+		/* Let's try merge again... */
+		vmg.prev = vma_prev(&vmi);
+		vma_next(&vmi);
+		merged = vma_merge_existing_range(&vmg);
+		if (merged)
+			new_vma = merged;
+	}
+
 	if (unlikely(err)) {
 		PAGETABLE_MOVE(pmc_revert, new_vma, vma, vrm->new_addr,
 			       vrm->addr, moved_len);
@@ -1483,7 +1839,8 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
 	unsigned long flags = vrm->flags;
 
 	/* Ensure no unexpected flag values. */
-	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP))
+	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP |
+		      MREMAP_RELOCATE_ANON))
 		return -EINVAL;
 
 	/* Start address must be page-aligned. */
@@ -1498,6 +1855,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
 	if (!PAGE_ALIGN(vrm->new_len))
 		return -EINVAL;
 
+	/* We can't relocate without allowing a move. */
+	if ((flags & MREMAP_RELOCATE_ANON) && !(flags & MREMAP_MAYMOVE))
+		return -EINVAL;
+
 	/* Remainder of checks are for cases with specific new_addr. */
 	if (!vrm_implies_new_addr(vrm))
 		return 0;
diff --git a/mm/vma.c b/mm/vma.c
index 01b1d26d87b4..326cfec70f9c 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -62,22 +62,6 @@ struct mmap_state {
 		.state = VMA_MERGE_START,				\
 	}
 
-/*
- * If, at any point, the VMA had unCoW'd mappings from parents, it will maintain
- * more than one anon_vma_chain connecting it to more than one anon_vma. A merge
- * would mean a wider range of folios sharing the root anon_vma lock, and thus
- * potential lock contention, we do not wish to encourage merging such that this
- * scales to a problem.
- */
-static bool vma_had_uncowed_parents(struct vm_area_struct *vma)
-{
-	/*
-	 * The list_is_singular() test is to avoid merging VMA cloned from
-	 * parents. This can improve scalability caused by anon_vma lock.
-	 */
-	return vma && vma->anon_vma && !list_is_singular(&vma->anon_vma_chain);
-}
-
 static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next)
 {
 	struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev;
@@ -801,8 +785,7 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma)
  * - The caller must hold a WRITE lock on the mm_struct->mmap_lock.
  * - vmi must be positioned within [@vmg->middle->vm_start, @vmg->middle->vm_end).
  */
-static __must_check struct vm_area_struct *vma_merge_existing_range(
-		struct vma_merge_struct *vmg)
+struct vm_area_struct *vma_merge_existing_range(struct vma_merge_struct *vmg)
 {
 	struct vm_area_struct *middle = vmg->middle;
 	struct vm_area_struct *prev = vmg->prev;
@@ -1803,7 +1786,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
  */
 struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	unsigned long addr, unsigned long len, pgoff_t pgoff,
-	bool *need_rmap_locks)
+	bool *need_rmap_locks, bool *relocate_anon)
 {
 	struct vm_area_struct *vma = *vmap;
 	unsigned long vma_start = vma->vm_start;
@@ -1837,7 +1820,19 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	vmg.middle = NULL; /* New VMA range. */
 	vmg.pgoff = pgoff;
 	vmg.next = vma_iter_next_rewind(&vmi, NULL);
+
 	new_vma = vma_merge_new_range(&vmg);
+	if (*relocate_anon) {
+		/*
+		 * If merge succeeds, no need to relocate. Otherwise, reset
+		 * pgoff for newly established VMA which we will relocate folios
+		 * to.
+		 */
+		if (new_vma)
+			*relocate_anon = false;
+		else
+			pgoff = addr >> PAGE_SHIFT;
+	}
 
 	if (new_vma) {
 		/*
@@ -1868,7 +1863,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		vma_set_range(new_vma, addr, addr + len, pgoff);
 		if (vma_dup_policy(vma, new_vma))
 			goto out_free_vma;
-		if (anon_vma_clone(new_vma, vma))
+		if (*relocate_anon)
+			new_vma->anon_vma = NULL;
+		else if (anon_vma_clone(new_vma, vma))
 			goto out_free_mempol;
 		if (new_vma->vm_file)
 			get_file(new_vma->vm_file);
@@ -1876,6 +1873,21 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			new_vma->vm_ops->open(new_vma);
 		if (vma_link(mm, new_vma))
 			goto out_vma_link;
+		/*
+		 * If we're attempting to relocate anonymous VMAs, we
+		 * don't want to reuse an anon_vma as set by
+		 * vm_area_dup(), or copy anon_vma_chain or anything
+		 * like this.
+		 */
+		if (*relocate_anon && __anon_vma_prepare(new_vma)) {
+			/*
+			 * We have already linked this VMA, so we must now unmap
+			 * it to unwind this. This is best effort.
+			 */
+			do_munmap(mm, addr, len, NULL);
+			return NULL;
+		}
+
 		*need_rmap_locks = false;
 	}
 	return new_vma;
@@ -3153,7 +3165,6 @@ int __vm_munmap(unsigned long start, size_t len, bool unlock)
 	return ret;
 }
 
-
 /* Insert vm structure into process list sorted by address
  * and into the inode's i_mmap tree.  If vm_file is non-NULL
  * then i_mmap_rwsem is taken here.
@@ -3195,3 +3206,27 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 
 	return 0;
 }
+bool vma_maybe_has_shared_anon_folios(struct vm_area_struct *vma)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+	unsigned long expected_children;
+
+	/* Trivially fine. */
+	if (!anon_vma)
+		return false;
+
+	/* Currently or previously shares unCoW'd memory with parent(s). */
+	if (vma_had_uncowed_parents(vma))
+		return true;
+
+	/* mmap lock is sufficient as it would prevent num_children changing. */
+	if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
+		anon_vma_assert_locked(anon_vma);
+
+	expected_children = 0;
+	/* The root anon_vma is self-parented. */
+	if (anon_vma == anon_vma->root)
+		expected_children++;
+
+	return anon_vma->num_children > expected_children;
+}
diff --git a/mm/vma.h b/mm/vma.h
index 0db066e7a45d..f976da8f1b76 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -274,6 +274,9 @@ __must_check struct vm_area_struct
 __must_check struct vm_area_struct
 *vma_merge_new_range(struct vma_merge_struct *vmg);
 
+__must_check struct vm_area_struct
+*vma_merge_existing_range(struct vma_merge_struct *vmg);
+
 __must_check struct vm_area_struct
 *vma_merge_extend(struct vma_iterator *vmi,
 		  struct vm_area_struct *vma,
@@ -294,7 +297,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma);
 
 struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	unsigned long addr, unsigned long len, pgoff_t pgoff,
-	bool *need_rmap_locks);
+	bool *need_rmap_locks, bool *relocate_anon);
 
 struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma);
 
@@ -512,6 +515,37 @@ struct vm_area_struct *vma_iter_next_rewind(struct vma_iterator *vmi,
 	return next;
 }
 
+/*
+ * Is this VMA either the parent of forked processes or the child of a forking
+ * process which may possess an unCOW'd reference to a shared folio?
+ */
+bool vma_maybe_has_shared_anon_folios(struct vm_area_struct *vma);
+
+/*
+ * If, at any point, the VMA had unCoW'd mappings from parents, it will maintain
+ * more than one anon_vma_chain connecting it to more than one anon_vma. A merge
+ * would mean a wider range of folios sharing the root anon_vma lock, and thus
+ * potential lock contention, we do not wish to encourage merging such that this
+ * scales to a problem.
+ *
+ * Assumes VMA is locked.
+ */
+static inline bool vma_had_uncowed_parents(struct vm_area_struct *vma)
+{
+	/*
+	 * The list_is_singular() test is to avoid merging VMA cloned from
+	 * parents. This can improve scalability caused by anon_vma lock.
+	 */
+	return vma && vma->anon_vma && !list_is_singular(&vma->anon_vma_chain);
+}
+
+/*
+ * If, at any point, folios mapped by the VMA had unCoW'd mappings potentially
+ * present in child processes forked from this one, then the underlying mapped
+ * folios may be non-exclusively mapped.
+ */
+bool vma_had_uncowed_children(struct vm_area_struct *vma);
+
 #ifdef CONFIG_64BIT
 
 static inline bool vma_is_sealed(struct vm_area_struct *vma)
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
index 2be7597a2ac2..238acd4e20fd 100644
--- a/tools/testing/vma/vma.c
+++ b/tools/testing/vma/vma.c
@@ -1551,13 +1551,14 @@ static bool test_copy_vma(void)
 	unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
 	struct mm_struct mm = {};
 	bool need_locks = false;
+	bool relocate_anon = false;
 	VMA_ITERATOR(vmi, &mm, 0);
 	struct vm_area_struct *vma, *vma_new, *vma_next;
 
 	/* Move backwards and do not merge. */
 
 	vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags);
-	vma_new = copy_vma(&vma, 0, 0x2000, 0, &need_locks);
+	vma_new = copy_vma(&vma, 0, 0x2000, 0, &need_locks, &relocate_anon);
 	ASSERT_NE(vma_new, vma);
 	ASSERT_EQ(vma_new->vm_start, 0);
 	ASSERT_EQ(vma_new->vm_end, 0x2000);
@@ -1570,7 +1571,7 @@ static bool test_copy_vma(void)
 
 	vma = alloc_and_link_vma(&mm, 0, 0x2000, 0, flags);
 	vma_next = alloc_and_link_vma(&mm, 0x6000, 0x8000, 6, flags);
-	vma_new = copy_vma(&vma, 0x4000, 0x2000, 4, &need_locks);
+	vma_new = copy_vma(&vma, 0x4000, 0x2000, 4, &need_locks, &relocate_anon);
 	vma_assert_attached(vma_new);
 
 	ASSERT_EQ(vma_new, vma_next);
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 77b2949d874a..636dd94ebdf0 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -26,6 +26,7 @@
 #include <linux/mm.h>
 #include <linux/rbtree.h>
 #include <linux/refcount.h>
+#include <linux/rwsem.h>
 
 extern unsigned long stack_guard_gap;
 #ifdef CONFIG_MMU
@@ -196,6 +197,8 @@ struct anon_vma {
 	struct anon_vma *root;
 	struct rb_root_cached rb_root;
 
+	unsigned long num_children;
+
 	/* Test fields. */
 	bool was_cloned;
 	bool was_unlinked;
@@ -251,6 +254,8 @@ struct mm_struct {
 	unsigned long def_flags;
 
 	unsigned long flags; /* Must use atomic bitops to access */
+
+	struct rw_semaphore mmap_lock;
 };
 
 struct vm_area_struct;
@@ -1401,6 +1406,17 @@ static inline int ksm_execve(struct mm_struct *mm)
 	return 0;
 }
 
+static int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+		struct list_head *uf)
+{
+	(void)mm;
+	(void)start;
+	(void)len;
+	(void)uf;
+
+	return 0;
+}
+
 static inline void ksm_exit(struct mm_struct *mm)
 {
 	(void)mm;
@@ -1479,4 +1495,26 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
 	return vm_flags;
 }
 
+static inline int rwsem_is_locked(struct rw_semaphore *sem)
+{
+	(void)sem;
+
+	return 0;
+}
+
+static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
+{
+	(void)anon_vma;
+}
+
+static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
+{
+	(void)anon_vma;
+}
+
+static inline void anon_vma_assert_locked(const struct anon_vma *anon_vma)
+{
+	(void)anon_vma;
+}
+
 #endif	/* __MM_VMA_INTERNAL_H */
-- 
2.49.0



* [PATCH 02/11] mm/mremap: add MREMAP_MUST_RELOCATE_ANON
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
  2025-06-09 13:26 ` [PATCH 01/11] " Lorenzo Stoakes
@ 2025-06-09 13:26 ` Lorenzo Stoakes
  2025-06-09 13:26 ` [PATCH 03/11] mm/mremap: add MREMAP[_MUST]_RELOCATE_ANON support for large folios Lorenzo Stoakes
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 13:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

This flag is the same as MREMAP_RELOCATE_ANON; however, it returns an
-EFAULT error should the folios not be able to be relocated.

The operation is undone when this occurs so the user can choose to proceed
without setting this flag at this stage.

This is useful for cases where a use case absolutely requires
mergeability, or where a user needs to know whether or not it succeeded for
internal bookkeeping purposes.

If the move would be a no-op (could be merged, or folios in range are
unmapped), then the operation proceeds normally.

It is only in instances where we would have fallen back to the usual
mremap() logic if we were using MREMAP_RELOCATE_ANON that we return -EFAULT
for MREMAP_MUST_RELOCATE_ANON.
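
A minimal sketch of how a caller might make use of this for bookkeeping
(move_region() is a hypothetical helper; the flag value matches the uAPI
addition below, and a raw syscall is used as glibc filters unknown flags):

	#define _GNU_SOURCE
	#include <errno.h>
	#include <stddef.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef MREMAP_MUST_RELOCATE_ANON
	#define MREMAP_MUST_RELOCATE_ANON	16	/* uAPI value added here */
	#endif

	/* Move a region to 'new', tracking whether it remained mergeable. */
	static void *move_region(void *old, size_t len, void *new, int *mergeable)
	{
		long ret = syscall(SYS_mremap, old, len, len,
				   MREMAP_MAYMOVE | MREMAP_FIXED |
				   MREMAP_MUST_RELOCATE_ANON, new);

		*mergeable = 1;
		if (ret != -1)
			return (void *)ret;
		if (errno != EFAULT)
			return MAP_FAILED;

		/* Folios could not be relocated - note it and fall back. */
		*mergeable = 0;
		ret = syscall(SYS_mremap, old, len, len,
			      MREMAP_MAYMOVE | MREMAP_FIXED, new);
		return ret == -1 ? MAP_FAILED : (void *)ret;
	}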

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/uapi/linux/mman.h |  9 +++++----
 mm/mremap.c               | 35 ++++++++++++++++++++++++++---------
 2 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index d0542f872e0c..a61dbe1e8b2b 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -6,10 +6,11 @@
 #include <asm-generic/hugetlb_encode.h>
 #include <linux/types.h>
 
-#define MREMAP_MAYMOVE		1
-#define MREMAP_FIXED		2
-#define MREMAP_DONTUNMAP	4
-#define MREMAP_RELOCATE_ANON	8
+#define MREMAP_MAYMOVE			1
+#define MREMAP_FIXED			2
+#define MREMAP_DONTUNMAP		4
+#define MREMAP_RELOCATE_ANON		8
+#define MREMAP_MUST_RELOCATE_ANON	16
 
 #define OVERCOMMIT_GUESS		0
 #define OVERCOMMIT_ALWAYS		1
diff --git a/mm/mremap.c b/mm/mremap.c
index 2da064f8c898..41158aea8c29 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1385,14 +1385,18 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
  * is being moved to by updating index and mapping fields accordingly?
  */
 static bool should_relocate_anon(struct vma_remap_struct *vrm,
-	struct pagetable_move_control *pmc)
+	struct pagetable_move_control *pmc, int *errp)
 {
 	struct vm_area_struct *old = vrm->vma;
 
 	/* Currently we only do this if requested. */
-	if (!(vrm->flags & MREMAP_RELOCATE_ANON))
+	if (!(vrm->flags & (MREMAP_RELOCATE_ANON | MREMAP_MUST_RELOCATE_ANON)))
 		return false;
 
+	/* Failures are fatal in the 'must' case. */
+	if (vrm->flags & MREMAP_MUST_RELOCATE_ANON)
+		*errp = -EFAULT;
+
 	/* We can't deal with special or hugetlb mappings. */
 	if (old->vm_flags & (VM_SPECIAL | VM_HUGETLB))
 		return false;
@@ -1401,14 +1405,17 @@ static bool should_relocate_anon(struct vma_remap_struct *vrm,
 	if (!vma_is_anonymous(old))
 		return false;
 
-	/* If no folios are mapped, then no need to attempt this. */
-	if (!old->anon_vma)
-		return false;
-
 	/* We don't allow relocation of non-exclusive folios. */
 	if (vma_maybe_has_shared_anon_folios(old))
 		return false;
 
+	/* Below issues are non-fatal in 'must' case. */
+	*errp = 0;
+
+	/* If no folios are mapped, then no need to attempt this. */
+	if (!old->anon_vma)
+		return false;
+
 	/* Otherwise, we're good to go! */
 	return true;
 }
@@ -1460,7 +1467,10 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
 	struct vm_area_struct *new_vma;
 	int err = 0;
 	PAGETABLE_MOVE(pmc, NULL, NULL, vrm->addr, vrm->new_addr, vrm->old_len);
-	bool relocate_anon = should_relocate_anon(vrm, &pmc);
+	bool relocate_anon = should_relocate_anon(vrm, &pmc, &err);
+
+	if (err)
+		return err;
 
 again:
 	new_vma = copy_vma(&vma, vrm->new_addr, vrm->new_len, new_pgoff,
@@ -1491,6 +1501,12 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
 
 			do_munmap(current->mm, start, size, NULL);
 			relocate_anon = false;
+			if (vrm->flags & MREMAP_MUST_RELOCATE_ANON) {
+				vrm_uncharge(vrm);
+				*new_vma_ptr = NULL;
+				return -EFAULT;
+			}
+
 			goto again;
 		}
 	}
@@ -1840,7 +1856,7 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
 
 	/* Ensure no unexpected flag values. */
 	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP |
-		      MREMAP_RELOCATE_ANON))
+		      MREMAP_RELOCATE_ANON | MREMAP_MUST_RELOCATE_ANON))
 		return -EINVAL;
 
 	/* Start address must be page-aligned. */
@@ -1856,7 +1872,8 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
 		return -EINVAL;
 
 	/* We can't relocate without allowing a move. */
-	if ((flags & MREMAP_RELOCATE_ANON) && !(flags & MREMAP_MAYMOVE))
+	if ((flags & (MREMAP_RELOCATE_ANON | MREMAP_MUST_RELOCATE_ANON)) &&
+	     !(flags & MREMAP_MAYMOVE))
 		return -EINVAL;
 
 	/* Remainder of checks are for cases with specific new_addr. */
-- 
2.49.0



* [PATCH 03/11] mm/mremap: add MREMAP[_MUST]_RELOCATE_ANON support for large folios
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
  2025-06-09 13:26 ` [PATCH 01/11] " Lorenzo Stoakes
  2025-06-09 13:26 ` [PATCH 02/11] mm/mremap: add MREMAP_MUST_RELOCATE_ANON Lorenzo Stoakes
@ 2025-06-09 13:26 ` Lorenzo Stoakes
  2025-06-09 13:26 ` [PATCH 04/11] tools UAPI: Update copy of linux/mman.h from the kernel sources Lorenzo Stoakes
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 13:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

Larger folios are a challenge, as they might be mapped across multiple
VMAs, and can be mapped at a higher page table level (PUD, PMD) or also at
PTE level.

Handle them correctly by checking whether they are fully spanned by the VMA
we are examining. If so, then we can simply relocate the folio as we would
any other.

If not, then we must split the folio. If there is a higher-level page table
entry mapping the large folio directly, then we must also split that.

This will be the minority of cases; if the operation is performed on a
large mapping, the only folios affected will be those at the start and end
of the mapping to which the mapping is not aligned.

The net result is that we are able to handle large folios mapped in any
form which might be encountered correctly.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mremap.c | 327 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 297 insertions(+), 30 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 41158aea8c29..d901438ae415 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -77,6 +77,7 @@ struct pte_state {
 	unsigned long new_addr;
 	unsigned long old_end;
 	pte_t *ptep;
+	pmd_t *pmdp;
 	spinlock_t *ptl;
 };
 
@@ -534,40 +535,67 @@ enum pgt_entry {
 	HPAGE_PUD,
 };
 
-/*
- * Returns an extent of the corresponding size for the pgt_entry specified if
- * valid. Else returns a smaller extent bounded by the end of the source and
- * destination pgt_entry.
- */
-static __always_inline unsigned long get_extent(enum pgt_entry entry,
-						struct pagetable_move_control *pmc)
+static void __get_mask_size(enum pgt_entry entry,
+		unsigned long *mask, unsigned long *size)
 {
-	unsigned long next, extent, mask, size;
-	unsigned long old_addr = pmc->old_addr;
-	unsigned long old_end = pmc->old_end;
-	unsigned long new_addr = pmc->new_addr;
-
 	switch (entry) {
 	case HPAGE_PMD:
 	case NORMAL_PMD:
-		mask = PMD_MASK;
-		size = PMD_SIZE;
+		*mask = PMD_MASK;
+		*size = PMD_SIZE;
 		break;
 	case HPAGE_PUD:
 	case NORMAL_PUD:
-		mask = PUD_MASK;
-		size = PUD_SIZE;
+		*mask = PUD_MASK;
+		*size = PUD_SIZE;
 		break;
 	default:
 		BUILD_BUG();
 		break;
 	}
+}
+
+/* Same as get extent, only ignores new address.  */
+static unsigned long __get_old_extent(struct pagetable_move_control *pmc,
+		unsigned long mask, unsigned long size)
+{
+	unsigned long next, extent;
+	unsigned long old_addr = pmc->old_addr;
+	unsigned long old_end = pmc->old_end;
 
 	next = (old_addr + size) & mask;
 	/* even if next overflowed, extent below will be ok */
 	extent = next - old_addr;
 	if (extent > old_end - old_addr)
 		extent = old_end - old_addr;
+
+	return extent;
+}
+
+static unsigned long get_old_extent(enum pgt_entry entry,
+		struct pagetable_move_control *pmc)
+{
+	unsigned long mask, size;
+
+	__get_mask_size(entry, &mask, &size);
+	return __get_old_extent(pmc, mask, size);
+}
+
+/*
+ * Returns an extent of the corresponding size for the pgt_entry specified if
+ * valid. Else returns a smaller extent bounded by the end of the source and
+ * destination pgt_entry.
+ */
+static __always_inline unsigned long get_extent(enum pgt_entry entry,
+						struct pagetable_move_control *pmc)
+{
+	unsigned long next, extent, mask, size;
+	unsigned long new_addr = pmc->new_addr;
+
+	__get_mask_size(entry, &mask, &size);
+
+	extent = __get_old_extent(pmc, mask, size);
+
 	next = (new_addr + size) & mask;
 	if (extent > next - new_addr)
 		extent = next - new_addr;
@@ -797,6 +825,165 @@ static unsigned long pmc_progress(struct pagetable_move_control *pmc)
 	return old_addr < orig_old_addr ? 0 : old_addr - orig_old_addr;
 }
 
+/* Assumes folio lock is held. */
+static bool __relocate_large_folio(struct pagetable_move_control *pmc,
+		unsigned long old_addr, unsigned long new_addr,
+		struct folio *folio, bool undo)
+{
+	pgoff_t new_index;
+	struct vm_area_struct *old = pmc->old;
+	struct vm_area_struct *new = pmc->new;
+
+	VM_WARN_ON_ONCE(!folio_test_locked(folio));
+
+	/* no-op. */
+	if (!folio_test_anon(folio))
+		return true;
+
+	if (!undo)
+		new_index = linear_page_index(new, new_addr);
+	else
+		new_index = linear_page_index(old, old_addr);
+
+	/* See comment in relocate_anon_pte(). */
+	folio_move_anon_rmap(folio, undo ? old : new);
+	WRITE_ONCE(folio->index, new_index);
+	return true;
+}
+
+static bool relocate_large_folio(struct pagetable_move_control *pmc,
+		unsigned long old_addr, unsigned long new_addr,
+		struct folio *folio, bool undo)
+{
+	bool ret;
+
+	folio_lock(folio);
+
+	if (!folio_test_large(folio) || folio_test_ksm(folio)) {
+		ret = false;
+		goto out;
+	}
+
+	/* See relocate_anon_pte() for description. */
+	if (WARN_ON_ONCE(folio_maybe_mapped_shared(folio))) {
+		ret = false;
+		goto out;
+	}
+	if (folio_maybe_dma_pinned(folio)) {
+		ret = false;
+		goto out;
+	}
+
+	ret = __relocate_large_folio(pmc, old_addr, new_addr, folio, undo);
+
+out:
+	folio_unlock(folio);
+	return ret;
+}
+
+static bool relocate_anon_pud(struct pagetable_move_control *pmc,
+		pud_t *pudp, bool undo)
+{
+	spinlock_t *ptl;
+	pud_t pud;
+	struct folio *folio;
+	struct page *page;
+	bool ret;
+	unsigned long old_addr = pmc->old_addr;
+	unsigned long new_addr = pmc->new_addr;
+
+	VM_WARN_ON(old_addr & ~HPAGE_PUD_MASK);
+	VM_WARN_ON(new_addr & ~HPAGE_PUD_MASK);
+
+	ptl = pud_trans_huge_lock(pudp, pmc->old);
+	if (!ptl)
+		return false;
+
+	pud = pudp_get(pudp);
+	if (!pud_present(pud)) {
+		ret = true;
+		goto out;
+	}
+	if (!pud_leaf(pud)) {
+		ret = false;
+		goto out;
+	}
+
+	page = pud_page(pud);
+	if (!page) {
+		ret = true;
+		goto out;
+	}
+
+	folio = page_folio(page);
+	ret = relocate_large_folio(pmc, old_addr, new_addr, folio, undo);
+
+out:
+	spin_unlock(ptl);
+	return ret;
+}
+
+static bool relocate_anon_pmd(struct pagetable_move_control *pmc,
+		pmd_t *pmdp, bool undo)
+{
+	spinlock_t *ptl;
+	pmd_t pmd;
+	struct folio *folio;
+	bool ret;
+	unsigned long old_addr = pmc->old_addr;
+	unsigned long new_addr = pmc->new_addr;
+
+	VM_WARN_ON(old_addr & ~HPAGE_PMD_MASK);
+	VM_WARN_ON(new_addr & ~HPAGE_PMD_MASK);
+
+	ptl = pmd_trans_huge_lock(pmdp, pmc->old);
+	if (!ptl)
+		return false;
+
+	pmd = pmdp_get(pmdp);
+	if (!pmd_present(pmd)) {
+		ret = true;
+		goto out;
+	}
+	if (is_huge_zero_pmd(pmd)) {
+		ret = true;
+		goto out;
+	}
+	if (!pmd_leaf(pmd)) {
+		ret = false;
+		goto out;
+	}
+
+	folio = pmd_folio(pmd);
+	if (!folio) {
+		ret = true;
+		goto out;
+	}
+
+	ret = relocate_large_folio(pmc, old_addr, new_addr, folio, undo);
+out:
+	spin_unlock(ptl);
+	return ret;
+}
+
+/*
+ * Is the THP discovered at old_addr fully spanned at both the old and new VMAs?
+ */
+static bool is_thp_fully_spanned(struct pagetable_move_control *pmc,
+				 unsigned long old_addr,
+				 size_t thp_size)
+{
+	unsigned long old_end = pmc->old_end;
+	unsigned long orig_old_addr = old_end - pmc->len_in;
+	unsigned long aligned_start = old_addr & ~(thp_size - 1);
+	unsigned long aligned_end = aligned_start + thp_size;
+
+	if (aligned_start < orig_old_addr || aligned_end > old_end)
+		return false;
+
+	return true;
+}
+
 /*
  * If the folio mapped at the specified pte entry can have its index and mapping
  * relocated, then do so.
@@ -813,10 +1000,12 @@ static unsigned long relocate_anon_pte(struct pagetable_move_control *pmc,
 	unsigned long ret = 1;
 	unsigned long old_addr = state->old_addr;
 	unsigned long new_addr = state->new_addr;
+	struct mm_struct *mm = current->mm;
 
 	old = pmc->old;
 	new = pmc->new;
 
+retry:
 	pte = ptep_get(state->ptep);
 
 	/* Ensure we have truly got an anon folio. */
@@ -853,13 +1042,55 @@ static unsigned long relocate_anon_pte(struct pagetable_move_control *pmc,
 	if (folio_maybe_dma_pinned(folio))
 		goto out;
 
-	/*
-	 * This should not happen as we explicitly disallow this, but check
-	 * anyway.
-	 */
+	/* If a split huge PMD, try to relocate all at once. */
 	if (folio_test_large(folio)) {
-		ret = 0;
-		goto out;
+		size_t size = folio_size(folio);
+
+		if (is_thp_fully_spanned(pmc, old_addr, size) &&
+		    __relocate_large_folio(pmc, old_addr, new_addr, folio, undo)) {
+			VM_WARN_ON_ONCE(old_addr & (size - 1));
+			ret = folio_nr_pages(folio);
+			goto out;
+		} else {
+			int err;
+			struct anon_vma *anon_vma = folio_anon_vma(folio);
+
+			/*
+			 * If the folio has the anon_vma whose lock we hold, we
+			 * have a problem, as split_folio() will attempt to lock
+			 * the already-locked anon_vma causing a deadlock. In
+			 * this case, bail out.
+			 */
+			if (anon_vma->root == pmc->relocate_locked->anon_vma->root) {
+				ret = 0;
+				goto out;
+			}
+
+			/* split_folio() expects elevated refcount. */
+			folio_get(folio);
+
+			/*
+			 * We must relinquish/reacquire the PTE lock over this
+			 * operation. We hold the folio lock and an increased
+			 * reference count, so there's no danger of the folio
+			 * disappearing beneath us.
+			 */
+			pte_unmap_unlock(state->ptep, state->ptl);
+			err = split_folio(folio);
+			state->ptep = pte_offset_map_lock(mm, state->pmdp,
+							  old_addr, &state->ptl);
+			folio_unlock(folio);
+			folio_put(folio);
+
+			if (err || !state->ptep)
+				return 0;
+
+			/*
+			 * If we split, we need to look up the folio again, so
+			 * simply retry the operation.
+			 */
+			goto retry;
+		}
 	}
 
 	if (!undo)
@@ -906,6 +1137,7 @@ static bool relocate_anon_ptes(struct pagetable_move_control *pmc,
 		.old_addr = pmc->old_addr,
 		.new_addr = pmc->new_addr,
 		.old_end = pmc->old_addr + extent,
+		.pmdp = pmdp,
 	};
 	pte_t *ptep_start;
 	bool ret;
@@ -955,29 +1187,64 @@ static bool __relocate_anon_folios(struct pagetable_move_control *pmc, bool undo
 		pmd_t pmd;
 		pud_t pud;
 
-		extent = get_extent(NORMAL_PUD, pmc);
+		extent = get_old_extent(NORMAL_PUD, pmc);
 
 		pudp = get_old_pud(mm, pmc->old_addr);
 		if (!pudp)
 			continue;
 		pud = pudp_get(pudp);
+		if (pud_trans_huge(pud)) {
+			unsigned long old_addr = pmc->old_addr;
+
+			if (extent != HPAGE_PUD_SIZE)
+				return false;
 
-		if (pud_trans_huge(pud) || pud_devmap(pud))
+			VM_WARN_ON_ONCE(old_addr & ~HPAGE_PUD_MASK);
+
+			/* We may relocate iff the new address is aligned. */
+			if (!(pmc->new_addr & ~HPAGE_PUD_MASK) &&
+			    is_thp_fully_spanned(pmc, old_addr, HPAGE_PUD_SIZE)) {
+				if (!relocate_anon_pud(pmc, pudp, undo))
+					return false;
+				continue;
+			}
+
+			/* Otherwise, we split so we can do this with PMDs/PTEs. */
+			split_huge_pud(pmc->old, pudp, old_addr);
+		} else if (pud_devmap(pud)) {
 			return false;
+		}
 
-		extent = get_extent(NORMAL_PMD, pmc);
+		extent = get_old_extent(NORMAL_PMD, pmc);
 		pmdp = get_old_pmd(mm, pmc->old_addr);
 		if (!pmdp)
 			continue;
 		pmd = pmdp_get(pmdp);
-
-		if (is_swap_pmd(pmd) || pmd_trans_huge(pmd) ||
-		    pmd_devmap(pmd))
-			return false;
-
 		if (pmd_none(pmd))
 			continue;
 
+		if (pmd_trans_huge(pmd)) {
+			unsigned long old_addr = pmc->old_addr;
+
+			if (extent != HPAGE_PMD_SIZE)
+				return false;
+
+			VM_WARN_ON_ONCE(old_addr & ~HPAGE_PMD_MASK);
+
+			/* We may relocate iff the new address is aligned. */
+			if (!(pmc->new_addr & ~HPAGE_PMD_MASK) &&
+			    is_thp_fully_spanned(pmc, old_addr, HPAGE_PMD_SIZE)) {
+				if (!relocate_anon_pmd(pmc, pmdp, undo))
+					return false;
+				continue;
+			}
+
+			/* Otherwise, we split so we can do this with PTEs. */
+			split_huge_pmd(pmc->old, pmdp, old_addr);
+		} else if (is_swap_pmd(pmd) || pmd_devmap(pmd)) {
+			return false;
+		}
+
 		if (!relocate_anon_ptes(pmc, extent, pmdp, undo))
 			return false;
 	}
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 04/11] tools UAPI: Update copy of linux/mman.h from the kernel sources
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
                   ` (2 preceding siblings ...)
  2025-06-09 13:26 ` [PATCH 03/11] mm/mremap: add MREMAP[_MUST]_RELOCATE_ANON support for large folios Lorenzo Stoakes
@ 2025-06-09 13:26 ` Lorenzo Stoakes
  2025-06-09 13:26 ` [PATCH 05/11] tools/testing/selftests: add sys_mremap() helper to vm_util.h Lorenzo Stoakes
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 13:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

Import the newly introduced MREMAP_RELOCATE_ANON and
MREMAP_MUST_RELOCATE_ANON defines.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/include/uapi/linux/mman.h | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
index e89d00528f2f..a61dbe1e8b2b 100644
--- a/tools/include/uapi/linux/mman.h
+++ b/tools/include/uapi/linux/mman.h
@@ -6,9 +6,11 @@
 #include <asm-generic/hugetlb_encode.h>
 #include <linux/types.h>
 
-#define MREMAP_MAYMOVE		1
-#define MREMAP_FIXED		2
-#define MREMAP_DONTUNMAP	4
+#define MREMAP_MAYMOVE			1
+#define MREMAP_FIXED			2
+#define MREMAP_DONTUNMAP		4
+#define MREMAP_RELOCATE_ANON		8
+#define MREMAP_MUST_RELOCATE_ANON	16
 
 #define OVERCOMMIT_GUESS		0
 #define OVERCOMMIT_ALWAYS		1
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 05/11] tools/testing/selftests: add sys_mremap() helper to vm_util.h
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
                   ` (3 preceding siblings ...)
  2025-06-09 13:26 ` [PATCH 04/11] tools UAPI: Update copy of linux/mman.h from the kernel sources Lorenzo Stoakes
@ 2025-06-09 13:26 ` Lorenzo Stoakes
  2025-06-09 13:26 ` [PATCH 06/11] tools/testing/selftests: add mremap() cases that merge normally Lorenzo Stoakes
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 13:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

Add a helper to invoke the mremap() system call directly using
syscall(). This is useful as otherwise glibc and friends will filter out
newer flags like MREMAP_RELOCATE_ANON and MREMAP_MUST_RELOCATE_ANON,
making it impossible to test this functionality.
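
As a usage illustration only (a minimal sketch, not part of this patch; it
assumes the selftest includes the updated linux/mman.h and vm_util.h, and
the helper name move_must_relocate() is hypothetical):

	#include <linux/mman.h>
	#include "vm_util.h"

	/* Move 'len' bytes at 'src' to 'dst', requiring folios be relocated. */
	static void *move_must_relocate(void *src, void *dst, size_t len)
	{
		return sys_mremap(src, len, len,
				  MREMAP_MAYMOVE | MREMAP_FIXED |
				  MREMAP_MUST_RELOCATE_ANON, dst);
	}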

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/selftests/mm/vm_util.c | 8 ++++++++
 tools/testing/selftests/mm/vm_util.h | 3 +++
 2 files changed, 11 insertions(+)

diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
index 5492e3f784df..1d434772fa54 100644
--- a/tools/testing/selftests/mm/vm_util.c
+++ b/tools/testing/selftests/mm/vm_util.c
@@ -524,3 +524,11 @@ int read_sysfs(const char *file_path, unsigned long *val)
 
 	return 0;
 }
+
+void *sys_mremap(void *old_address, unsigned long old_size,
+		 unsigned long new_size, int flags, void *new_address)
+{
+	return (void *)syscall(__NR_mremap, (unsigned long)old_address,
+			       old_size, new_size, flags,
+			       (unsigned long)new_address);
+}
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index b8136d12a0f8..797c24215b17 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -117,6 +117,9 @@ static inline void log_test_result(int result)
 	ksft_test_result_report(result, "%s\n", test_name);
 }
 
+void *sys_mremap(void *old_address, unsigned long old_size,
+		 unsigned long new_size, int flags, void *new_address);
+
 /*
  * On ppc64 this will only work with radix 2M hugepage size
  */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 06/11] tools/testing/selftests: add mremap() cases that merge normally
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
                   ` (4 preceding siblings ...)
  2025-06-09 13:26 ` [PATCH 05/11] tools/testing/selftests: add sys_mremap() helper to vm_util.h Lorenzo Stoakes
@ 2025-06-09 13:26 ` Lorenzo Stoakes
  2025-06-09 13:26 ` [PATCH 07/11] tools/testing/selftests: add MREMAP_RELOCATE_ANON merge test cases Lorenzo Stoakes
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 13:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

Use a direct system call version of mremap(), as the glibc wrapper will
disallow the MREMAP_[MUST_]RELOCATE_ANON flags when we move to using them.

Also import linux/mman.h (which will resolve to the local tools copy of
mman.h) to ensure these header values are available once they are added.

Then, add tests asserting all the mremap() merge cases that function
correctly without MREMAP_[MUST_]RELOCATE_ANON.

This covers moving unfaulted VMAs around, and moving faulted VMAs back
into position immediately adjacent to VMAs that were faulted in together
with the moved VMA.

By doing so we provide a baseline set of expectations for mremap()
operations and VMA merging, which we can expand upon for the
MREMAP_[MUST_]RELOCATE_ANON cases in a subsequent commit.
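
The general shape of each case is as follows (a condensed sketch of the
pattern the tests below follow - 'base' and 'elsewhere' stand in for
suitable MAP_FIXED addresses within the carveout, and the fixture's
page_size and procmap fields are assumed):

	/* Map two regions; fault in the first, leave the second unfaulted. */
	ptr = mmap(base, 5 * page_size, PROT_READ | PROT_WRITE,
		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
	ptr2 = mmap(elsewhere, 5 * page_size, PROT_READ | PROT_WRITE,
		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
	ptr[0] = 'x';

	/* Move the unfaulted region immediately after the faulted one. */
	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]);
	ASSERT_NE(ptr2, MAP_FAILED);

	/* A single VMA spanning both regions means the merge occurred. */
	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size);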

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/selftests/mm/merge.c | 599 ++++++++++++++++++++++++++++-
 1 file changed, 597 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/mm/merge.c b/tools/testing/selftests/mm/merge.c
index 0ae77dae4737..b5c183403fe7 100644
--- a/tools/testing/selftests/mm/merge.c
+++ b/tools/testing/selftests/mm/merge.c
@@ -13,6 +13,7 @@
 #include <sys/wait.h>
 #include <linux/perf_event.h>
 #include "vm_util.h"
+#include <linux/mman.h>
 
 FIXTURE(merge)
 {
@@ -25,7 +26,7 @@ FIXTURE_SETUP(merge)
 {
 	self->page_size = psize();
 	/* Carve out PROT_NONE region to map over. */
-	self->carveout = mmap(NULL, 12 * self->page_size, PROT_NONE,
+	self->carveout = mmap(NULL, 30 * self->page_size, PROT_NONE,
 			      MAP_ANON | MAP_PRIVATE, -1, 0);
 	ASSERT_NE(self->carveout, MAP_FAILED);
 	/* Setup PROCMAP_QUERY interface. */
@@ -34,7 +35,7 @@ FIXTURE_SETUP(merge)
 
 FIXTURE_TEARDOWN(merge)
 {
-	ASSERT_EQ(munmap(self->carveout, 12 * self->page_size), 0);
+	ASSERT_EQ(munmap(self->carveout, 30 * self->page_size), 0);
 	ASSERT_EQ(close_procmap(&self->procmap), 0);
 	/*
 	 * Clear unconditionally, as some tests set this. It is no issue if this
@@ -573,4 +574,598 @@ TEST_F(merge, ksm_merge)
 	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 2 * page_size);
 }
 
+TEST_F(merge, mremap_unfaulted_to_faulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2;
+
+	/*
+	 * Map two distinct areas:
+	 *
+	 * |-----------|  |-----------|
+	 * | unfaulted |  | unfaulted |
+	 * |-----------|  |-----------|
+	 *      ptr            ptr2
+	 */
+	ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/* Offset ptr2 further away. */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Fault in ptr:
+	 *                \
+	 * |-----------|  /  |-----------|
+	 * |  faulted  |  \  | unfaulted |
+	 * |-----------|  /  |-----------|
+	 *      ptr       \       ptr2
+	 */
+	ptr[0] = 'x';
+
+	/*
+	 * Now move ptr2 adjacent to ptr:
+	 *
+	 * |-----------|-----------|
+	 * |  faulted  | unfaulted |
+	 * |-----------|-----------|
+	 *      ptr         ptr2
+	 *
+	 * It should merge:
+	 *
+	 * |----------------------|
+	 * |       faulted        |
+	 * |----------------------|
+	 *            ptr
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size);
+}
+
+TEST_F(merge, mremap_unfaulted_behind_faulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2;
+
+	/*
+	 * Map two distinct areas:
+	 *
+	 * |-----------|  |-----------|
+	 * | unfaulted |  | unfaulted |
+	 * |-----------|  |-----------|
+	 *      ptr            ptr2
+	 */
+	ptr = mmap(&carveout[6 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[14 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/* Offset ptr2 further away. */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Fault in ptr:
+	 *                \
+	 * |-----------|  /  |-----------|
+	 * |  faulted  |  \  | unfaulted |
+	 * |-----------|  /  |-----------|
+	 *      ptr       \       ptr2
+	 */
+	ptr[0] = 'x';
+
+	/*
+	 * Now move ptr2 adjacent, but behind, ptr:
+	 *
+	 * |-----------|-----------|
+	 * | unfaulted |  faulted  |
+	 * |-----------|-----------|
+	 *      ptr2        ptr
+	 *
+	 * It should merge:
+	 *
+	 * |----------------------|
+	 * |       faulted        |
+	 * |----------------------|
+	 *            ptr2
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &carveout[page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr2));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr2);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr2 + 10 * page_size);
+}
+
+TEST_F(merge, mremap_unfaulted_between_faulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2, *ptr3;
+
+	/*
+	 * Map three distinct areas:
+	 *
+	 * |-----------|  |-----------|  |-----------|
+	 * | unfaulted |  | unfaulted |  | unfaulted |
+	 * |-----------|  |-----------|  |-----------|
+	 *      ptr            ptr2           ptr3
+	 */
+	ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+	ptr3 = mmap(&carveout[14 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/* Offset ptr3 further away. */
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 2000);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/* Offset ptr2 further away. */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Fault in ptr, ptr3:
+	 *                \                 \
+	 * |-----------|  /  |-----------|  /  |-----------|
+	 * |  faulted  |  \  | unfaulted |  \  |  faulted  |
+	 * |-----------|  /  |-----------|  /  |-----------|
+	 *      ptr       \       ptr2      \       ptr3
+	 */
+	ptr[0] = 'x';
+	ptr3[0] = 'x';
+
+	/*
+	 * Move ptr3 back into place, leaving a place for ptr2:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * |  faulted  |           |  faulted  |  \  | unfaulted |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                     ptr3      \       ptr2
+	 */
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[10 * page_size]);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/*
+	 * Finally, move ptr2 into place:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |  faulted  | unfaulted |  faulted  |
+	 * |-----------|-----------|-----------|
+	 *      ptr        ptr2         ptr3
+	 *
+	 * It should merge, but only ptr, ptr2:
+	 *
+	 * |-----------------------|-----------|
+	 * |        faulted        | unfaulted |
+	 * |-----------------------|-----------|
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr3));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr3);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr3 + 5 * page_size);
+}
+
+TEST_F(merge, mremap_unfaulted_between_faulted_unfaulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2, *ptr3;
+
+	/*
+	 * Map three distinct areas:
+	 *
+	 * |-----------|  |-----------|  |-----------|
+	 * | unfaulted |  | unfaulted |  | unfaulted |
+	 * |-----------|  |-----------|  |-----------|
+	 *      ptr            ptr2           ptr3
+	 */
+	ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+	ptr3 = mmap(&carveout[14 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/* Offset ptr3 further away. */
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 2000);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+
+	/* Offset ptr2 further away. */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Fault in ptr:
+	 *                \                 \
+	 * |-----------|  /  |-----------|  /  |-----------|
+	 * |  faulted  |  \  | unfaulted |  \  | unfaulted |
+	 * |-----------|  /  |-----------|  /  |-----------|
+	 *      ptr       \       ptr2      \       ptr3
+	 */
+	ptr[0] = 'x';
+
+	/*
+	 * Move ptr3 back into place, leaving a place for ptr2:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * |  faulted  |           | unfaulted |  \  | unfaulted |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                     ptr3      \       ptr2
+	 */
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[10 * page_size]);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/*
+	 * Finally, move ptr2 into place:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |  faulted  | unfaulted | unfaulted |
+	 * |-----------|-----------|-----------|
+	 *      ptr        ptr2         ptr3
+	 *
+	 * It should merge:
+	 *
+	 * |-----------------------------------|
+	 * |              faulted              |
+	 * |-----------------------------------|
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size);
+}
+
+TEST_F(merge, mremap_unfaulted_between_correctly_placed_faulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2;
+
+	/*
+	 * Map one larger area:
+	 *
+	 * |-----------------------------------|
+	 * |            unfaulted              |
+	 * |-----------------------------------|
+	 */
+	ptr = mmap(&carveout[page_size], 15 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/*
+	 * Fault in ptr:
+	 *
+	 * |-----------------------------------|
+	 * |              faulted              |
+	 * |-----------------------------------|
+	 */
+	ptr[0] = 'x';
+
+	/*
+	 * Unmap middle:
+	 *
+	 * |-----------|           |-----------|
+	 * |  faulted  |           |  faulted  |
+	 * |-----------|           |-----------|
+	 *
+	 * Now the faulted areas are compatible with each other (anon_vma the
+	 * same, vma->vm_pgoff equal to virtual page offset).
+	 */
+	ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0);
+
+	/*
+	 * Map a new area, ptr2:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * |  faulted  |           |  faulted  |  \  | unfaulted |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                               \       ptr2
+	 */
+	ptr2 = mmap(&carveout[20 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Finally, move ptr2 into place:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |  faulted  | unfaulted |  faulted  |
+	 * |-----------|-----------|-----------|
+	 *      ptr        ptr2
+	 *
+	 * It should merge:
+	 *
+	 * |-----------------------------------|
+	 * |              faulted              |
+	 * |-----------------------------------|
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size);
+}
+
+TEST_F(merge, mremap_correct_placed_faulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2, *ptr3;
+
+	/*
+	 * Map one larger area:
+	 *
+	 * |-----------------------------------|
+	 * |            unfaulted              |
+	 * |-----------------------------------|
+	 */
+	ptr = mmap(&carveout[page_size], 15 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/*
+	 * Fault in ptr:
+	 *
+	 * |-----------------------------------|
+	 * |              faulted              |
+	 * |-----------------------------------|
+	 */
+	ptr[0] = 'x';
+
+	/*
+	 * Offset the final and middle 5 pages further away:
+	 *                \                 \
+	 * |-----------|  /  |-----------|  /  |-----------|
+	 * |  faulted  |  \  |  faulted  |  \  |  faulted  |
+	 * |-----------|  /  |-----------|  /  |-----------|
+	 *      ptr       \       ptr2      \       ptr3
+	 */
+	ptr3 = &ptr[10 * page_size];
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 2000);
+	ASSERT_NE(ptr3, MAP_FAILED);
+	ptr2 = &ptr[5 * page_size];
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Move ptr2 into its correct place:
+	 *                            \
+	 * |-----------|-----------|  /  |-----------|
+	 * |  faulted  |  faulted  |  \  |  faulted  |
+	 * |-----------|-----------|  /  |-----------|
+	 *      ptr         ptr2      \       ptr3
+	 *
+	 * It should merge:
+	 *                            \
+	 * |-----------------------|  /  |-----------|
+	 * |        faulted        |  \  |  faulted  |
+	 * |-----------------------|  /  |-----------|
+	 *            ptr             \       ptr3
+	 */
+
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size);
+
+	/*
+	 * Now move ptr out of place:
+	 *                            \                 \
+	 *             |-----------|  /  |-----------|  /  |-----------|
+	 *             |  faulted  |  \  |  faulted  |  \  |  faulted  |
+	 *             |-----------|  /  |-----------|  /  |-----------|
+	 *                  ptr2      \       ptr       \       ptr3
+	 */
+	ptr = sys_mremap(ptr, 5 * page_size, 5 * page_size,
+			 MREMAP_MAYMOVE | MREMAP_FIXED, ptr + page_size * 1000);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/*
+	 * Now move ptr back into place:
+	 *                            \
+	 * |-----------|-----------|  /  |-----------|
+	 * |  faulted  |  faulted  |  \  |  faulted  |
+	 * |-----------|-----------|  /  |-----------|
+	 *      ptr         ptr2      \       ptr3
+	 *
+	 * It should merge:
+	 *                            \
+	 * |-----------------------|  /  |-----------|
+	 * |        faulted        |  \  |  faulted  |
+	 * |-----------------------|  /  |-----------|
+	 *            ptr             \       ptr3
+	 */
+	ptr = sys_mremap(ptr, 5 * page_size, 5 * page_size,
+			 MREMAP_MAYMOVE | MREMAP_FIXED, &carveout[page_size]);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size);
+
+	/*
+	 * Now move ptr out of place again:
+	 *                            \                 \
+	 *             |-----------|  /  |-----------|  /  |-----------|
+	 *             |  faulted  |  \  |  faulted  |  \  |  faulted  |
+	 *             |-----------|  /  |-----------|  /  |-----------|
+	 *                  ptr2      \       ptr       \       ptr3
+	 */
+	ptr = sys_mremap(ptr, 5 * page_size, 5 * page_size,
+			 MREMAP_MAYMOVE | MREMAP_FIXED, ptr + page_size * 1000);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/*
+	 * Now move ptr3 back into place:
+	 *                                        \
+	 *             |-----------|-----------|  /  |-----------|
+	 *             |  faulted  |  faulted  |  \  |  faulted  |
+	 *             |-----------|-----------|  /  |-----------|
+	 *                  ptr2        ptr3      \       ptr
+	 *
+	 * It should merge:
+	 *                                        \
+	 *             |-----------------------|  /  |-----------|
+	 *             |        faulted        |  \  |  faulted  |
+	 *             |-----------------------|  /  |-----------|
+	 *                        ptr2            \       ptr
+	 */
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr2[5 * page_size]);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr2));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr2);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr2 + 10 * page_size);
+
+	/*
+	 * Now move ptr back into place:
+	 *
+	 * |-----------|-----------------------|
+	 * |  faulted  |        faulted        |
+	 * |-----------|-----------------------|
+	 *      ptr               ptr2
+	 *
+	 * It should merge:
+	 *
+	 * |-----------------------------------|
+	 * |              faulted              |
+	 * |-----------------------------------|
+	 *                  ptr
+	 */
+	ptr = sys_mremap(ptr, 5 * page_size, 5 * page_size,
+			 MREMAP_MAYMOVE | MREMAP_FIXED, &carveout[page_size]);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size);
+
+	/*
+	 * Now move ptr2 out of the way:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * |  faulted  |           |  faulted  |  \  |  faulted  |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                     ptr3      \       ptr2
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Now move it back:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |  faulted  |  faulted  |  faulted  |
+	 * |-----------|-----------|-----------|
+	 *      ptr         ptr2        ptr3
+	 *
+	 * It should merge:
+	 *
+	 * |-----------------------------------|
+	 * |              faulted              |
+	 * |-----------------------------------|
+	 *                  ptr
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size);
+
+	/*
+	 * Move ptr3 out of place:
+	 *                                        \
+	 * |-----------------------|              /  |-----------|
+	 * |        faulted        |              \  |  faulted  |
+	 * |-----------------------|              /  |-----------|
+	 *            ptr                         \       ptr3
+	 */
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 1000);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/*
+	 * Now move it back:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |  faulted  |  faulted  |  faulted  |
+	 * |-----------|-----------|-----------|
+	 *      ptr         ptr2        ptr3
+	 *
+	 * It should merge:
+	 *
+	 * |-----------------------------------|
+	 * |              faulted              |
+	 * |-----------------------------------|
+	 *                  ptr
+	 */
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[10 * page_size]);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size);
+}
+
 TEST_HARNESS_MAIN
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 07/11] tools/testing/selftests: add MREMAP_RELOCATE_ANON merge test cases
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
                   ` (5 preceding siblings ...)
  2025-06-09 13:26 ` [PATCH 06/11] tools/testing/selftests: add mremap() cases that merge normally Lorenzo Stoakes
@ 2025-06-09 13:26 ` Lorenzo Stoakes
  2025-06-09 13:26 ` [PATCH 08/11] tools/testing/selftests: expand mremap() tests for MREMAP_RELOCATE_ANON Lorenzo Stoakes
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 13:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

Add test cases to the mm self tests asserting that the newly introduced
MREMAP[_MUST]_RELOCATE_ANON flags result in merges occurring as expected -
merges which would otherwise not succeed.

This extends the newly introduced VMA merge self tests to cover these
cases, exhaustively attempting each merge case and asserting the expected
behaviour.

We use the MREMAP_MUST_RELOCATE_ANON variant to ensure that, should the
anon relocation fail, we observe an error, as quietly demoting the move to
an ordinary, non-relocating one would cause confusing test failures.

We carefully document each case to make clear what we are testing.
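
To illustrate the distinction (a sketch, not taken verbatim from the tests
below, reusing the ptr/ptr2/page_size conventions of the existing tests):

	/* Fails with MAP_FAILED if the folios cannot be relocated... */
	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
			  MREMAP_MAYMOVE | MREMAP_FIXED |
			  MREMAP_MUST_RELOCATE_ANON, &ptr[5 * page_size]);
	ASSERT_NE(ptr2, MAP_FAILED);

	/*
	 * ...whereas passing MREMAP_RELOCATE_ANON instead would succeed
	 * either way, silently degrading to an ordinary non-relocating move
	 * and masking the failure these tests are intended to detect.
	 */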

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/selftests/mm/merge.c | 730 +++++++++++++++++++++++++++++
 1 file changed, 730 insertions(+)

diff --git a/tools/testing/selftests/mm/merge.c b/tools/testing/selftests/mm/merge.c
index b5c183403fe7..b658f2f3a94b 100644
--- a/tools/testing/selftests/mm/merge.c
+++ b/tools/testing/selftests/mm/merge.c
@@ -1168,4 +1168,734 @@ TEST_F(merge, mremap_correct_placed_faulted)
 	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size);
 }
 
+TEST_F(merge, mremap_relocate_anon_faulted_after_unfaulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2;
+
+	/*
+	 * Map two distinct areas:
+	 *
+	 * |-----------|  |-----------|
+	 * | unfaulted |  | unfaulted |
+	 * |-----------|  |-----------|
+	 *      ptr            ptr2
+	 */
+	ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Offset ptr2 further away. Note we don't have to use
+	 * MREMAP_RELOCATE_ANON yet.
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Fault ptr2 in:
+	 *                \
+	 * |-----------|  /  |-----------|
+	 * | unfaulted |  \  |  faulted  |
+	 * |-----------|  /  |-----------|
+	 *      ptr       \       ptr2
+	 */
+	ptr2[0] = 'x';
+
+	/*
+	 * Move ptr2 after ptr, using MREMAP_MUST_RELOCATE_ANON:
+	 *
+	 * |-----------|-----------|
+	 * | unfaulted |  faulted  |
+	 * |-----------|-----------|
+	 *      ptr         ptr2
+	 *
+	 * It should merge:
+	 *
+	 * |-----------------------|
+	 * |        faulted        |
+	 * |-----------------------|
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+			  &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size);
+}
+
+TEST_F(merge, mremap_relocate_anon_faulted_before_unfaulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2;
+
+	/*
+	 * Map two distinct areas:
+	 *
+	 * |-----------|  |-----------|
+	 * | unfaulted |  | unfaulted |
+	 * |-----------|  |-----------|
+	 *      ptr            ptr2
+	 */
+	ptr = mmap(&carveout[6 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[12 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Offset ptr2 further away. Note we don't have to use
+	 * MREMAP_RELOCATE_ANON yet.
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Fault ptr2 in:
+	 *                \
+	 * |-----------|  /  |-----------|
+	 * | unfaulted |  \  |  faulted  |
+	 * |-----------|  /  |-----------|
+	 *      ptr       \       ptr2
+	 */
+	ptr2[0] = 'x';
+
+	/*
+	 * Move ptr2 before ptr, using MREMAP_MUST_RELOCATE_ANON:
+	 *
+	 * |-----------|-----------|
+	 * |  faulted  | unfaulted |
+	 * |-----------|-----------|
+	 *      ptr2        ptr
+	 *
+	 * It should merge:
+	 *
+	 * |-----------------------|
+	 * |        faulted        |
+	 * |-----------------------|
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+			  &carveout[page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr2));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr2);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr2 + 10 * page_size);
+}
+
+TEST_F(merge, mremap_relocate_anon_faulted_between_unfaulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2, *ptr3;
+
+	/*
+	 * Map three distinct areas:
+	 *
+	 * |-----------|  |-----------|  |-----------|
+	 * | unfaulted |  | unfaulted |  | unfaulted |
+	 * |-----------|  |-----------|  |-----------|
+	 *      ptr            ptr2           ptr3
+	 */
+	ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+	ptr3 = mmap(&carveout[14 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/*
+	 * Offset ptr2 further away, and move ptr3 into position:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * | unfaulted |           | unfaulted |  \  | unfaulted |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                    ptr3       \      ptr2
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 2000);
+	ASSERT_NE(ptr3, MAP_FAILED);
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[10 * page_size]);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/*
+	 * Fault in ptr2:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * | unfaulted |           | unfaulted |  \  |  faulted  |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                    ptr3       \      ptr2
+	 */
+	ptr2[0] = 'x';
+
+	/*
+	 * Move ptr2 between ptr, ptr3, using MREMAP_MUST_RELOCATE_ANON:
+	 *
+	 * |-----------|-----------|-----------|
+	 * | unfaulted |  faulted  | unfaulted |
+	 * |-----------|-----------|-----------|
+	 *
+	 * It should merge:
+	 *
+	 * |-----------------------------------|
+	 * |              faulted              |
+	 * |-----------------------------------|
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+			  &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size);
+}
+
+TEST_F(merge, mremap_relocate_anon_faulted_after_faulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2;
+
+	/*
+	 * Map two distinct areas:
+	 *
+	 * |-----------|  |-----------|
+	 * | unfaulted |  | unfaulted |
+	 * |-----------|  |-----------|
+	 *      ptr            ptr2
+	 */
+	ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Offset ptr2 further away. Note we don't have to use
+	 * MREMAP_RELOCATE_ANON yet.
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Fault ptr and ptr2 in:
+	 *                \
+	 * |-----------|  /  |-----------|
+	 * |  faulted  |  \  |  faulted  |
+	 * |-----------|  /  |-----------|
+	 *      ptr       \       ptr2
+	 */
+	ptr[0] = 'x';
+	ptr2[0] = 'x';
+
+	/*
+	 * Move ptr2 after ptr, using MREMAP_MUST_RELOCATE_ANON:
+	 *
+	 * |-----------|-----------|
+	 * |  faulted  |  faulted  |
+	 * |-----------|-----------|
+	 *      ptr         ptr2
+	 *
+	 * It should merge:
+	 *
+	 * |-----------------------|
+	 * |        faulted        |
+	 * |-----------------------|
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+			  &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size);
+}
+
+TEST_F(merge, mremap_relocate_anon_faulted_before_faulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2;
+
+	/*
+	 * Map two distinct areas:
+	 *
+	 * |-----------|  |-----------|
+	 * | unfaulted |  | unfaulted |
+	 * |-----------|  |-----------|
+	 *      ptr            ptr2
+	 */
+	ptr = mmap(&carveout[6 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[12 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Offset ptr2 further away. Note we don't have to use
+	 * MREMAP_RELOCATE_ANON yet.
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Fault ptr, ptr2 in:
+	 *                \
+	 * |-----------|  /  |-----------|
+	 * |  faulted  |  \  |  faulted  |
+	 * |-----------|  /  |-----------|
+	 *      ptr       \       ptr2
+	 */
+	ptr[0] = 'x';
+	ptr2[0] = 'x';
+
+	/*
+	 * Move ptr2 before ptr, using MREMAP_MUST_RELOCATE_ANON:
+	 *
+	 * |-----------|-----------|
+	 * |  faulted  |  faulted  |
+	 * |-----------|-----------|
+	 *      ptr2        ptr
+	 *
+	 * It should merge:
+	 *
+	 * |-----------------------|
+	 * |        faulted        |
+	 * |-----------------------|
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+			  &carveout[page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr2));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr2);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr2 + 10 * page_size);
+}
+
+TEST_F(merge, mremap_relocate_anon_faulted_between_faulted_unfaulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2, *ptr3;
+
+	/*
+	 * Map three distinct areas:
+	 *
+	 * |-----------|  |-----------|  |-----------|
+	 * | unfaulted |  | unfaulted |  | unfaulted |
+	 * |-----------|  |-----------|  |-----------|
+	 *      ptr            ptr2           ptr3
+	 */
+	ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+	ptr3 = mmap(&carveout[14 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/*
+	 * Offset ptr2 further away, and move ptr3 into position:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * | unfaulted |           | unfaulted |  \  | unfaulted |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                    ptr3       \      ptr2
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 2000);
+	ASSERT_NE(ptr3, MAP_FAILED);
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[10 * page_size]);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/*
+	 * Fault in ptr, ptr2:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * |  faulted  |           | unfaulted |  \  |  faulted  |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                    ptr3       \      ptr2
+	 */
+	ptr[0] = 'x';
+	ptr2[0] = 'x';
+
+	/*
+	 * Move ptr2 between ptr, ptr3, using MREMAP_MUST_RELOCATE_ANON:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |  faulted  |  faulted  | unfaulted |
+	 * |-----------|-----------|-----------|
+	 *
+	 * It should merge:
+	 *
+	 * |-----------------------------------|
+	 * |              faulted              |
+	 * |-----------------------------------|
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+			  &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size);
+}
+
+TEST_F(merge, mremap_relocate_anon_faulted_between_unfaulted_faulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2, *ptr3;
+
+	/*
+	 * Map three distinct areas:
+	 *
+	 * |-----------|  |-----------|  |-----------|
+	 * | unfaulted |  | unfaulted |  | unfaulted |
+	 * |-----------|  |-----------|  |-----------|
+	 *      ptr            ptr2           ptr3
+	 */
+	ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+	ptr3 = mmap(&carveout[14 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/*
+	 * Offset ptr2 further away, and move ptr3 into position:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * | unfaulted |           | unfaulted |  \  | unfaulted |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                    ptr3       \      ptr2
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 2000);
+	ASSERT_NE(ptr3, MAP_FAILED);
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[10 * page_size]);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/*
+	 * Fault in ptr2, ptr3:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * | unfaulted |           |  faulted  |  \  |  faulted  |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                    ptr3       \      ptr2
+	 */
+	ptr2[0] = 'x';
+	ptr3[0] = 'x';
+
+	/*
+	 * Move ptr2 between ptr, ptr3, using MREMAP_MUST_RELOCATE_ANON:
+	 *
+	 * |-----------|-----------|-----------|
+	 * | unfaulted |  faulted  |  faulted  |
+	 * |-----------|-----------|-----------|
+	 *
+	 * It should merge:
+	 *
+	 * |-----------------------------------|
+	 * |              faulted              |
+	 * |-----------------------------------|
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+			  &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size);
+}
+
+TEST_F(merge, mremap_relocate_anon_faulted_between_faulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2, *ptr3;
+
+	/*
+	 * Map three distinct areas:
+	 *
+	 * |-----------|  |-----------|  |-----------|
+	 * | unfaulted |  | unfaulted |  | unfaulted |
+	 * |-----------|  |-----------|  |-----------|
+	 *      ptr            ptr2           ptr3
+	 */
+	ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+	ptr3 = mmap(&carveout[14 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/*
+	 * Offset ptr2 further away, and move ptr3 into position:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * | unfaulted |           | unfaulted |  \  | unfaulted |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                    ptr3       \      ptr2
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000);
+	ASSERT_NE(ptr2, MAP_FAILED);
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 2000);
+	ASSERT_NE(ptr3, MAP_FAILED);
+	ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[10 * page_size]);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/*
+	 * Fault in ptr, ptr2, ptr3:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * |  faulted  |           |  faulted  |  \  |  faulted  |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                    ptr3       \      ptr2
+	 */
+	ptr[0] = 'x';
+	ptr2[0] = 'x';
+	ptr3[0] = 'x';
+
+	/*
+	 * Move ptr2 between ptr, ptr3, using MREMAP_MUST_RELOCATE_ANON:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |  faulted  |  faulted  |  faulted  |
+	 * |-----------|-----------|-----------|
+	 *
+	 * It should merge, but only the latter two VMAs:
+	 *
+	 * |-----------|-----------------------|
+	 * |  faulted  |        faulted        |
+	 * |-----------|-----------------------|
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+			  &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr2));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr2);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr2 + 10 * page_size);
+}
+
+TEST_F(merge, mremap_relocate_anon_faulted_between_correctly_placed_faulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2;
+
+	/*
+	 * Map one larger area:
+	 *
+	 * |-----------------------------------|
+	 * |            unfaulted              |
+	 * |-----------------------------------|
+	 */
+	ptr = mmap(&carveout[page_size], 15 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/*
+	 * Fault in ptr:
+	 *
+	 * |-----------------------------------|
+	 * |              faulted              |
+	 * |-----------------------------------|
+	 */
+	ptr[0] = 'x';
+
+	/*
+	 * Unmap middle:
+	 *
+	 * |-----------|           |-----------|
+	 * |  faulted  |           |  faulted  |
+	 * |-----------|           |-----------|
+	 *
+	 * Now the faulted areas are compatible with each other (anon_vma the
+	 * same, vma->vm_pgoff equal to virtual page offset).
+	 */
+	ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0);
+
+	/*
+	 * Map a new area, ptr2:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * |  faulted  |           |  faulted  |  \  | unfaulted |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                               \       ptr2
+	 */
+	ptr2 = mmap(&carveout[20 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Fault it in:
+	 *                                        \
+	 * |-----------|           |-----------|  /  |-----------|
+	 * |  faulted  |           |  faulted  |  \  |  faulted  |
+	 * |-----------|           |-----------|  /  |-----------|
+	 *      ptr                               \       ptr2
+	 */
+	ptr2[0] = 'x';
+
+	/*
+	 * Finally, move ptr2 into place, using MREMAP_MUST_RELOCATE_ANON:
+	 *
+	 * |-----------|-----------|-----------|
+	 * |  faulted  |  faulted  |  faulted  |
+	 * |-----------|-----------|-----------|
+	 *      ptr        ptr2
+	 *
+	 * It should merge:
+	 *
+	 * |-----------------------------------|
+	 * |              faulted              |
+	 * |-----------------------------------|
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+			  &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size);
+}
+
+TEST_F(merge, mremap_relocate_anon_mprotect_faulted_faulted)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	struct procmap_fd *procmap = &self->procmap;
+	char *ptr, *ptr2;
+
+
+	/*
+	 * Map two distinct areas:
+	 *
+	 * |-----------|  |-----------|
+	 * | unfaulted |  | unfaulted |
+	 * |-----------|  |-----------|
+	 *      ptr            ptr2
+	 */
+	ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr2 = mmap(&carveout[12 * page_size], 5 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Fault in ptr, ptr2, mprotect() ptr2 read-only:
+	 *
+	 *      RW              RO
+	 * |-----------|  |-----------|
+	 * |  faulted  |  |  faulted  |
+	 * |-----------|  |-----------|
+	 *      ptr            ptr2
+	 */
+	ptr[0] = 'x';
+	ptr2[0] = 'x';
+	ASSERT_EQ(mprotect(ptr2, 5 * page_size, PROT_READ), 0);
+
+	/*
+	 * Move ptr2 next to ptr:
+	 *
+	 *      RW          RO
+	 * |-----------|-----------|
+	 * |  faulted  |  faulted  |
+	 * |-----------|-----------|
+	 *      ptr        ptr2
+	 */
+	ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+			  &ptr[5 * page_size]);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/* No merge should happen. */
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 5 * page_size);
+
+	/*
+	 * Now mprotect() ptr2 RW:
+	 *
+	 *      RW          RW
+	 * |-----------|-----------|
+	 * |  faulted  |  faulted  |
+	 * |-----------|-----------|
+	 *      ptr        ptr2
+	 *
+	 * This should result in a merge:
+	 *
+	 *            RW
+	 * |-----------------------|
+	 * |        faulted        |
+	 * |-----------------------|
+	 *            ptr
+	 */
+	ASSERT_EQ(mprotect(ptr2, 5 * page_size, PROT_READ | PROT_WRITE), 0);
+
+	ASSERT_TRUE(find_vma_procmap(procmap, ptr));
+	ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size);
+}
+
 TEST_HARNESS_MAIN
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 08/11] tools/testing/selftests: expand mremap() tests for MREMAP_RELOCATE_ANON
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
                   ` (6 preceding siblings ...)
  2025-06-09 13:26 ` [PATCH 07/11] tools/testing/selftests: add MREMAP_RELOCATE_ANON merge test cases Lorenzo Stoakes
@ 2025-06-09 13:26 ` Lorenzo Stoakes
  2025-06-09 13:26 ` [PATCH 09/11] tools/testing/selftests: have CoW self test use MREMAP_RELOCATE_ANON Lorenzo Stoakes
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 13:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

Adjust every relevant test (that is, each test which moves memory) to also
perform the same operation using MREMAP_MUST_RELOCATE_ANON, asserting that
it behaves as expected.

To avoid relying on glibc being up-to-date, also move to invoking the
mremap() system call directly, and include the linux/mman.h header - which
resolves to the tools' Linux header wrappers - in order to pick up the
latest mremap flag definitions.
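
Invoking the system call directly amounts to something like the following
sketch (raw_mremap is an illustrative name - the tests themselves use the
shared sys_mremap() helper):

  #include <sys/syscall.h>
  #include <unistd.h>

  static void *raw_mremap(void *old_address, size_t old_size,
                          size_t new_size, unsigned long flags,
                          void *new_address)
  {
          /* Returns MAP_FAILED ((void *)-1) with errno set on failure. */
          return (void *)syscall(SYS_mremap, old_address, old_size,
                                 new_size, flags, new_address);
  }

This bypasses the glibc wrapper entirely, so new flags such as
MREMAP_RELOCATE_ANON and MREMAP_MUST_RELOCATE_ANON can be passed even when
the C library knows nothing about them.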

Also take careful precautions in the case where the 'mremap move within
range' test might unexpectedly fail due to large folios mapped outside of
the range being relocated.

In this case, when testing with MREMAP_MUST_RELOCATE_ANON, we ensure the
folios in question are not huge. When testing with MREMAP_RELOCATE_ANON we
do not - this asserts that the operation correctly falls back to the
ordinary, non-relocating mremap() behaviour.

In cases where MREMAP_MUST_RELOCATE_ANON is used, we additionally attempt
to trigger reclaim immediately afterwards in order to assert that the
adjusted rmap state is not corrupted.
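
Roughly, the pattern is as follows (a sketch combining the two
precautions - src, target and region_size are placeholder names standing
in for each test's actual variables):

  /* Keep large folios out of the source region, so relocation is not
   * refused because a folio spans beyond the moved range. */
  madvise(src, region_size, MADV_NOHUGEPAGE);

  dest = sys_mremap(src, region_size, region_size,
                    MREMAP_MAYMOVE | MREMAP_FIXED |
                    MREMAP_MUST_RELOCATE_ANON, target);

  /* Immediately page out the moved range - if folio->index or anon_vma
   * linkage had been left inconsistent, the rmap walks performed by
   * reclaim would encounter the corruption. */
  if (dest != MAP_FAILED)
          madvise(dest, region_size, MADV_PAGEOUT);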

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/selftests/mm/mremap_test.c | 262 +++++++++++++++--------
 1 file changed, 168 insertions(+), 94 deletions(-)

diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
index bb84476a177f..5d6ff0d1da7d 100644
--- a/tools/testing/selftests/mm/mremap_test.c
+++ b/tools/testing/selftests/mm/mremap_test.c
@@ -8,11 +8,13 @@
 #include <stdlib.h>
 #include <stdio.h>
 #include <string.h>
+#include <linux/mman.h>
 #include <sys/mman.h>
 #include <time.h>
 #include <stdbool.h>
 
 #include "../kselftest.h"
+#include "vm_util.h"
 
 #define EXPECT_SUCCESS 0
 #define EXPECT_FAILURE 1
@@ -34,6 +36,7 @@ struct config {
 	unsigned long long dest_alignment;
 	unsigned long long region_size;
 	int overlapping;
+	bool use_relocate_anon;
 	unsigned int dest_preamble_size;
 };
 
@@ -60,7 +63,8 @@ enum {
 #define PTE page_size
 
 #define MAKE_TEST(source_align, destination_align, size,	\
-		  overlaps, should_fail, test_name)		\
+		  overlaps, use_relocate_anon, should_fail,	\
+		  test_name)					\
 (struct test){							\
 	.name = test_name,					\
 	.config = {						\
@@ -68,6 +72,7 @@ enum {
 		.dest_alignment = destination_align,		\
 		.region_size = size,				\
 		.overlapping = overlaps,			\
+		.use_relocate_anon = use_relocate_anon,		\
 	},							\
 	.expect_failure = should_fail				\
 }
@@ -184,6 +189,12 @@ static void *get_source_mapping(struct config c)
 	unsigned long long addr = 0ULL;
 	void *src_addr = NULL;
 	unsigned long long mmap_min_addr;
+	int mmap_flags = MAP_FIXED_NOREPLACE | MAP_ANONYMOUS;
+
+	if (c.use_relocate_anon)
+		mmap_flags |= MAP_PRIVATE;
+	else
+		mmap_flags |= MAP_SHARED;
 
 	mmap_min_addr = get_mmap_min_addr();
 	/*
@@ -198,8 +209,7 @@ static void *get_source_mapping(struct config c)
 		goto retry;
 
 	src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
-					MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
-					-1, 0);
+					mmap_flags, -1, 0);
 	if (src_addr == MAP_FAILED) {
 		if (errno == EPERM || errno == EEXIST)
 			goto retry;
@@ -251,7 +261,7 @@ static void mremap_expand_merge(FILE *maps_fp, unsigned long page_size)
 	}
 
 	munmap(start + page_size, page_size);
-	remap = mremap(start, page_size, 2 * page_size, 0);
+	remap = sys_mremap(start, page_size, 2 * page_size, 0, 0);
 	if (remap == MAP_FAILED) {
 		ksft_print_msg("mremap failed: %s\n", strerror(errno));
 		munmap(start, page_size);
@@ -292,7 +302,8 @@ static void mremap_expand_merge_offset(FILE *maps_fp, unsigned long page_size)
 
 	/* Unmap final page to ensure we have space to expand. */
 	munmap(start + 2 * page_size, page_size);
-	remap = mremap(start + page_size, page_size, 2 * page_size, 0);
+
+	remap = sys_mremap(start + page_size, page_size, 2 * page_size, 0, 0);
 	if (remap == MAP_FAILED) {
 		ksft_print_msg("mremap failed: %s\n", strerror(errno));
 		munmap(start, 2 * page_size);
@@ -324,20 +335,35 @@ static void mremap_expand_merge_offset(FILE *maps_fp, unsigned long page_size)
  *
  * |DDDDddddSSSSssss|
  */
-static void mremap_move_within_range(unsigned int pattern_seed, char *rand_addr)
+static void mremap_move_within_range(unsigned int pattern_seed, char *rand_addr,
+				     char *test_suffix, int extra_flags)
 {
 	char *test_name = "mremap mremap move within range";
 	void *src, *dest;
 	unsigned int i, success = 1;
-
 	size_t size = SIZE_MB(20);
 	void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
 			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	int mremap_flags = MREMAP_MAYMOVE | MREMAP_FIXED;
+
 	if (ptr == MAP_FAILED) {
 		perror("mmap");
 		success = 0;
 		goto out;
 	}
+
+	/*
+	 * If THP is enabled, we may end up spanning a range which has large
+	 * folios not enclosed within the mapping, which will disallow the
+	 * relocate.
+	 *
+	 * In this case, disallow huge pages in the range.
+	 */
+	if (extra_flags & MREMAP_MUST_RELOCATE_ANON)
+		madvise(ptr, size, MADV_NOHUGEPAGE);
+
+	mremap_flags |= extra_flags;
+
 	memset(ptr, 0, size);
 
 	src = ptr + SIZE_MB(6);
@@ -348,8 +374,8 @@ static void mremap_move_within_range(unsigned int pattern_seed, char *rand_addr)
 
 	dest = src - SIZE_MB(2);
 
-	void *new_ptr = mremap(src + SIZE_MB(1), SIZE_MB(1), SIZE_MB(1),
-						   MREMAP_MAYMOVE | MREMAP_FIXED, dest + SIZE_MB(1));
+	void *new_ptr = sys_mremap(src + SIZE_MB(1), SIZE_MB(1), SIZE_MB(1),
+				   mremap_flags, dest + SIZE_MB(1));
 	if (new_ptr == MAP_FAILED) {
 		perror("mremap");
 		success = 0;
@@ -375,9 +401,9 @@ static void mremap_move_within_range(unsigned int pattern_seed, char *rand_addr)
 		perror("munmap");
 
 	if (success)
-		ksft_test_result_pass("%s\n", test_name);
+		ksft_test_result_pass("%s%s\n", test_name, test_suffix);
 	else
-		ksft_test_result_fail("%s\n", test_name);
+		ksft_test_result_fail("%s%s\n", test_name, test_suffix);
 }
 
 /* Returns the time taken for the remap on success else returns -1. */
@@ -390,6 +416,10 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 	long long  start_ns, end_ns, align_mask, ret, offset;
 	unsigned long long threshold;
 	unsigned long num_chunks;
+	int mremap_flags = MREMAP_MAYMOVE | MREMAP_FIXED;
+
+	if (c.use_relocate_anon)
+		mremap_flags |= MREMAP_MUST_RELOCATE_ANON;
 
 	if (threshold_mb == VALIDATION_NO_THRESHOLD)
 		threshold = c.region_size;
@@ -431,10 +461,15 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 	}
 
 	if (c.dest_preamble_size) {
+		int mmap_flags = MAP_FIXED_NOREPLACE | MAP_ANONYMOUS;
+
+		if (c.use_relocate_anon)
+			mmap_flags |= MAP_PRIVATE;
+		else
+			mmap_flags |= MAP_SHARED;
+
 		dest_preamble_addr = mmap((void *) addr - c.dest_preamble_size, c.dest_preamble_size,
-					  PROT_READ | PROT_WRITE,
-					  MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
-							-1, 0);
+					  PROT_READ | PROT_WRITE, mmap_flags, -1, 0);
 		if (dest_preamble_addr == MAP_FAILED) {
 			ksft_print_msg("Failed to map dest preamble region: %s\n",
 					strerror(errno));
@@ -447,8 +482,8 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 	}
 
 	clock_gettime(CLOCK_MONOTONIC, &t_start);
-	dest_addr = mremap(src_addr, c.region_size, c.region_size,
-					  MREMAP_MAYMOVE|MREMAP_FIXED, (char *) addr);
+	dest_addr = sys_mremap(src_addr, c.region_size, c.region_size,
+			       mremap_flags, (char *) addr);
 	clock_gettime(CLOCK_MONOTONIC, &t_end);
 
 	if (dest_addr == MAP_FAILED) {
@@ -549,6 +584,10 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
  * subsequent tests. So we clean up mappings after each test.
  */
 clean_up_dest:
+	/* Trigger reclaim to assert that adjusted rmap state is valid. */
+	if (c.use_relocate_anon)
+		madvise(dest_addr, c.region_size, MADV_PAGEOUT);
+
 	munmap(dest_addr, c.region_size);
 clean_up_dest_preamble:
 	if (c.dest_preamble_size && dest_preamble_addr)
@@ -565,16 +604,19 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
  * down address landed on a mapping that maybe does not exist.
  */
 static void mremap_move_1mb_from_start(unsigned int pattern_seed,
-				       char *rand_addr)
+				       char *rand_addr, char *test_suffix,
+				       int extra_flags)
 {
 	char *test_name = "mremap move 1mb from start at 1MB+256KB aligned src";
 	void *src = NULL, *dest = NULL;
 	unsigned int i, success = 1;
-
+	int mremap_flags = MREMAP_MAYMOVE | MREMAP_FIXED;
 	/* Config to reuse get_source_mapping() to do an aligned mmap. */
 	struct config c = {
 		.src_alignment = SIZE_MB(1) + SIZE_KB(256),
-		.region_size = SIZE_MB(6)
+		.region_size = SIZE_MB(6),
+		.use_relocate_anon = extra_flags & (MREMAP_RELOCATE_ANON |
+						    MREMAP_MUST_RELOCATE_ANON),
 	};
 
 	src = get_source_mapping(c);
@@ -583,6 +625,12 @@ static void mremap_move_1mb_from_start(unsigned int pattern_seed,
 		goto out;
 	}
 
+	/* See comment in mremap_move_within_range(). */
+	if (extra_flags & MREMAP_MUST_RELOCATE_ANON)
+		madvise(src, c.region_size, MADV_NOHUGEPAGE);
+
+	mremap_flags |= extra_flags;
+
 	c.src_alignment = SIZE_MB(1) + SIZE_KB(256);
 	dest = get_source_mapping(c);
 	if (!dest) {
@@ -599,8 +647,8 @@ static void mremap_move_1mb_from_start(unsigned int pattern_seed,
 	 */
 	munmap(dest, SIZE_MB(1));
 
-	void *new_ptr = mremap(src + SIZE_MB(1), SIZE_MB(1), SIZE_MB(1),
-						   MREMAP_MAYMOVE | MREMAP_FIXED, dest + SIZE_MB(1));
+	void *new_ptr = sys_mremap(src + SIZE_MB(1), SIZE_MB(1), SIZE_MB(1),
+				   mremap_flags, dest + SIZE_MB(1));
 	if (new_ptr == MAP_FAILED) {
 		perror("mremap");
 		success = 0;
@@ -629,9 +677,10 @@ static void mremap_move_1mb_from_start(unsigned int pattern_seed,
 		perror("munmap dest");
 
 	if (success)
-		ksft_test_result_pass("%s\n", test_name);
+		ksft_test_result_pass("%s%s\n", test_name, test_suffix);
+
 	else
-		ksft_test_result_fail("%s\n", test_name);
+		ksft_test_result_fail("%s%s\n", test_name, test_suffix);
 }
 
 static void run_mremap_test_case(struct test test_case, int *failures,
@@ -640,13 +689,17 @@ static void run_mremap_test_case(struct test test_case, int *failures,
 {
 	long long remap_time = remap_region(test_case.config, threshold_mb,
 					    rand_addr);
+	char *relocate_anon_suffix = " [MREMAP_MUST_RELOCATE_ANON]";
+	struct config *c = &test_case.config;
 
 	if (remap_time < 0) {
 		if (test_case.expect_failure)
-			ksft_test_result_xfail("%s\n\tExpected mremap failure\n",
-					      test_case.name);
+			ksft_test_result_xfail("%s%s\n\tExpected mremap failure\n",
+					       test_case.name,
+					       c->use_relocate_anon ? relocate_anon_suffix : "");
 		else {
-			ksft_test_result_fail("%s\n", test_case.name);
+			ksft_test_result_fail("%s%s\n", test_case.name,
+					      c->use_relocate_anon ? relocate_anon_suffix : "");
 			*failures += 1;
 		}
 	} else {
@@ -656,10 +709,13 @@ static void run_mremap_test_case(struct test test_case, int *failures,
 		 */
 		if (threshold_mb == VALIDATION_NO_THRESHOLD ||
 		    test_case.config.region_size <= threshold_mb * _1MB)
-			ksft_test_result_pass("%s\n\tmremap time: %12lldns\n",
-					      test_case.name, remap_time);
+			ksft_test_result_pass("%s%s\n\tmremap time: %12lldns\n",
+					      test_case.name,
+					      c->use_relocate_anon ? relocate_anon_suffix : "",
+					      remap_time);
 		else
-			ksft_test_result_pass("%s\n", test_case.name);
+			ksft_test_result_pass("%s%s\n", test_case.name,
+					      c->use_relocate_anon ? relocate_anon_suffix : "");
 	}
 }
 
@@ -703,8 +759,8 @@ static int parse_args(int argc, char **argv, unsigned int *threshold_mb,
 	return 0;
 }
 
-#define MAX_TEST 15
-#define MAX_PERF_TEST 3
+#define MAX_TEST 30
+#define MAX_PERF_TEST 6
 int main(int argc, char **argv)
 {
 	int failures = 0;
@@ -721,12 +777,15 @@ int main(int argc, char **argv)
 	char *rand_addr;
 	size_t rand_size;
 	int num_expand_tests = 2;
-	int num_misc_tests = 2;
+	int num_misc_tests = 6;
 	struct test test_cases[MAX_TEST] = {};
 	struct test perf_test_cases[MAX_PERF_TEST];
 	int page_size;
 	time_t t;
 	FILE *maps_fp;
+	bool use_relocate_anon = false;
+	struct test *test_case = test_cases;
+	struct test *perf_test_case = perf_test_cases;
 
 	pattern_seed = (unsigned int) time(&t);
 
@@ -763,66 +822,71 @@ int main(int argc, char **argv)
 
 	page_size = sysconf(_SC_PAGESIZE);
 
-	/* Expected mremap failures */
-	test_cases[0] =	MAKE_TEST(page_size, page_size, page_size,
-				  OVERLAPPING, EXPECT_FAILURE,
-				  "mremap - Source and Destination Regions Overlapping");
-
-	test_cases[1] = MAKE_TEST(page_size, page_size/4, page_size,
-				  NON_OVERLAPPING, EXPECT_FAILURE,
-				  "mremap - Destination Address Misaligned (1KB-aligned)");
-	test_cases[2] = MAKE_TEST(page_size/4, page_size, page_size,
-				  NON_OVERLAPPING, EXPECT_FAILURE,
-				  "mremap - Source Address Misaligned (1KB-aligned)");
-
-	/* Src addr PTE aligned */
-	test_cases[3] = MAKE_TEST(PTE, PTE, PTE * 2,
-				  NON_OVERLAPPING, EXPECT_SUCCESS,
-				  "8KB mremap - Source PTE-aligned, Destination PTE-aligned");
-
-	/* Src addr 1MB aligned */
-	test_cases[4] = MAKE_TEST(_1MB, PTE, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
-				  "2MB mremap - Source 1MB-aligned, Destination PTE-aligned");
-	test_cases[5] = MAKE_TEST(_1MB, _1MB, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
-				  "2MB mremap - Source 1MB-aligned, Destination 1MB-aligned");
-
-	/* Src addr PMD aligned */
-	test_cases[6] = MAKE_TEST(PMD, PTE, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
-				  "4MB mremap - Source PMD-aligned, Destination PTE-aligned");
-	test_cases[7] =	MAKE_TEST(PMD, _1MB, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
-				  "4MB mremap - Source PMD-aligned, Destination 1MB-aligned");
-	test_cases[8] = MAKE_TEST(PMD, PMD, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
-				  "4MB mremap - Source PMD-aligned, Destination PMD-aligned");
-
-	/* Src addr PUD aligned */
-	test_cases[9] = MAKE_TEST(PUD, PTE, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-				  "2GB mremap - Source PUD-aligned, Destination PTE-aligned");
-	test_cases[10] = MAKE_TEST(PUD, _1MB, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-				   "2GB mremap - Source PUD-aligned, Destination 1MB-aligned");
-	test_cases[11] = MAKE_TEST(PUD, PMD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-				   "2GB mremap - Source PUD-aligned, Destination PMD-aligned");
-	test_cases[12] = MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-				   "2GB mremap - Source PUD-aligned, Destination PUD-aligned");
-
-	/* Src and Dest addr 1MB aligned. 5MB mremap. */
-	test_cases[13] = MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, EXPECT_SUCCESS,
-				  "5MB mremap - Source 1MB-aligned, Destination 1MB-aligned");
-
-	/* Src and Dest addr 1MB aligned. 5MB mremap. */
-	test_cases[14] = MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, EXPECT_SUCCESS,
-				  "5MB mremap - Source 1MB-aligned, Dest 1MB-aligned with 40MB Preamble");
-	test_cases[14].config.dest_preamble_size = 10 * _4MB;
-
-	perf_test_cases[0] =  MAKE_TEST(page_size, page_size, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-					"1GB mremap - Source PTE-aligned, Destination PTE-aligned");
-	/*
-	 * mremap 1GB region - Page table level aligned time
-	 * comparison.
-	 */
-	perf_test_cases[1] = MAKE_TEST(PMD, PMD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-				       "1GB mremap - Source PMD-aligned, Destination PMD-aligned");
-	perf_test_cases[2] = MAKE_TEST(PUD, PUD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-				       "1GB mremap - Source PUD-aligned, Destination PUD-aligned");
+	do {
+		/* Expected mremap failures */
+		*test_case++ =	MAKE_TEST(page_size, page_size, page_size,
+					  OVERLAPPING, use_relocate_anon, EXPECT_FAILURE,
+					  "mremap - Source and Destination Regions Overlapping");
+
+		*test_case++ =	MAKE_TEST(page_size, page_size/4, page_size,
+					  NON_OVERLAPPING, use_relocate_anon, EXPECT_FAILURE,
+					  "mremap - Destination Address Misaligned (1KB-aligned)");
+		*test_case++ =	MAKE_TEST(page_size/4, page_size, page_size,
+					  NON_OVERLAPPING, use_relocate_anon, EXPECT_FAILURE,
+					  "mremap - Source Address Misaligned (1KB-aligned)");
+
+		/* Src addr PTE aligned */
+		*test_case++ =	MAKE_TEST(PTE, PTE, PTE * 2,
+					  NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					  "8KB mremap - Source PTE-aligned, Destination PTE-aligned");
+
+		/* Src addr 1MB aligned */
+		*test_case++ =	MAKE_TEST(_1MB, PTE, _2MB, NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					  "2MB mremap - Source 1MB-aligned, Destination PTE-aligned");
+		*test_case++ =	MAKE_TEST(_1MB, _1MB, _2MB, NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					  "2MB mremap - Source 1MB-aligned, Destination 1MB-aligned");
+
+		/* Src addr PMD aligned */
+		*test_case++ =	MAKE_TEST(PMD, PTE, _4MB, NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					  "4MB mremap - Source PMD-aligned, Destination PTE-aligned");
+		*test_case++ =	MAKE_TEST(PMD, _1MB, _4MB, NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					  "4MB mremap - Source PMD-aligned, Destination 1MB-aligned");
+		*test_case++ =	MAKE_TEST(PMD, PMD, _4MB, NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					  "4MB mremap - Source PMD-aligned, Destination PMD-aligned");
+
+		/* Src addr PUD aligned */
+		*test_case++ =	MAKE_TEST(PUD, PTE, _2GB, NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					  "2GB mremap - Source PUD-aligned, Destination PTE-aligned");
+		*test_case++ =	MAKE_TEST(PUD, _1MB, _2GB, NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					   "2GB mremap - Source PUD-aligned, Destination 1MB-aligned");
+		*test_case++ =	MAKE_TEST(PUD, PMD, _2GB, NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					   "2GB mremap - Source PUD-aligned, Destination PMD-aligned");
+		*test_case++ =	MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					   "2GB mremap - Source PUD-aligned, Destination PUD-aligned");
+
+		/* Src and Dest addr 1MB aligned. 5MB mremap. */
+		*test_case++ =	MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					   "5MB mremap - Source 1MB-aligned, Destination 1MB-aligned");
+
+		/* Src and Dest addr 1MB aligned. 5MB mremap. */
+		*test_case =	MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					   "5MB mremap - Source 1MB-aligned, Dest 1MB-aligned with 40MB Preamble");
+		test_case++->config.dest_preamble_size = 10 * _4MB;
+
+		*perf_test_case++ =	 MAKE_TEST(page_size, page_size, _1GB, NON_OVERLAPPING,
+						   use_relocate_anon, EXPECT_SUCCESS,
+						"1GB mremap - Source PTE-aligned, Destination PTE-aligned");
+		/*
+		 * mremap 1GB region - Page table level aligned time
+		 * comparison.
+		 */
+		*perf_test_case++ =	MAKE_TEST(PMD, PMD, _1GB, NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					       "1GB mremap - Source PMD-aligned, Destination PMD-aligned");
+		*perf_test_case++ =	MAKE_TEST(PUD, PUD, _1GB, NON_OVERLAPPING, use_relocate_anon, EXPECT_SUCCESS,
+					       "1GB mremap - Source PUD-aligned, Destination PUD-aligned");
+
+		use_relocate_anon = !use_relocate_anon;
+	} while (use_relocate_anon);
 
 	run_perf_tests =  (threshold_mb == VALIDATION_NO_THRESHOLD) ||
 				(threshold_mb * _1MB >= _1GB);
@@ -846,8 +910,18 @@ int main(int argc, char **argv)
 
 	fclose(maps_fp);
 
-	mremap_move_within_range(pattern_seed, rand_addr);
-	mremap_move_1mb_from_start(pattern_seed, rand_addr);
+	mremap_move_within_range(pattern_seed, rand_addr,
+				 "", 0);
+	mremap_move_within_range(pattern_seed, rand_addr,
+				 " [MREMAP_RELOCATE_ANON]", MREMAP_RELOCATE_ANON);
+	mremap_move_within_range(pattern_seed, rand_addr,
+				 " [MREMAP_MUST_RELOCATE_ANON]", MREMAP_MUST_RELOCATE_ANON);
+	mremap_move_1mb_from_start(pattern_seed, rand_addr,
+				   "", 0);
+	mremap_move_1mb_from_start(pattern_seed, rand_addr,
+				   " [MREMAP_RELOCATE_ANON]", MREMAP_RELOCATE_ANON);
+	mremap_move_1mb_from_start(pattern_seed, rand_addr,
+				   " [MREMAP_MUST_RELOCATE_ANON]", MREMAP_MUST_RELOCATE_ANON);
 
 	if (run_perf_tests) {
 		ksft_print_msg("\n%s\n",
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 09/11] tools/testing/selftests: have CoW self test use MREMAP_RELOCATE_ANON
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
                   ` (7 preceding siblings ...)
  2025-06-09 13:26 ` [PATCH 08/11] tools/testing/selftests: expand mremap() tests for MREMAP_RELOCATE_ANON Lorenzo Stoakes
@ 2025-06-09 13:26 ` Lorenzo Stoakes
  2025-06-09 13:26 ` [PATCH 10/11] tools/testing/selftests: test relocate anon in split huge page test Lorenzo Stoakes
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 13:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

It is useful to have the CoW self-test invoke MREMAP_RELOCATE_ANON on
partial THP mappings, as this exercises the folio split code paths and
allows us to assert that they behave correctly.

Add an additional set of tests to explicitly do so.
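
The new case is roughly of the following shape (a sketch only - mem,
mremap_size and mremap_mem mirror the variables used in
do_run_with_thp()):

  /* Move the second half of a THP-backed range; with
   * MREMAP_RELOCATE_ANON this exercises the folio split path, since the
   * moved half's folio metadata must be rewritten. */
  tmp = sys_mremap(mem + mremap_size, mremap_size, mremap_size,
                   MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_RELOCATE_ANON,
                   mremap_mem);
  if (tmp != mremap_mem)
          log_test_result(KSFT_FAIL);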

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/selftests/mm/cow.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index dbbcc5eb3dce..c483bfd4269e 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -845,13 +845,14 @@ enum thp_run {
 	THP_RUN_SINGLE_PTE,
 	THP_RUN_SINGLE_PTE_SWAPOUT,
 	THP_RUN_PARTIAL_MREMAP,
+	THP_RUN_PARTIAL_MREMAP_RELOCATE_ANON,
 	THP_RUN_PARTIAL_SHARED,
 };
 
 static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize)
 {
 	char *mem, *mmap_mem, *tmp, *mremap_mem = MAP_FAILED;
-	size_t size, mmap_size, mremap_size;
+	size_t size, mmap_size, mremap_size, mremap_flags;
 	int ret;
 
 	/* For alignment purposes, we need twice the thp size. */
@@ -927,6 +928,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize)
 		size = pagesize;
 		break;
 	case THP_RUN_PARTIAL_MREMAP:
+	case THP_RUN_PARTIAL_MREMAP_RELOCATE_ANON:
 		/*
 		 * Remap half of the THP. We need some new memory location
 		 * for that.
@@ -939,8 +941,13 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize)
 			log_test_result(KSFT_FAIL);
 			goto munmap;
 		}
-		tmp = mremap(mem + mremap_size, mremap_size, mremap_size,
-			     MREMAP_MAYMOVE | MREMAP_FIXED, mremap_mem);
+
+		mremap_flags = MREMAP_MAYMOVE | MREMAP_FIXED;
+		if (thp_run == THP_RUN_PARTIAL_MREMAP_RELOCATE_ANON)
+			mremap_flags |= MREMAP_RELOCATE_ANON;
+
+		tmp = sys_mremap(mem + mremap_size, mremap_size, mremap_size,
+				 mremap_flags, mremap_mem);
 		if (tmp != mremap_mem) {
 			ksft_perror("mremap() failed");
 			log_test_result(KSFT_FAIL);
@@ -1052,6 +1059,13 @@ static void run_with_partial_mremap_thp(test_fn fn, const char *desc, size_t siz
 	do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP, size);
 }
 
+static void run_with_partial_mremap_relocate_anon_thp(test_fn fn, const char *desc, size_t size)
+{
+	ksft_print_msg("[RUN] %s ... with partially mremap(MREMAP_RELOCATE_ANON)'ed THP (%zu kB)\n",
+		desc, size / 1024);
+	do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP_RELOCATE_ANON, size);
+}
+
 static void run_with_partial_shared_thp(test_fn fn, const char *desc, size_t size)
 {
 	log_test_start("%s ... with partially shared THP (%zu kB)",
@@ -1247,6 +1261,7 @@ static void run_anon_test_case(struct test_case const *test_case)
 		run_with_single_pte_of_thp(test_case->fn, test_case->desc, size);
 		run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, size);
 		run_with_partial_mremap_thp(test_case->fn, test_case->desc, size);
+		run_with_partial_mremap_relocate_anon_thp(test_case->fn, test_case->desc, size);
 		run_with_partial_shared_thp(test_case->fn, test_case->desc, size);
 
 		thp_pop_settings();
@@ -1270,7 +1285,7 @@ static int tests_per_anon_test_case(void)
 {
 	int tests = 2 + nr_hugetlbsizes;
 
-	tests += 6 * nr_thpsizes;
+	tests += 7 * nr_thpsizes;
 	if (pmdsize)
 		tests += 2;
 	return tests;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 10/11] tools/testing/selftests: test relocate anon in split huge page test
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
                   ` (8 preceding siblings ...)
  2025-06-09 13:26 ` [PATCH 09/11] tools/testing/selftests: have CoW self test use MREMAP_RELOCATE_ANON Lorenzo Stoakes
@ 2025-06-09 13:26 ` Lorenzo Stoakes
  2025-06-09 13:26 ` [PATCH 11/11] tools/testing/selftests: add MREMAP_RELOCATE_ANON fork tests Lorenzo Stoakes
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 13:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

It's useful to explicitly test splitting of huge pages with
MREMAP_RELOCATE_ANON set, as this exercises the undo logic and ensures that
it functions correctly.

Expand the tests to do so in the case where an anonymous mremap() is
performed, and utilise the shared sys_mremap() helper to allow the new
mremap flag to be specified (it would otherwise be filtered out by glibc).

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 .../selftests/mm/split_huge_page_test.c       | 25 +++++++++++++------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index aa7400ed0e99..1fb0c7e0318e 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -19,6 +19,7 @@
 #include <malloc.h>
 #include <stdbool.h>
 #include <time.h>
+#include <linux/mman.h>
 #include "vm_util.h"
 #include "../kselftest.h"
 
@@ -180,7 +181,7 @@ void split_pmd_thp_to_order(int order)
 	free(one_page);
 }
 
-void split_pte_mapped_thp(void)
+void split_pte_mapped_thp(bool relocate_anon)
 {
 	char *one_page, *pte_mapped, *pte_mapped2;
 	size_t len = 4 * pmd_pagesize;
@@ -221,10 +222,14 @@ void split_pte_mapped_thp(void)
 
 	/* remap the Nth pagesize of Nth THP */
 	for (i = 1; i < 4; i++) {
-		pte_mapped2 = mremap(one_page + pmd_pagesize * i + pagesize * i,
-				     pagesize, pagesize,
-				     MREMAP_MAYMOVE|MREMAP_FIXED,
-				     pte_mapped + pagesize * i);
+		int mremap_flags = MREMAP_MAYMOVE|MREMAP_FIXED;
+
+		if (relocate_anon)
+			mremap_flags |= MREMAP_RELOCATE_ANON;
+
+		pte_mapped2 = sys_mremap(one_page + pmd_pagesize * i + pagesize * i,
+					 pagesize, pagesize, mremap_flags,
+					 pte_mapped + pagesize * i);
 		if (pte_mapped2 == MAP_FAILED)
 			ksft_exit_fail_msg("mremap failed: %s\n", strerror(errno));
 	}
@@ -257,7 +262,10 @@ void split_pte_mapped_thp(void)
 	if (thp_size)
 		ksft_exit_fail_msg("Still %ld THPs not split\n", thp_size);
 
-	ksft_test_result_pass("Split PTE-mapped huge pages successful\n");
+	if (relocate_anon)
+		ksft_test_result_pass("Split PTE-mapped huge pages w/MREMAP_RELOCATE_ANON successful\n");
+	else
+		ksft_test_result_pass("Split PTE-mapped huge pages successful\n");
 	munmap(one_page, len);
 	close(pagemap_fd);
 	close(kpageflags_fd);
@@ -534,7 +542,7 @@ int main(int argc, char **argv)
 	if (argc > 1)
 		optional_xfs_path = argv[1];
 
-	ksft_set_plan(1+8+1+9+9+8*4+2);
+	ksft_set_plan(1+8+1+1+9+9+8*4+2);
 
 	pagesize = getpagesize();
 	pageshift = ffs(pagesize) - 1;
@@ -550,7 +558,8 @@ int main(int argc, char **argv)
 		if (i != 1)
 			split_pmd_thp_to_order(i);
 
-	split_pte_mapped_thp();
+	split_pte_mapped_thp(/* relocate_anon= */false);
+	split_pte_mapped_thp(/* relocate_anon= */true);
 	for (i = 0; i < 9; i++)
 		split_file_backed_thp(i);
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 11/11] tools/testing/selftests: add MREMAP_RELOCATE_ANON fork tests
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
                   ` (9 preceding siblings ...)
  2025-06-09 13:26 ` [PATCH 10/11] tools/testing/selftests: test relocate anon in split huge page test Lorenzo Stoakes
@ 2025-06-09 13:26 ` Lorenzo Stoakes
  2025-06-16 20:24 ` [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON David Hildenbrand
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 13:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

Add tests explicitly asserting that mremap() fails on forked VMAs, whether
they have parent anon_vmas or child ones, as these are cases where a folio
might be mapped by multiple processes - here we do not even attempt to
relocate folio metadata, but rather simply disallow the operation.

The tests use MREMAP_MUST_RELOCATE_ANON so we can detect the failure
correctly.
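
In outline, each such assertion looks roughly like the following (a
sketch only - ptr, new_addr and page_size stand in for the tests' actual
variables, and EFAULT is the error the tests check for):

  ptr[0] = 'x';   /* fault in, so the VMA gains an anon_vma */

  if (fork() != 0) {
          /* Parent: the folios may now also be mapped by the child, so
           * a mandatory relocation is refused. */
          void *ret = sys_mremap(ptr, page_size, page_size,
                                 MREMAP_MAYMOVE | MREMAP_FIXED |
                                 MREMAP_MUST_RELOCATE_ANON, new_addr);

          assert(ret == MAP_FAILED && errno == EFAULT);
  }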

Were the mremap() calls to succeed, a merge would occur, so it remains
appropriate to keep these tests in the merge test suite.

We also explicitly test the anon_vma reuse case. This is the case where a
munmap() occurs on a VMA whose anon_vma is a parent which still has
children - the now-empty anon_vma is kept around, and we then attempt to
reuse it on the next fork of a VMA whose anon_vma references this empty
anon_vma (a short sketch of the sequence follows the diagrams below).

Consider the first case over 3 forks:

FORK 3 TIMES ONE AFTER ANOTHER
==============================

Process 1

     |-------------|
     | avc*        v
  |-----|   vma |-----|
  |  A  |<------| avc |
  |-----|       |-----|
     | anon_vma    ^
     |             |
     v     rb_root |
  |---------------------|
  | refcount = 1        |<--|
  | num_children = 1    |   |
  | num_active_vmas = 1 |---| parent
  |---------------------|

FORK

Process 1

     |-------------|
     | avc         v
  |-----|   vma |-----|
  |  A  |<------| avc |
  |-----|       |-----|
     | anon_vma    ^
     |             |--------------------|
     v     rb_root |                    |
  |---------------------|               |
  | refcount = 2        | (self-parent) |
  | num_children = 2    |<--------------x----|
  | num_active_vmas = 1 |               |    |
  |---------------------|               |    |
                           |------------|    |
Process 2                  |                 |
                           |                 |
     |-------------|       |                 |
     | avc         v       v                 |
  |-----|   vma |-----| |-----|              |
  |  B  |<------| avc |.| avc |              |
  |-----|       |-----| |-----|              |
     | anon_vma    ^                         |
     |             |                         |
     v     rb_root |                         |
  |---------------------|                    |
  | refcount = 1        |                    |
  | num_children = 0    |--------------------|
  | num_active_vmas = 1 | parent
  |---------------------|

FORK

Process 1

     |-------------|
     | avc         v
  |-----|   vma |-----|
  |  A  |<------| avc |
  |-----|       |-----|        |--------------------------|
     | anon_vma    ^           |                          |
     |             |--------------------|                 |
     v     rb_root |                    |                 |
  |---------------------|               |                 |
  | refcount = 3        | (self-parent) |                 |
  | num_children = 2    |<--------------x----|            |
  | num_active_vmas = 1 |               |    |            |
  |---------------------|               |    |            |
                           |------------|    |            |
Process 2                  |                 |            |
                           |                 |            |
     |-------------|       |                 |            |
     | avc         v       v                 |            |
  |-----|   vma |-----| |-----|              |            |
  |  B  |<------| avc |.| avc |              |            |
  |-----|       |-----| |-----|              |            |
     | anon_vma    ^                         |            |
     |             |-------------------------x--------|   |
     v     rb_root |                         |        |   |
  |---------------------|                    |        |   |
  | refcount = 1        |<-------------------x----|   |   |
  | num_children = 1    |--------------------|    |   |   |
  | num_active_vmas = 1 | parent                  |   |   |
  |---------------------|                         |   |   |
                                                  |   |   |
Process 3                                         |   |   |
                           |----------------------x---|   |
     |-------------|       |       |--------------x-------|
     | avc         v       v       v              |
  |-----|   vma |-----| |-----| |-----|           |
  |  C  |<------| avc |.| avc |.| avc |           |
  |-----|       |-----| |-----| |-----|           |
     | anon_vma    ^                              |
     |             |                              |
     v     rb_root |                              |
  |---------------------|                         |
  | refcount = 1        |                         |
  | num_children = 0    |-------------------------|
  | num_active_vmas = 1 | parent
  |---------------------|

FORK

Process 1

     |-------------|
     | avc         v
  |-----|   vma |-----|             |---------------------------------|
  |  A  |<------| avc |             |                                 |
  |-----|       |-----|        |--------------------------|           |
     | anon_vma    ^           |                          |           |
     |             |--------------------|                 |           |
     v     rb_root |                    |                 |           |
  |---------------------|               |                 |           |
  | refcount = 4        | (self-parent) |                 |           |
  | num_children = 2    |<--------------x----|            |           |
  | num_active_vmas = 1 |               |    |            |           |
  |---------------------|               |    |            |           |
                           |------------|    |            |           |
Process 2                  |                 |            |           |
                           |                 |            |           |
     |-------------|       |                 |            |           |
     | avc         v       v                 |            |           |
  |-----|   vma |-----| |-----|              |            |           |
  |  B  |<------| avc |.| avc |              |            |           |
  |-----|       |-----| |-----|     |--------x------------x-------|   |
     | anon_vma    ^                |        |            |       |   |
     |             |-------------------------x--------|   |       |   |
     v     rb_root |                         |        |   |       |   |
  |---------------------|                    |        |   |       |   |
  | refcount = 1        |<-------------------x----|   |   |       |   |
  | num_children = 1    |--------------------|    |   |   |       |   |
  | num_active_vmas = 1 | parent                  |   |   |       |   |
  |---------------------|                         |   |   |       |   |
                                                  |   |   |       |   |
Process 3                                         |   |   |       |   |
                           |----------------------x---|   |       |   |
     |-------------|       |       |--------------x-------|       |   |
     | avc         v       v       v              |               |   |
  |-----|   vma |-----| |-----| |-----|           |               |   |
  |  C  |<------| avc |.| avc |.| avc |           |               |   |
  |-----|       |-----| |-----| |-----|           |               |   |
     | anon_vma    ^                              |               |   |
     |             |------------------------------x-----------|   |   |
     v     rb_root |                              |           |   |   |
  |---------------------|                         |           |   |   |
  | refcount = 1        |<------------------------x---|       |   |   |
  | num_children = 1    |-------------------------|   |       |   |   |
  | num_active_vmas = 1 | parent                      |       |   |   |
  |---------------------|                             |       |   |   |
                                                      |       |   |   |
Process 4                                             |       |   |   |
                           |--------------------------x-------|   |   |
     |-------------|       |       |------------------x-----------|   |
     | avc         v       v       v       v----------x---------------|
  |-----|   vma |-----| |-----| |-----| |-----|       |
  |  D  |<------| avc |.| avc |.| avc |.| avc |       |
  |-----|       |-----| |-----| |-----| |-----|       |
     | anon_vma    ^                                  |
     |             |                                  |
     v     rb_root |                                  |
  |---------------------|                             |
  | refcount = 1        |                             |
  | num_children = 0    |-----------------------------|
  | num_active_vmas = 1 | parent
  |---------------------|

We can see that at no point do we lack either a raised num_children count
or a raised anon_vma_chain list count.

Equally with anon_vma reuse:

FORK 3 TIMES ONE AFTER ANOTHER, UNMAPPING AFTER FORK FOR ANON_VMA REUSE
=======================================================================

Process 1

     |-------------|
     | avc*        v
  |-----|   vma |-----|
  |  A  |<------| avc |
  |-----|       |-----|
     | anon_vma    ^
     |             |
     v     rb_root |
  |---------------------|
  | refcount = 1        |<--|
  | num_children = 1    |   |
  | num_active_vmas = 1 |---| parent
  |---------------------|

FORK

Process 1

     |-------------|
     | avc         v
  |-----|   vma |-----|
  |  A  |<------| avc |
  |-----|       |-----|
     | anon_vma    ^
     |             |--------------------|
     v     rb_root |                    |
  |---------------------|               |
  | refcount = 2        | (self-parent) |
  | num_children = 2    |<--------------x----|
  | num_active_vmas = 1 |               |    |
  |---------------------|               |    |
                           |------------|    |
Process 2                  |                 |
                           |                 |
     |-------------|       |                 |
     | avc         v       v                 |
  |-----|   vma |-----| |-----|              |
  |  B  |<------| avc |.| avc |              |
  |-----|       |-----| |-----|              |
     | anon_vma    ^                         |
     |             |                         |
     v     rb_root |                         |
  |---------------------|                    |
  | refcount = 1        |                    |
  | num_children = 0    |--------------------|
  | num_active_vmas = 1 | parent
  |---------------------|

FORK

Process 1

     |-------------|
     | avc         v
  |-----|   vma |-----|
  |  A  |<------| avc |
  |-----|       |-----|        |--------------------------|
     | anon_vma    ^           |                          |
     |             |--------------------|                 |
     v     rb_root |                    |                 |
  |---------------------|               |                 |
  | refcount = 3        | (self-parent) |                 |
  | num_children = 2    |<--------------x----|            |
  | num_active_vmas = 1 |               |    |            |
  |---------------------|               |    |            |
                           |------------|    |            |
Process 2                  |                 |            |
                           |                 |            |
     |-------------|       |                 |            |
     | avc         v       v                 |            |
  |-----|   vma |-----| |-----|              |            |
  |  B  |<------| avc |.| avc |              |            |
  |-----|       |-----| |-----|              |            |
     | anon_vma    ^                         |            |
     |             |-------------------------x--------|   |
     v     rb_root |                         |        |   |
  |---------------------|                    |        |   |
  | refcount = 1        |<-------------------x----|   |   |
  | num_children = 1    |--------------------|    |   |   |
  | num_active_vmas = 1 | parent                  |   |   |
  |---------------------|                         |   |   |
                                                  |   |   |
Process 3                                         |   |   |
                           |----------------------x---|   |
     |-------------|       |       |--------------x-------|
     | avc         v       v       v              |
  |-----|   vma |-----| |-----| |-----|           |
  |  C  |<------| avc |.| avc |.| avc |           |
  |-----|       |-----| |-----| |-----|           |
     | anon_vma    ^                              |
     |             |                              |
     v     rb_root |                              |
  |---------------------|                         |
  | refcount = 1        |                         |
  | num_children = 0    |-------------------------|
  | num_active_vmas = 1 | parent
  |---------------------|

UNMAP B

Process 1

     |-------------|
     | avc         v
  |-----|   vma |-----|
  |  A  |<------| avc |
  |-----|       |-----|        |--------------------------|
     | anon_vma    ^           |                          |
     |             |------------                          |
     v     rb_root |                                      |
  |---------------------|                                 |
  | refcount = 3        | (self-parent)                   |
  | num_children = 2    |<-------------------|            |
  | num_active_vmas = 1 |                    |            |
  |---------------------|                    |            |
                                             |            |
Process 2                                    |            |
                                             |            |
                   |-------------------------x--------|   |
           rb_root |                         |        |   |
  |---------------------|                    |        |   |
  | refcount = 1        |<-------------------x----|   |   | We keep empty
  | num_children = 1    |--------------------|    |   |   | anon_vma around.
  | num_active_vmas = 0 | parent                  |   |   |
  |---------------------|                         |   |   |
                                                  |   |   |
Process 3                                         |   |   |
                           |----------------------x---|   |
     |-------------|       |       |--------------x-------|
     | avc         v       v       v              |
  |-----|   vma |-----| |-----| |-----|           |
  |  C  |<------| avc |.| avc |.| avc |           |
  |-----|       |-----| |-----| |-----|           |
     | anon_vma    ^                              |
     |             |                              |
     v     rb_root |                              |
  |---------------------|                         |
  | refcount = 1        |                         |
  | num_children = 0    |-------------------------|
  | num_active_vmas = 1 | parent
  |---------------------|

FORK

Process 1

     |-------------|
     | avc         v
  |-----|   vma |-----|             |-----------------------------|
  |  A  |<------| avc |             |                             |
  |-----|       |-----|        |--------------------------|       |
     | anon_vma    ^           |                          |       |
     |             |-----------|                          |       |
     v     rb_root |                                      |       |
  |---------------------|                                 |       |
  | refcount = 3        | (self-parent)                   |       |
  | num_children = 2    |<-------------------|            |       |
  | num_active_vmas = 1 |                    |            |       |
  |---------------------|                    |            |       |
                                             |            |       |
Process 2               |--------------------x------------x-------x-------|
                        |                    |            |       |       |
                   |-------------------------x--------|   |       |       |
           rb_root |                         |        |   |       |       |
  |---------------------|<-------------------x--------x---x-------x---|   |
  | refcount = 1        |<-------------------x----|   |   |       |   |   |
  | num_children = 1    |--------------------|    |   |   |       |   |   |
  | num_active_vmas = 1 | parent                  |   |   |       |   |   |
  |---------------------|                         |   |   |       |   |   |
                                                  |   |   |       |   |   |
Process 3                                         |   |   |       |   |   |
                           |----------------------x---|   |       |   |   |
     |-------------|       |       |--------------x-------|       |   |   |
     | avc         v       v       v              |               |   |   |
  |-----|   vma |-----| |-----| |-----|           |               |   |   |
  |  C  |<------| avc |.| avc |.| avc |           |               |   |   |
  |-----|       |-----| |-----| |-----|           |               |   |   |
     | anon_vma    ^                              |               |   |   |
     |             |------------------------------x-----------|   |   |   |
     v     rb_root |                              |           |   |   |   |
  |---------------------|                         |           |   |   |   |
  | refcount = 1        |                         |           |   |   |   |
  | num_children = 0    |-------------------------|           |   |   |   |
  | num_active_vmas = 1 | parent                              |   |   |   |
  |---------------------|                                     |   |   |   |
                                                              |   |   |   |
Process 4                                                     |   |   |   |
                           |----------------------------------|   |   |   |
     |-------------|       |       |------------------------------|   |   |
     | avc         v       v       v                                  |   |
  |-----|   vma |-----| |-----| |-----|                               |   |
  |  D  |<------| avc |.| avc |.| avc |                               |   |
  |-----|       |-----| |-----| |-----|                               |   |
     | anon_vma    ^                                                  |   |
     |             |--------------------------------------------------x---|
     |                                                                |
     |----------------------------------------------------------------|

We reuse the empty anon_vma from VMA B. Note that process 3 is now parented
to process 4's (and 2's) anon_vma.
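
For reference, the user-visible sequence that provokes this reuse is
simply the following (an unsynchronised sketch with a hard-coded page
size - the actual test coordinates the processes via a shared mapping and
performs an MREMAP_MUST_RELOCATE_ANON attempt at each stage):

  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
          size_t len = 3 * 4096;
          char *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_ANON | MAP_PRIVATE, -1, 0);

          ptr[0] = 'x';                   /* fault in A (process 1) */

          if (fork() != 0) {              /* process 1 keeps A */
                  wait(NULL);
                  return 0;
          }

          /* Process 2 (VMA B). */
          if (fork() != 0) {
                  munmap(ptr, len);       /* B's anon_vma is now empty */
                  wait(NULL);
                  return 0;
          }

          /* Process 3 (VMA C): its next fork may reuse B's empty
           * anon_vma for the child's VMA D. */
          if (fork() != 0) {
                  wait(NULL);
                  return 0;
          }

          /* Process 4 (VMA D). */
          return 0;
  }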

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/selftests/mm/merge.c | 361 +++++++++++++++++++++++++++++
 1 file changed, 361 insertions(+)

diff --git a/tools/testing/selftests/mm/merge.c b/tools/testing/selftests/mm/merge.c
index b658f2f3a94b..f03ee15cbfc5 100644
--- a/tools/testing/selftests/mm/merge.c
+++ b/tools/testing/selftests/mm/merge.c
@@ -4,6 +4,7 @@
 #include "../kselftest_harness.h"
 #include <linux/prctl.h>
 #include <fcntl.h>
+#include <errno.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
@@ -15,11 +16,18 @@
 #include "vm_util.h"
 #include <linux/mman.h>
 
+enum poll_action {
+	POLL_TASK_RUN,
+	POLL_TASK_WAIT,
+	POLL_TASK_EXIT,
+};
+
 FIXTURE(merge)
 {
 	unsigned int page_size;
 	char *carveout;
 	struct procmap_fd procmap;
+	volatile enum poll_action *ipc;
 };
 
 FIXTURE_SETUP(merge)
@@ -31,6 +39,11 @@ FIXTURE_SETUP(merge)
 	ASSERT_NE(self->carveout, MAP_FAILED);
 	/* Setup PROCMAP_QUERY interface. */
 	ASSERT_EQ(open_self_procmap(&self->procmap), 0);
+
+	/* Quick and dirty IPC. */
+	self->ipc = (volatile enum poll_action *)mmap(NULL, self->page_size,
+			PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANON, -1, 0);
+	ASSERT_NE(self->ipc, MAP_FAILED);
 }
 
 FIXTURE_TEARDOWN(merge)
@@ -42,6 +55,7 @@ FIXTURE_TEARDOWN(merge)
 	 * fails (KSM may be disabled for instance).
 	 */
 	prctl(PR_SET_MEMORY_MERGE, 0, 0, 0, 0);
+	ASSERT_EQ(munmap((void *)self->ipc, self->page_size), 0);
 }
 
 TEST_F(merge, mprotect_unfaulted_left)
@@ -1898,4 +1912,351 @@ TEST_F(merge, mremap_relocate_anon_mprotect_faulted_faulted)
 	ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size);
 }
 
+TEST_F(merge, mremap_relocate_anon_single_fork)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	volatile enum poll_action *poll = self->ipc;
+	char *ptr, *ptr2;
+	pid_t pid2;
+	int err;
+
+	/*
+	 * .           .           .
+	 * . |-------| .           .  Map A, fault in and
+	 * . |   A   |-.-----|     .  fork process 1 to
+	 * . |-------| .     |     .  process 2.
+	 * .           .     v     .
+	 * .           . |-------| .
+	 * .           . |   B   | .
+	 * .           . |-------| .
+	 * .           .           .
+	 * . Process 1 . Process 2 .
+	 */
+	ptr = mmap(carveout, 3 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	/* Fault it in. */
+	ptr[0] = 'x';
+
+	pid2 = fork();
+	ASSERT_NE(pid2, -1);
+	/* Parent process. */
+	if (pid2 != 0) {
+		/* mremap() fails due to forked children. */
+		ptr2 = sys_mremap(ptr, page_size, page_size,
+				  MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+				  &carveout[3 * page_size]);
+		err = errno;
+		ASSERT_EQ(ptr2, MAP_FAILED);
+		ASSERT_EQ(err, EFAULT);
+
+		poll[0] = POLL_TASK_EXIT;
+
+		wait(NULL);
+		return;
+	}
+
+	/* This is process 2. */
+
+	/* mremap() fails due to forked parents. */
+	ptr2 = sys_mremap(ptr, page_size, page_size,
+		MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+		&carveout[3 * page_size]);
+	err = errno;
+	ASSERT_EQ(ptr2, MAP_FAILED);
+	ASSERT_EQ(err, EFAULT);
+
+	/* Wait for parent to finish. */
+	while (poll[0] == POLL_TASK_RUN)
+		;
+}
+
+TEST_F(merge, mremap_relocate_anon_fork_twice)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	volatile enum poll_action *poll = self->ipc;
+	char *ptr, *ptr2;
+	pid_t pid2, pid3;
+	int err;
+
+	/*
+	 * .           .           .
+	 * . |-------| .           .  Map A, fault in and
+	 * . |   A   |-.-----|     .  fork process 1 to
+	 * . |-------| .     |     .  process 2.
+	 * .           .     v     .
+	 * .           . |-------| .
+	 * .           . |   B   | .
+	 * .           . |-------| .
+	 * .           .           .
+	 * . Process 1 . Process 2 .
+	 */
+	ptr = mmap(carveout, 3 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	/* Fault it in. */
+	ptr[0] = 'x';
+	pid2 = fork();
+	ASSERT_NE(pid2, -1);
+	/* If parent process, simply wait. */
+	if (pid2 != 0) {
+		while (true) {
+			if (poll[0] == POLL_TASK_EXIT)
+				break;
+			if (poll[0] == POLL_TASK_WAIT)
+				continue;
+
+			/* mremap() fails due to forked children. */
+			ptr2 = sys_mremap(ptr, page_size, page_size,
+					  MREMAP_MAYMOVE | MREMAP_FIXED |
+					  MREMAP_MUST_RELOCATE_ANON,
+					  &carveout[3 * page_size]);
+			err = errno;
+			ASSERT_EQ(ptr2, MAP_FAILED);
+			ASSERT_EQ(err, EFAULT);
+
+			/* Strictly, should be atomic. */
+			if (poll[0] == POLL_TASK_RUN)
+				poll[0] = POLL_TASK_WAIT;
+		}
+
+		wait(NULL);
+		return;
+	}
+
+	/* This is process 2. */
+
+	/* Wait for parent to finish. */
+	while (poll[0] == POLL_TASK_RUN)
+		;
+
+	/*
+	 * .           .           .           .
+	 * . |-------| .           .           .
+	 * . |   A   | .           .           .
+	 * . |-------| .           .           .
+	 * .           .           .           .
+	 * .           . |-------| .           . Fork process 2 to
+	 * .           . |   B   |-.-----|     . process 3.
+	 * .           . |-------| .     |     .
+	 * .           .           .     v     .
+	 * .           .           . |-------| .
+	 * .           .           . |   C   | .
+	 * .           .           . |-------| .
+	 * .           .           .           .
+	 * . Process 1 . Process 2 . Process 3 .
+	 */
+	pid3 = fork();
+	ASSERT_NE(pid3, -1);
+	/* If parent process, simply wait. */
+	if (pid3 != 0) {
+		/* mremap() fails due to forked children. */
+		ptr2 = sys_mremap(ptr, page_size, page_size,
+				  MREMAP_MAYMOVE | MREMAP_FIXED |
+				  MREMAP_MUST_RELOCATE_ANON,
+				  &carveout[3 * page_size]);
+		err = errno;
+		ASSERT_EQ(ptr2, MAP_FAILED);
+		ASSERT_EQ(err, EFAULT);
+
+		/* We don't retrigger, so just indicate we're done. */
+		poll[1] = POLL_TASK_EXIT;
+
+		wait(NULL);
+		return;
+	}
+
+	/* This is process 3. */
+
+	/* Trigger root mremap(). */
+	poll[0] = POLL_TASK_RUN;
+	/* Wait for parents to finish. */
+
+	while (poll[0] == POLL_TASK_RUN)
+		;
+	while (poll[1] == POLL_TASK_RUN)
+		;
+
+	/* mremap() fails due to forked parents. */
+	ptr2 = sys_mremap(ptr, page_size, page_size,
+		MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+		&carveout[3 * page_size]);
+	err = errno;
+	ASSERT_EQ(ptr2, MAP_FAILED);
+	ASSERT_EQ(err, EFAULT);
+	/* Kill waiting parent. */
+	poll[0] = POLL_TASK_EXIT;
+}
+
+TEST_F(merge, mremap_relocate_anon_3_times_reuse_anon_vma)
+{
+	unsigned int page_size = self->page_size;
+	char *carveout = self->carveout;
+	volatile enum poll_action *poll = self->ipc;
+	char *ptr, *ptr2;
+	pid_t pid2, pid3, pid4;
+	int err;
+
+	/*
+	 * .           .           .
+	 * . |-------| .           .  Map A, fault in and
+	 * . |   A   |-.-----|     .  fork process 1 to
+	 * . |-------| .     |     .  process 2.
+	 * .           .     v     .
+	 * .           . |-------| .
+	 * .           . |   B   | .
+	 * .           . |-------| .
+	 * .           .           .
+	 * . Process 1 . Process 2 .
+	 */
+	ptr = mmap(carveout, 3 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	/* Fault it in. */
+	ptr[0] = 'x';
+	pid2 = fork();
+	ASSERT_NE(pid2, -1);
+	/* If parent process, simply wait. */
+	if (pid2 != 0) {
+		while (true) {
+			if (poll[0] == POLL_TASK_EXIT)
+				break;
+			if (poll[0] == POLL_TASK_WAIT)
+				continue;
+
+			/* mremap() fails due to forked children. */
+			ptr2 = sys_mremap(ptr, page_size, page_size,
+					  MREMAP_MAYMOVE | MREMAP_FIXED |
+					  MREMAP_MUST_RELOCATE_ANON,
+					  &carveout[3 * page_size]);
+			err = errno;
+			ASSERT_EQ(ptr2, MAP_FAILED);
+			ASSERT_EQ(err, EFAULT);
+
+			if (poll[0] == POLL_TASK_RUN)
+				poll[0] = POLL_TASK_WAIT;
+		}
+
+		wait(NULL);
+		return;
+	}
+
+	/* This is process 2. */
+
+	/* Wait for parent to finish. */
+	while (poll[0] == POLL_TASK_RUN)
+		;
+
+	/*
+	 * .           .           .           .
+	 * . |-------| .           .           .
+	 * . |   A   | .           .           .
+	 * . |-------| .           .           .
+	 * .           .           .           .
+	 * .           . |-------| .           . Fork process 2 to
+	 * .           . |   B   |-.-----|     . process 3.
+	 * .           . |-------| .     |     .
+	 * .           .           .     v     .
+	 * .           .           . |-------| .
+	 * .           .           . |   C   | .
+	 * .           .           . |-------| .
+	 * .           .           .           .
+	 * . Process 1 . Process 2 . Process 3 .
+	 */
+	pid3 = fork();
+	ASSERT_NE(pid3, -1);
+	/* If parent process, simply wait. */
+	if (pid3 != 0) {
+		/*
+		 * We only try to mremap() once before unmapping, so we can
+		 * trigger reuse of B's anon_vma.
+		 */
+		/* mremap() fails due to forked children. */
+		ptr2 = sys_mremap(ptr, page_size, page_size,
+				  MREMAP_MAYMOVE | MREMAP_FIXED |
+				  MREMAP_MUST_RELOCATE_ANON,
+				  &carveout[3 * page_size]);
+		err = errno;
+		ASSERT_EQ(ptr2, MAP_FAILED);
+		ASSERT_EQ(err, EFAULT);
+
+		/*
+		 * .           .           .           .
+		 * . |-------| .           .           .
+		 * . |   A   | .           .           .
+		 * . |-------| .           .           .
+		 * .           .           .           .
+		 * .           .           .           . Unmap VMA B, but
+		 * .           .           .           . anon_vma is left
+		 * .           .           .           . around.
+		 * .           .           .           .
+		 * .           .           . |-------| .
+		 * .           .           . |   C   | .
+		 * .           .           . |-------| .
+		 * .           .           .           .
+		 * . Process 1 . Process 2 . Process 3 .
+		 */
+		munmap(ptr, 3 * page_size);
+
+		/* Indicate we're done so the child stops waiting for us. */
+		poll[1] = POLL_TASK_EXIT;
+
+		wait(NULL);
+		return;
+	}
+
+	/* This is process 3. */
+
+	/* Trigger root mremap(). */
+	poll[0] = POLL_TASK_RUN;
+	/* Wait for parents to finish. */
+	while (poll[0] == POLL_TASK_RUN)
+		;
+	while (poll[1] == POLL_TASK_RUN)
+		;
+
+	pid4 = fork();
+	ASSERT_NE(pid4, -1);
+
+	if (pid4 != 0) {
+		/* mremap() fails due to forked children. */
+		ptr2 = sys_mremap(ptr, page_size, page_size,
+				  MREMAP_MAYMOVE | MREMAP_FIXED |
+				  MREMAP_MUST_RELOCATE_ANON,
+				  &carveout[3 * page_size]);
+		err = errno;
+		ASSERT_EQ(ptr2, MAP_FAILED);
+		ASSERT_EQ(err, EFAULT);
+
+		/* We don't retrigger, so just indicate we're done. */
+		poll[2] = POLL_TASK_EXIT;
+
+		wait(NULL);
+		return;
+	}
+
+	/* This is process 4. */
+
+	/* Trigger root mremap(). */
+	poll[0] = POLL_TASK_RUN;
+	/* We unmapped VMA B, so nothing to trigger there. */
+	/* Wait for parents to finish. */
+	while (poll[0] == POLL_TASK_RUN)
+		;
+	while (poll[2] == POLL_TASK_RUN)
+		;
+
+	/* mremap() fails due to forked parents. */
+	ptr2 = sys_mremap(ptr, page_size, page_size,
+		MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
+		&carveout[3 * page_size]);
+	err = errno;
+	ASSERT_EQ(ptr2, MAP_FAILED);
+	ASSERT_EQ(err, EFAULT);
+	/* Kill waiting parent. */
+	poll[0] = POLL_TASK_EXIT;
+}
+
 TEST_HARNESS_MAIN
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
                   ` (10 preceding siblings ...)
  2025-06-09 13:26 ` [PATCH 11/11] tools/testing/selftests: add MREMAP_RELOCATE_ANON fork tests Lorenzo Stoakes
@ 2025-06-16 20:24 ` David Hildenbrand
  2025-06-16 20:41   ` David Hildenbrand
  2025-06-17 10:50   ` Lorenzo Stoakes
  2025-06-17  5:42 ` Lai, Yi
  2025-06-25 15:44 ` Lorenzo Stoakes
  13 siblings, 2 replies; 41+ messages in thread
From: David Hildenbrand @ 2025-06-16 20:24 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, Pedro Falcato, Rik van Riel, Harry Yoo, Zi Yan,
	Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Jakub Matena,
	Wei Yang, Barry Song, linux-mm, linux-kernel

Hi Lorenzo,

as discussed offline, there is a lot going on and this is rather ... a
lot of code+complexity for something that is more of a corner case. :)

Corner-case as in: only select user space will benefit from this, which 
is really a shame.

After your presentation at LSF/MM, I thought about this further, and I 
was wondering whether:

(a) We could make this semi-automatic, avoiding flags.

(b) We could simplify further by limiting it to the common+easy cases
first.

I think you already did (b) to some degree as part of this non-RFC, which
is great.


So before digging into the details, let's discuss the high level problem 
briefly.

I think there are three parts to it:

(1) Detecting whether it is safe to adjust the folio->index (small
     folios)

(2) Performance implications of doing so

(3) Detecting whether it is safe to adjust the folio->index (large PTE-
     mapped  folios)


Regarding (1), if we simply track whether a folio was ever used for 
COW-sharing, it would be very easy: and not only for present folios, but 
for any anon folios that are referenced by swap/migration entries. 
Skimming over patch #1, I think you apply a similar logic, which is good.

Regarding (2), it would apply when we mremap() anon VMAs and they happen 
to reside next to other anon VMAs. Which workloads are we concerned 
about harming by implementing this optimization? I recall that the most 
common use case for mremap() is actually for file mappings, but I might 
be wrong. In any case, we could just have a different way to enable this 
optimization than for each and every mremap() invocation in a process.

Regarding (3), if we were to split large folios that cross VMA 
boundaries during mremap(), it would be simpler.

How is it handled in this series if a large folio crosses VMA
boundaries? (a) try splitting, or (b) fail (not transparent to the user :( ).


> This also creates a difference in behaviour, often surprising to users,
> between mappings which are faulted and those which are not - as for the
> latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.
> 
> This is problematic firstly because this proliferates kernel allocations
> that are pure memory pressure - unreclaimable and unmovable -
> i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.
> 
> Secondly, mremap() exhibits an implicit uAPI in that it does not permit
> remaps which span multiple VMAs (though it does permit remaps that
> constitute a part of a single VMA).

If I mremap() to create a hole and mremap() it back, I would assume to 
automatically get the hole closed again, without special flags. Well, we 
both know this is not the case :)
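
Something like this quick sketch illustrates it (my own example, untested,
error handling omitted; assumes _GNU_SOURCE for the mremap() prototype) -
after the second mremap() the range is contiguous again, but it has not
merged back into a single VMA:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t psz = getpagesize();

	/* One 3-page anonymous mapping, faulted in. */
	char *p = mmap(NULL, 3 * psz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(p, 'x', 3 * psz);

	/* Reserve a destination and move the middle page there -> hole. */
	char *dst = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	mremap(p + psz, psz, psz, MREMAP_MAYMOVE | MREMAP_FIXED, dst);

	/* ... and move it back, closing the hole. */
	mremap(dst, psz, psz, MREMAP_MAYMOVE | MREMAP_FIXED, p + psz);

	/*
	 * Inspect /proc/self/maps here: the range is mapped contiguously
	 * again, yet it is still split into multiple VMAs.
	 */
	pause();
	return 0;
}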

> 
> This means that a user must concern themselves with whether merges succeed
> or not should they wish to use mremap() in such a way which causes multiple
> mremap() calls to be performed upon mappings.

Right.

> 
> This series provides users with an option to accept the overhead of
> actually updating the VMA and underlying folios via the
> MREMAP_RELOCATE_ANON flag.

Okay. I wish we could avoid this flag ...

> 
> If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
> the mremap() succeeding, then no attempt is made at relocation of folios as
> this is not required.

Makes sense. This is the existing behavior then.

> 
> Even if no merge is possible upon moving of the region, vma->vm_pgoff and
> folio->index fields are appropriately updated in order that subsequent
> mremap() or mprotect() calls will succeed in merging.

By looking at the surrounding VMAs, or simply by always trying to keep
folio->index corresponding to the address in the VMA (just as if
mremap() had never happened, I assume)?

> 
> This flag falls back to the ordinary means of mremap() should the operation
> not be feasible. It also transparently undoes the operation, carefully
> holding rmap locks such that no racing rmap operation encounters incorrect
> or missing VMAs.

I absolutely dislike this undo operation, really. :(

I hope we can find a way to just detect early whether this optimization 
would work.

Which are the exact error cases you can run into for un-doing?

I assume:

(a) cow-shared anon folio (can detect early)

(b) large folios crossing VMAs (TBD)

(c) KSM folios? Probably we could move them, I *think* we would have to 
update the ksm_rmap_item. Alternatively, we could indicate if a VMA had 
any KSM folios and give up early in the first version.

(d) GUP pins: I think we could allow that ... folio_maybe_dma_pinned() 
is racy either way (GUP-fast!). To deal with GUP-fast we would have to 
play different games ...

Anything else?

> 
> In addition, the MREMAP_MUST_RELOCATE_ANON flag is supplied in case the
> user needs to know whether or not the operation succeeded - this flag is
> identical to MREMAP_RELOCATE_ANON, only if the operation cannot succeed,
> the mremap() fails with -EFAULT.

How would an APP deal with these errors? Do you have a user in mind that 
could do something sensible based on this error?

I'm having a hard time imagining that :)
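
The best I can come up with is something like the sketch below (purely
illustrative; it uses the raw sys_mremap() helper this series adds to the
selftests, since glibc filters unknown flags, and assumes <errno.h> and
<sys/mman.h>): treat -EFAULT as "merge-friendly move not possible" and
fall back to an ordinary move, accepting the fragmentation.

/*
 * Hypothetical caller-side fallback (sketch): try the merge-friendly
 * move first; if the kernel cannot relocate the anon folios (-EFAULT),
 * do an ordinary move and account for the extra VMA fragmentation.
 */
static void *move_region(void *old, size_t len, void *fixed_dst)
{
	void *p = sys_mremap(old, len, len,
			     MREMAP_MAYMOVE | MREMAP_FIXED |
			     MREMAP_MUST_RELOCATE_ANON, fixed_dst);

	if (p != MAP_FAILED)
		return p;		/* moved and stayed mergeable */
	if (errno != EFAULT)
		return MAP_FAILED;	/* genuine failure */

	/* Relocation not possible (e.g. forked anon folios). */
	return sys_mremap(old, len, len,
			  MREMAP_MAYMOVE | MREMAP_FIXED, fixed_dst);
}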

> 
> Note that no-op mremap() operations (such as an unpopulated range, or a
> merge that would trivially succeed already) will succeed under
> MREMAP_MUST_RELOCATE_ANON.
> 
> mremap() already walks page tables, so it isn't an order of magntitude
> increase in workload, but constitutes the need to walk to page table leaf
> level and manipulate folios.

Only for anon VMAs, though. Do you have some numbers on how bad it is? I
mean, mremap() is already a pretty invasive/expensive operation ... :)
... which is why people started using uffdio_move instead, to avoid the
heavy-weight locks.

> 
> The operations all succeed under THP and in general are compatible with
> underlying large folios of any size. In fact, the larger the folio, the
> more efficient the operation is.

Yes.

> 
> Performance testing indicate that time taken using MREMAP_RELOCATE_ANON is
> on the same order of magnitude of ordinary mremap() operations, with both
> exhibiting time to the proportion of the mapping which is populated.
> 
> Of course, mremap() operations that are entirely aligned are significantly
> faster as they need only move a VMA and a smaller number of higher order
> page tables, but this is unavoidable.
> 
> Previous efforts in this area
> =============================
> 
> An approach addressing this issue was previously suggested by Jakub Matena
> in a series posted a few years ago in [0] (and discussed in a masters
> thesis).
> 
> However this was a more general effort which attempted to always make
> anonymous mappings more mergeable, and therefore was not quite ready for
> the upstream limelight. In addition, large folio work which has occurred
> since requires us to carefully consider and account for this.
> 
> This series is more conservative and targeted (one must specific a flag to
> get this behaviour) and additionally goes to great efforts to handle large
> folios and account all of the nitty gritty locking concerns that might
> arise in current kernel code.
> 
> Thanks goes out to Jakub for his efforts however, and hopefully this effort
> to take a slightly different approach to the same problem is pleasing to
> him regardless :)
> 
> [0]:https://lore.kernel.org/all/20220311174602.288010-1-matenajakub@gmail.com/
> 
> Use-cases
> =========
> 
> * ZGC is a concurrent GC shipped with OpenJDK. A prototype is being worked
>    upon which makes use of extensive mremap() operations to perform
>    defragmentation of objects, taking advantage of the plentiful available
>    virtual address space in a 64-bit system.
> 
>    In instances where one VMA is faulted in and another not, merging is not
>    possible, which leads to significant, unreclaimable, kernel metadata
>    overhead and contention on the vm.max_map_count limit.
> 
>    This series eliminates the issue entirely.
> * It was indicated that Android similarly moves memory around and
>    encounters the very same issues as ZGC.

Isn't Android using uffdio_move?

> * SUSE indicate they have encountered similar issues as pertains to an
>    internal client.
> 
> Past approaches
> ===============
> 
> In discussions at LSF/MM/BPF It was suggested that we could make this an
> madvise() operation, however at this point it will be too late to correctly
> perform the merge, requiring an unmap/remap which would be egregious.
> 
> It was further suggested that we simply defer the operation to the point at
> which an mremap() is attempted on multiple immediately adjacent VMAs (that
> is - to allow VMA fragmentation up until the point where it might cause
> perceptible issues with uAPI).
> 
> This is problematic in that in the first instance - you accrue
> fragmentation, and only if you were to try to move the fragmented objects
> again would you resolve it.
> 
> Additionally you would not be able to handle the mprotect() case, and you'd
> have the same issue as the madvise() approach in that you'd need to
> essentially re-map each VMA.
> 
> Additionally it would become non-trivial to correctly merge the VMAs - if
> there were more than 3, we would need to invent a new merging mechanism
> specifically for this, hold locks carefully over each to avoid them
> disappearing from beneath us and introduce a great deal of non-optional
> complexity.
> 
> While imperfect, the mremap flag approach seems the least invasive most
> workable solution (until further rework of the anon_vma mechanism can be
> achieved!)

Well, at that point we already have these new flags ... :(

> 
>   include/linux/rmap.h                          |    4 +
>   include/uapi/linux/mman.h                     |    8 +-
>   mm/internal.h                                 |    1 +
>   mm/mremap.c                                   |  719 ++++++-
>   mm/vma.c                                      |   77 +-
>   mm/vma.h                                      |   36 +-

~ +40% LOC in mm/mremap.c :(

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-16 20:24 ` [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON David Hildenbrand
@ 2025-06-16 20:41   ` David Hildenbrand
  2025-06-17  8:34     ` Pedro Falcato
  2025-06-17 10:50   ` Lorenzo Stoakes
  1 sibling, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2025-06-16 20:41 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, Pedro Falcato, Rik van Riel, Harry Yoo, Zi Yan,
	Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Jakub Matena,
	Wei Yang, Barry Song, linux-mm, linux-kernel

On 16.06.25 22:24, David Hildenbrand wrote:
> Hi Lorenzo,
> 
> as discussed offline, there is a lot going on an this is rather ... a
> lot of code+complexity for something that is more a corner cases. :)
> 
> Corner-case as in: only select user space will benefit from this, which
> is really a shame.
> 
> After your presentation at LSF/MM, I thought about this further, and I
> was wondering whether:
> 
> (a) We cannot make this semi-automatic, avoiding flags.
> 
> (b) We cannot simplify further by limiting it to the common+easy cases
> first.
> 
> I think you already to some degree did b) as part of this non-RFC, which
> is great.
> 
> 
> So before digging into the details, let's discuss the high level problem
> briefly.
> 
> I think there are three parts to it:
> 
> (1) Detecting whether it is safe to adjust the folio->index (small
>       folios)
> 
> (2) Performance implications of doing so
> 
> (3) Detecting whether it is safe to adjust the folio->index (large PTE-
>       mapped  folios)
> 
> 
> Regarding (1), if we simply track whether a folio was ever used for
> COW-sharing, it would be very easy: and not only for present folios, but
> for any anon folios that are referenced by swap/migration entries.
> Skimming over patch #1, I think you apply a similar logic, which is good.
> 
> Regarding (2), it would apply when we mremap() anon VMAs and they happen
> to reside next to other anon VMAs. Which workloads are we concerned
> about harming by implementing this optimization? I recall that the most
> common use case for mremap() is actually for file mappings, but I might
> be wrong. In any case, we could just have a different way to enable this
> optimization than for each and every mremap() invocation in a process.
> 
> Regarding (3), if we were to split large folios that cross VMA
> boundaries during mremap(), it would be simpler.
> 
> How is it handled in this series if we large folio crosses VMA
> boundaries? (a) try splitting or (b) fail (not transparent to the user :( ).
> 
> 
>> This also creates a difference in behaviour, often surprising to users,
>> between mappings which are faulted and those which are not - as for the
>> latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.
>>
>> This is problematic firstly because this proliferates kernel allocations
>> that are pure memory pressure - unreclaimable and unmovable -
>> i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.
>   > > Secondly, mremap() exhibits an implicit uAPI in that it does not permit
>> remaps which span multiple VMAs (though it does permit remaps that
>> constitute a part of a single VMA).
> 
> If I mremap() to create a hole and mremap() it back, I would assume to
> automatically get the hole closed again, without special flags. Well, we
> both know this is not the case :)
> 
>   > > This means that a user must concern themselves with whether merges
> succeed
>> or not should they wish to use mremap() in such a way which causes multiple
>> mremap() calls to be performed upon mappings.
> 
> Right.
> 
>>
>> This series provides users with an option to accept the overhead of
>> actually updating the VMA and underlying folios via the
>> MREMAP_RELOCATE_ANON flag.
> 
> Okay. I wish we could avoid this flag ...
> 
>>
>> If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
>> the mremap() succeeding, then no attempt is made at relocation of folios as
>> this is not required.
> 
> Makes sense. This is the existing behavior then.
> 
>>
>> Even if no merge is possible upon moving of the region, vma->vm_pgoff and
>> folio->index fields are appropriately updated in order that subsequent
>> mremap() or mprotect() calls will succeed in merging.
> 
> By looking at the surrounding VMAs or simply by trying to always keep
> the folio->index to correspond to the address in the VMA? (just if
> mremap() never happened, I assume?)
> 
>>
>> This flag falls back to the ordinary means of mremap() should the operation
>> not be feasible. It also transparently undoes the operation, carefully
>> holding rmap locks such that no racing rmap operation encounters incorrect
>> or missing VMAs.
> 
> I absolutely dislike this undo operation, really. :(
> 
> I hope we can find a way to just detect early whether this optimization
> would work.
> 
> Which are the exact error cases you can run into for un-doing?
> 
> I assume:
> 
> (a) cow-shared anon folio (can detect early)
> 
> (b) large folios crossing VMAs (TBD)
> 
> (c) KSM folios? Probably we could move them, I *think* we would have to
> update the ksm_rmap_item. Alternatively, we could indicate if a VMA had
> any KSM folios and give up early in the first version.

Looking at patch #1, I can see that we treat KSM folios as "success".

I would have thought we would have to update the corresponding 
"ksm_rmap_item" ... somehow, to keep the rmap working.

I know that Wei Yang (already on cc) is working on selftests, which I am 
yet to review, but he doesn't cover mremap() yet.


Looking at rmap_walk_ksm(), I am left a bit confused.

We walk all entries in the stable tree (ksm_rmap_item), looking in the 
anon_vma interval tree for the entry that corresponds to 
ksm_rmap_item->address.

	addr = rmap_item->address & PAGE_MASK;

	if (addr < vma->vm_start || addr >= vma->vm_end)
		continue;

So I would assume that already when we mremap() ... we are *already*
breaking KSM rmap walkers? :) Or there is some magic somewhere that I am
missing.

A KSM mremap test case for rmap would be nice ;)
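
Something along these lines, maybe (very rough sketch, untested; it assumes
ksmd is running via /sys/kernel/mm/ksm/run and that triggering compaction is
a good-enough way to force an rmap walk of the KSM page - a real selftest
would poll pages_shared and verify via pagemap rather than sleep and guess):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t psz = getpagesize();
	char *a, *b, *dst, *moved;
	FILE *f;

	/* Two anon pages with identical contents - KSM merge candidates. */
	a = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	b = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(a, 'k', psz);
	memset(b, 'k', psz);
	madvise(a, psz, MADV_MERGEABLE);
	madvise(b, psz, MADV_MERGEABLE);

	/* Give ksmd time to merge the pages (a real test would poll). */
	sleep(5);

	/* Move one mapping so ksm_rmap_item->address no longer matches. */
	dst = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	moved = mremap(a, psz, psz, MREMAP_MAYMOVE | MREMAP_FIXED, dst);

	/* Migrating the KSM page via compaction exercises rmap_walk_ksm(). */
	f = fopen("/proc/sys/vm/compact_memory", "w");
	if (f) {
		fputs("1", f);
		fclose(f);
	}

	/*
	 * Sanity check only; a broken rmap walk would show up via pagemap
	 * or a kernel splat rather than here.
	 */
	return (moved[0] == 'k' && b[0] == 'k') ? 0 : 1;
}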

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-09 13:26 ` [PATCH 01/11] " Lorenzo Stoakes
@ 2025-06-16 20:58   ` David Hildenbrand
  2025-06-17  6:37     ` Harry Yoo
  2025-06-17 10:07     ` Lorenzo Stoakes
  2025-06-17 11:15   ` Harry Yoo
  2025-06-17 20:09   ` Lorenzo Stoakes
  2 siblings, 2 replies; 41+ messages in thread
From: David Hildenbrand @ 2025-06-16 20:58 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, Pedro Falcato, Rik van Riel, Harry Yoo, Zi Yan,
	Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Jakub Matena,
	Wei Yang, Barry Song, linux-mm, linux-kernel

On 09.06.25 15:26, Lorenzo Stoakes wrote:
> When mremap() moves a mapping around in memory, it goes to great lengths to
> avoid having to walk page tables as this is expensive and
> time-consuming.
> 
> Rather, if the VMA was faulted (that is vma->anon_vma != NULL), the virtual
> page offset stored in the VMA at vma->vm_pgoff will remain the same, as
> well all the folio indexes pointed at the associated anon_vma object.
> 
> This means the VMA and page tables can simply be moved and this affects the
> change (and if we can move page tables at a higher page table level, this
> is even faster).
> 
> While this is efficient, it does lead to big problems with VMA merging - in
> essence it causes faulted anonymous VMAs to not be mergeable under many
> circumstances once moved.
> 
> This is limiting and leads to both a proliferation of unreclaimable,
> unmovable kernel metadata (VMAs, anon_vma's, anon_vma_chain's) and has an
> impact on further use of mremap(), which has a requirement that the VMA
> moved (which can also be a partial range within a VMA) may span only a
> single VMA.
> 
> This makes the mergeability or not of VMAs in effect a uAPI concern.
> 
> In some use cases, users may wish to accept the overhead of actually going
> to the trouble of updating VMAs and folios to affect mremap() moves. Let's
> provide them with the choice.
> 
> This patch add a new MREMAP_RELOCATE_ANON flag to do just that, which
> attempts to perform such an operation. If it is unable to do so, it cleanly
> falls back to the usual method.
> 
> It carefully takes the rmap locks such that at no time will a racing rmap
> user encounter incorrect or missing VMAs.
> 
> It is also designed to interact cleanly with the existing mremap() error
> fallback mechanism (inverting the remap should the page table move fail).
> 
> Also, if we could merge cleanly without such a change, we do so, avoiding
> the overhead of the operation if it is not required.
> 
> In the instance that no merge may occur when the move is performed, we
> still perform the folio and VMA updates to ensure that future mremap() or
> mprotect() calls will result in merges.
> 
> In this implementation, we simply give up if we encounter large folios. A
> subsequent commit will extend the functionality to allow for these cases.
> 
> We restrict this flag to purely anonymous memory only.
> 
> we separate out the vma_had_uncowed_parents() helper function for checking
> in should_relocate_anon() and introduce a new function
> vma_maybe_has_shared_anon_folios() which combines a check against this and
> any forked child anon_vma's.
> 
> We carefully check for pinned folios in case a caller who holds a pin might
> make assumptions about index, mapping fields which we are about to
> manipulate.

Some quick feedback - I did not yet digest everything.

[...]

> +/*
> + * If the folio mapped at the specified pte entry can have its index and mapping
> + * relocated, then do so.
> + *
> + * Returns the number of pages we have traversed, or 0 if the operation failed.
> + */
> +static unsigned long relocate_anon_pte(struct pagetable_move_control *pmc,
> +		struct pte_state *state, bool undo)
> +{
> +	struct folio *folio;
> +	struct vm_area_struct *old, *new;
> +	pgoff_t new_index;
> +	pte_t pte;
> +	unsigned long ret = 1;
> +	unsigned long old_addr = state->old_addr;
> +	unsigned long new_addr = state->new_addr;
> +
> +	old = pmc->old;
> +	new = pmc->new;
> +
> +	pte = ptep_get(state->ptep);
> +
> +	/* Ensure we have truly got an anon folio. */
> +	folio = vm_normal_folio(old, old_addr, pte);
> +	if (!folio)
> +		return ret;
> +
> +	folio_lock(folio);
> +
> +	/* No-op. */
> +	if (!folio_test_anon(folio) || folio_test_ksm(folio))
> +		goto out;
> +

So these cases are all "pass".

> +	/*
> +	 * This should never be the case as we have already checked to ensure
> +	 * that the anon_vma is not forked, and we have just asserted that it is
> +	 * anonymous.
> +	 */
> +	if (WARN_ON_ONCE(folio_maybe_mapped_shared(folio)))
> +		goto out;

Good, a warning - so we should be able to handle that early.

> +	/* The above check should imply these. */
> +	VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
> +	VM_WARN_ON_ONCE(!PageAnonExclusive(folio_page(folio, 0)));

This can trigger in one nasty case, where we can lose the PAE bit during 
swapin (refault from the swapcache while the folio is under writeback, 
and the device does not allow for modifying the data while under writeback).

> +
> +	/*
> +	 * A pinned folio implies that it will be used for a duration longer
> +	 * than that over which the mmap_lock is held, meaning that another part
> +	 * of the kernel may be making use of this folio.
> +	 *
> +	 * Since we are about to manipulate index & mapping fields, we cannot
> +	 * safely proceed because whatever has pinned this folio may then
> +	 * incorrectly assume these do not change.
> +	 */
> +	if (folio_maybe_dma_pinned(folio))
> +		goto out;

As discussed, this can race with GUP-fast. So *maybe* we can just allow
for moving these.

(after all we still have ordinary GUP that would also not be covered by 
this check)

> +
> +	/*
> +	 * This should not happen as we explicitly disallow this, but check
> +	 * anyway.
> +	 */
> +	if (folio_test_large(folio)) {
> +		ret = 0;
> +		goto out;
> +	}

That is the only real problem for rollback so far, I assume.

> +
> +	if (!undo)
> +		new_index = linear_page_index(new, new_addr);
> +	else
> +		new_index = linear_page_index(old, old_addr);
> +
> +	/*
> +	 * The PTL should keep us safe from unmapping, and the fact the folio is
> +	 * a PTE keeps the folio referenced.
> +	 *
> +	 * The mmap/VMA locks should keep us safe from fork and other processes.
> +	 *
> +	 * The rmap locks should keep us safe from anything happening to the
> +	 * VMA/anon_vma.
> +	 *
> +	 * The folio lock should keep us safe from reclaim, migration, etc.
> +	 */
> +	folio_move_anon_rmap(folio, undo ? old : new);
> +	WRITE_ONCE(folio->index, new_index);
> +
> +out:
> +	folio_unlock(folio);
> +	return ret;
> +}
> +
> +static bool pte_done(struct pte_state *state)
> +{
> +	return state->old_addr >= state->old_end;
> +}
> +
> +static void pte_next(struct pte_state *state, unsigned long nr_pages)
> +{
> +	state->old_addr += nr_pages * PAGE_SIZE;
> +	state->new_addr += nr_pages * PAGE_SIZE;
> +	state->ptep += nr_pages;
> +}
> +
> +static bool relocate_anon_ptes(struct pagetable_move_control *pmc,
> +		unsigned long extent, pmd_t *pmdp, bool undo)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct pte_state state = {
> +		.old_addr = pmc->old_addr,
> +		.new_addr = pmc->new_addr,
> +		.old_end = pmc->old_addr + extent,
> +	};
> +	pte_t *ptep_start;
> +	bool ret;
> +	unsigned long nr_pages;
> +
> +	ptep_start = pte_offset_map_lock(mm, pmdp, pmc->old_addr, &state.ptl);
> +	/*
> +	 * We prevent faults with mmap write lock, hold the rmap lock and should
> +	 * not fail to obtain this lock. Just give up if we can't.
> +	 */
> +	if (!ptep_start)
> +		return false;
> +
> +	state.ptep = ptep_start;
> +	for (; !pte_done(&state); pte_next(&state, nr_pages)) {
> +		pte_t pte = ptep_get(state.ptep);
> +
> +		if (pte_none(pte) || !pte_present(pte)) {
> +			nr_pages = 1;

What if we have

(a) A migration entry (possibly we might fail migration and simply remap 
the original folio)

(b) A swap entry with a folio in the swapcache that we can refault.

I don't think we can simply skip these ...

> +			continue;
> +		}
> +
> +		nr_pages = relocate_anon_pte(pmc, &state, undo);
> +		if (!nr_pages) {
> +			ret = false;
> +			goto out;
> +		}
> +	}
> +
> +	ret = true;
> +out:
> +	pte_unmap_unlock(ptep_start, state.ptl);
> +	return ret;
> +}
> +
> +static bool __relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
> +{
> +	pud_t *pudp;
> +	pmd_t *pmdp;
> +	unsigned long extent;
> +	struct mm_struct *mm = current->mm;
> +
> +	if (!pmc->len_in)
> +		return true;
> +
> +	for (; !pmc_done(pmc); pmc_next(pmc, extent)) {
> +		pmd_t pmd;
> +		pud_t pud;
> +
> +		extent = get_extent(NORMAL_PUD, pmc);
> +
> +		pudp = get_old_pud(mm, pmc->old_addr);
> +		if (!pudp)
> +			continue;
> +		pud = pudp_get(pudp);
> +
> +		if (pud_trans_huge(pud) || pud_devmap(pud))
> +			return false;

We don't support PUD-sized THP, so why do we have to fail here?

> +
> +		extent = get_extent(NORMAL_PMD, pmc);
> +		pmdp = get_old_pmd(mm, pmc->old_addr);
> +		if (!pmdp)
> +			continue;
> +		pmd = pmdp_get(pmdp);
> +
> +		if (is_swap_pmd(pmd) || pmd_trans_huge(pmd) ||
> +		    pmd_devmap(pmd))
> +			return false;

Okay, this case could likely be handled later (present anon folio or 
migration entry; everything else, we can skip).

> +
> +		if (pmd_none(pmd))
> +			continue;
> +
> +		if (!relocate_anon_ptes(pmc, extent, pmdp, undo))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static bool relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
> +{
> +	unsigned long old_addr = pmc->old_addr;
> +	unsigned long new_addr = pmc->new_addr;
> +	bool ret;
> +
> +	ret = __relocate_anon_folios(pmc, undo);
> +
> +	/* Reset state ready for retry. */
> +	pmc->old_addr = old_addr;
> +	pmc->new_addr = new_addr;
> +
> +	return ret;
> +}
> +
>   unsigned long move_page_tables(struct pagetable_move_control *pmc)
>   {
>   	unsigned long extent;
> @@ -1134,6 +1380,67 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
>   	}
>   }
>   
> +/*
> + * Should we attempt to relocate anonymous folios to the location that the VMA
> + * is being moved to by updating index and mapping fields accordingly?
> + */
> +static bool should_relocate_anon(struct vma_remap_struct *vrm,
> +	struct pagetable_move_control *pmc)
> +{
> +	struct vm_area_struct *old = vrm->vma;
> +
> +	/* Currently we only do this if requested. */
> +	if (!(vrm->flags & MREMAP_RELOCATE_ANON))
> +		return false;
> +
> +	/* We can't deal with special or hugetlb mappings. */
> +	if (old->vm_flags & (VM_SPECIAL | VM_HUGETLB))
> +		return false;
> +
> +	/* We only support anonymous mappings. */
> +	if (!vma_is_anonymous(old))
> +		return false;

I suspect extending this to MAP_PRIVATE file mappings should be easy?

[...]

>   	pmc.new = new_vma;
>   
> +	if (relocate_anon) {
> +		lock_new_anon_vma(new_vma);
> +		pmc.relocate_locked = new_vma;
> +
> +		if (!relocate_anon_folios(&pmc, /* undo= */false)) {
> +			unsigned long start = new_vma->vm_start;
> +			unsigned long size = new_vma->vm_end - start;
> +
> +			/* Undo if fails. */
> +			relocate_anon_folios(&pmc, /* undo= */true);

You'd assume this cannot fail, but I think it can: imagine concurrent 
GUP-fast ...

I really wish we can find a way to not require the fallback.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
                   ` (11 preceding siblings ...)
  2025-06-16 20:24 ` [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON David Hildenbrand
@ 2025-06-17  5:42 ` Lai, Yi
  2025-06-17  6:45   ` Harry Yoo
  2025-06-25 15:44 ` Lorenzo Stoakes
  13 siblings, 1 reply; 41+ messages in thread
From: Lai, Yi @ 2025-06-17  5:42 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, Liam R . Howlett,
	Suren Baghdasaryan, Matthew Wilcox, David Hildenbrand,
	Pedro Falcato, Rik van Riel, Harry Yoo, Zi Yan, Baolin Wang,
	Nico Pache, Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang,
	Barry Song, linux-mm, linux-kernel, yi1.lai

On Mon, Jun 09, 2025 at 02:26:34PM +0100, Lorenzo Stoakes wrote:

Hi Lorenzo Stoakes,

Greetings!

I used Syzkaller and found a "BUG: sleeping function called from invalid context" in __relocate_anon_folios in linux-next next-20250616.

After bisection and the first bad commit is:
"
aaf5c23bf6a4 mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
"

All detailed info can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250617_015846___relocate_anon_folios/bzImage_050f8ad7b58d9079455af171ac279c4b9b828c11
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/250617_015846___relocate_anon_folios/050f8ad7b58d9079455af171ac279c4b9b828c11_dmesg.log

"
[   51.309319] BUG: sleeping function called from invalid context at ./include/linux/pagemap.h:1112
[   51.309788] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 670, name: repro
[   51.310130] preempt_count: 1, expected: 0
[   51.310316] RCU nest depth: 1, expected: 0
[   51.310502] 4 locks held by repro/670:a
[   51.310675]  #0: ffff88801360abe0 (&mm->mmap_lock){++++}-{4:4}, at: __do_sys_mremap+0x42e/0x1620
[   51.311098]  #1: ffff888011a2f078 (&anon_vma->rwsem/1){+.+.}-{4:4}, at: copy_vma_and_data+0x541/0x1790
[   51.311526]  #2: ffffffff8725c7c0 (rcu_read_lock){....}-{1:3}, at: ___pte_offset_map+0x3f/0x6c0
[   51.311929]  #3: ffff888013e8adf8 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: __pte_offset_map_lock+0x1a2/0x3c0
[   51.312375] Preemption disabled at:
[   51.312377] [<ffffffff81e14222>] __pte_offset_map_lock+0x1a2/0x3c0
[   51.312828] CPU: 0 UID: 0 PID: 670 Comm: repro Not tainted 6.16.0-rc2-next-20250616-050f8ad7b58d #1 PREEMPT(voluntary)
[   51.312837] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 044
[   51.312846] Call Trace:
[   51.312850]  <TASK>
[   51.312853]  dump_stack_lvl+0x121/0x150
[   51.312878]  dump_stack+0x19/0x20
[   51.312884]  __might_resched+0x37b/0x5a0
[   51.312900]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[   51.312917]  __might_sleep+0xa3/0x170
[   51.312926]  ? vm_normal_folio+0x8c/0x170
[   51.312938]  __relocate_anon_folios+0xf97/0x2960
[   51.312953]  ? reacquire_held_locks+0xdd/0x1f0
[   51.312970]  ? __pfx___relocate_anon_folios+0x10/0x10
[   51.312982]  ? lock_set_class+0x17a/0x260
[   51.312994]  copy_vma_and_data+0x606/0x1790
[   51.313006]  ? percpu_counter_add_batch+0xd9/0x210
[   51.313028]  ? __pfx_copy_vma_and_data+0x10/0x10
[   51.313035]  ? vms_complete_munmap_vmas+0x525/0x810
[   51.313051]  ? __pfx_do_vmi_align_munmap+0x10/0x10
[   51.313064]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[   51.313074]  ? mtree_range_walk+0x728/0xb70
[   51.313089]  ? __lock_acquire+0x412/0x22a0
[   51.313098]  ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[   51.313107]  ? percpu_counter_add_batch+0xd9/0x210
[   51.313114]  ? debug_smp_processor_id+0x20/0x30
[   51.313131]  ? __this_cpu_preempt_check+0x21/0x30
[   51.313139]  ? lock_is_held_type+0xef/0x150
[   51.313149]  move_vma+0x689/0x1a60
[   51.313161]  ? __pfx_move_vma+0x10/0x10
[   51.313169]  ? cap_mmap_addr+0x58/0x140
[   51.313182]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[   51.313191]  ? security_mmap_addr+0x63/0x1b0
[   51.313203]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[   51.313212]  ? __get_unmapped_area+0x1a4/0x440
[   51.313223]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[   51.313232]  ? vrm_set_new_addr+0x21d/0x2b0
[   51.313241]  __do_sys_mremap+0xeb4/0x1620
[   51.313251]  ? __pfx___do_sys_mremap+0x10/0x10
[   51.313261]  ? __this_cpu_preempt_check+0x21/0x30
[   51.313276]  ? __this_cpu_preempt_check+0x21/0x30
[   51.313300]  __x64_sys_mremap+0xc7/0x150
[   51.313307]  ? syscall_trace_enter+0x14d/0x280
[   51.313320]  x64_sys_call+0x1933/0x2150
[   51.313332]  do_syscall_64+0x6d/0x2e0
[   51.313342]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   51.313349] RIP: 0033:0x7ff58583ee5d
[   51.313361] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 898
[   51.313367] RSP: 002b:00007ffd1f3a23e8 EFLAGS: 00000217 ORIG_RAX: 0000000000000019
[   51.313376] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff58583ee5d
[   51.313380] RDX: 0000000000002000 RSI: 0000000000002000 RDI: 000000002022e000
[   51.313384] RBP: 00007ffd1f3a23f0 R08: 000000002038d000 R09: 0000000000000001
[   51.313387] R10: 000000000000000f R11: 0000000000000217 R12: 00007ffd1f3a2508
[   51.313391] R13: 0000000000401126 R14: 0000000000403e08 R15: 00007ff585bab000
[   51.313403]  </TASK>
"

Hope this cound be insightful to you.

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem, or if you
already have a reproduction environment, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost

After logging in to the vm (virtual machine) successfully, you can transfer the
reproducer binary to the vm as below, and reproduce the problem in the vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           // x should be equal to or less than the number of CPUs your PC has

Fill the bzImage file into the above start3.sh to load the target kernel in the vm.


Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install 

> A long standing issue with VMA merging of anonymous VMAs is the requirement
> to maintain both vma->vm_pgoff and anon_vma compatibility between merge
> candidates.
> 
> For anonymous mappings, vma->vm_pgoff (and consequently, folio->index)
> refer to virtual page offsets, that is, va >> PAGE_SHIFT.
> 
> However upon mremap() of an anonymous mapping that has been faulted (that
> is, where vma->anon_vma != NULL), we would then need to walk page tables to
> be able to access let alone manipulate folio->index, mapping fields to
> permit an update of this virtual page offset.
> 
> Therefore in these instances, we do not do so, instead retaining the
> virtual page offset the VMA was first faulted in at as it's vma->vm_pgoff
> field, and of course consequently folio->index.
> 
> On each occasion we use linear_page_index() to determine the appropriate
> offset, cleverly offset the vma->vm_pgoff field by the difference between
> the virtual address and actual VMA start.
> 
> Doing so in effect fragments the virtual address space, meaning that we are
> no longer able to merge these VMAs with adjacent ones that could, at least
> theoretically, be merged.
> 
> This also creates a difference in behaviour, often surprising to users,
> between mappings which are faulted and those which are not - as for the
> latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.
> 
> This is problematic firstly because this proliferates kernel allocations
> that are pure memory pressure - unreclaimable and unmovable -
> i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.
> 
> Secondly, mremap() exhibits an implicit uAPI in that it does not permit
> remaps which span multiple VMAs (though it does permit remaps that
> constitute a part of a single VMA).
> 
> This means that a user must concern themselves with whether merges succeed
> or not should they wish to use mremap() in such a way which causes multiple
> mremap() calls to be performed upon mappings.
> 
> This series provides users with an option to accept the overhead of
> actually updating the VMA and underlying folios via the
> MREMAP_RELOCATE_ANON flag.
> 
> If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
> the mremap() succeeding, then no attempt is made at relocation of folios as
> this is not required.
> 
> Even if no merge is possible upon moving of the region, vma->vm_pgoff and
> folio->index fields are appropriately updated in order that subsequent
> mremap() or mprotect() calls will succeed in merging.
> 
> This flag falls back to the ordinary means of mremap() should the operation
> not be feasible. It also transparently undoes the operation, carefully
> holding rmap locks such that no racing rmap operation encounters incorrect
> or missing VMAs.
> 
> In addition, the MREMAP_MUST_RELOCATE_ANON flag is supplied in case the
> user needs to know whether or not the operation succeeded - this flag is
> identical to MREMAP_RELOCATE_ANON, only if the operation cannot succeed,
> the mremap() fails with -EFAULT.
> 
> Note that no-op mremap() operations (such as an unpopulated range, or a
> merge that would trivially succeed already) will succeed under
> MREMAP_MUST_RELOCATE_ANON.
> 
> mremap() already walks page tables, so it isn't an order of magntitude
> increase in workload, but constitutes the need to walk to page table leaf
> level and manipulate folios.
> 
> The operations all succeed under THP and in general are compatible with
> underlying large folios of any size. In fact, the larger the folio, the
> more efficient the operation is.
> 
> Performance testing indicate that time taken using MREMAP_RELOCATE_ANON is
> on the same order of magnitude of ordinary mremap() operations, with both
> exhibiting time to the proportion of the mapping which is populated.
> 
> Of course, mremap() operations that are entirely aligned are significantly
> faster as they need only move a VMA and a smaller number of higher order
> page tables, but this is unavoidable.
> 
> Previous efforts in this area
> =============================
> 
> An approach addressing this issue was previously suggested by Jakub Matena
> in a series posted a few years ago in [0] (and discussed in a masters
> thesis).
> 
> However this was a more general effort which attempted to always make
> anonymous mappings more mergeable, and therefore was not quite ready for
> the upstream limelight. In addition, large folio work which has occurred
> since requires us to carefully consider and account for this.
> 
> This series is more conservative and targeted (one must specific a flag to
> get this behaviour) and additionally goes to great efforts to handle large
> folios and account all of the nitty gritty locking concerns that might
> arise in current kernel code.
> 
> Thanks goes out to Jakub for his efforts however, and hopefully this effort
> to take a slightly different approach to the same problem is pleasing to
> him regardless :)
> 
> [0]:https://lore.kernel.org/all/20220311174602.288010-1-matenajakub@gmail.com/
> 
> Use-cases
> =========
> 
> * ZGC is a concurrent GC shipped with OpenJDK. A prototype is being worked
>   upon which makes use of extensive mremap() operations to perform
>   defragmentation of objects, taking advantage of the plentiful available
>   virtual address space in a 64-bit system.
> 
>   In instances where one VMA is faulted in and another not, merging is not
>   possible, which leads to significant, unreclaimable, kernel metadata
>   overhead and contention on the vm.max_map_count limit.
> 
>   This series eliminates the issue entirely.
> * It was indicated that Android similarly moves memory around and
>   encounters the very same issues as ZGC.
> * SUSE indicate they have encountered similar issues as pertains to an
>   internal client.
> 
> Past approaches
> ===============
> 
> In discussions at LSF/MM/BPF It was suggested that we could make this an
> madvise() operation, however at this point it will be too late to correctly
> perform the merge, requiring an unmap/remap which would be egregious.
> 
> It was further suggested that we simply defer the operation to the point at
> which an mremap() is attempted on multiple immediately adjacent VMAs (that
> is - to allow VMA fragmentation up until the point where it might cause
> perceptible issues with uAPI).
> 
> This is problematic in that in the first instance - you accrue
> fragmentation, and only if you were to try to move the fragmented objects
> again would you resolve it.
> 
> Additionally you would not be able to handle the mprotect() case, and you'd
> have the same issue as the madvise() approach in that you'd need to
> essentially re-map each VMA.
> 
> Additionally it would become non-trivial to correctly merge the VMAs - if
> there were more than 3, we would need to invent a new merging mechanism
> specifically for this, hold locks carefully over each to avoid them
> disappearing from beneath us and introduce a great deal of non-optional
> complexity.
> 
> While imperfect, the mremap flag approach seems the least invasive most
> workable solution (until further rework of the anon_vma mechanism can be
> achieved!)
> 
> Testing
> =======
> 
> * Significantly expanded self-tests, all of which are passing.
> * Explicit testing of forked cases including anon_vma reuse, all passing
>   correctly.
> * Ran all self tests with MREMAP_RELOCATE_ANON forced on for all anonymous
>   mremap()'s.
> * Ran heavy workloads with MREMAP_RELOCATE_ANON forced on on real hardware
>   (kernel compilation, etc.)
> * Ran stress-ng --mremap 32 for an hour with MREMAP_RELOCATE_ANON forced on
>   on real hardware.
> 
> Series History
> ==============
> 
> Non-RFC:
> * Rebased on mm-new and fixed merge conflicts, re-confirmed building and
>   all tests passing.
> * Seems to have settled down with all feedback previously raised addressed,
>   so un-RFC'd to propose the series for mainline, timed for the start of
>   the 6.16 rc cycle (thus targeting 6.17).
> 
> RFC v3:
> * Rebased on and fixed conflicts against mm-new.
> * Removed invalid use of folio_test_large_maybe_mapped_shared() in
>   __relocate_large_folio() - this has since been removed and inlined (see
>   [0]) anyway but we should be using folio_maybe_mapped_shared() here at
>   any rate.
> * Moved unnecessary folio large, ksm checks in __relocate_large_folio() to
>   relocate_large_folio() - we already check this in relocate_anon_pte() so
>   this is duplicated in that case.
> * Added new tests explicitly checking that MREMAP_MUST_RELOCATE_ANON fails
>   for forked processes, both forked children with parents as indicated by
>   avc, and forked parents with children.
> * Added anon_vma_assert_locked() helper.
> * Removed vma_had_uncowed_children() as it was incorrectly implemented (it
>   didn't account for grandchildren and descendents being not being
>   self-parented), and replaced with a general
>   vma_maybe_has_shared_anon_folios() function which checks both parent and
>   child VMAs. Wei raised a concern in this area, this helps clarify and
>   correct.
> * Converted anon_vma vs. mmap lock check in
>   vma_maybe_has_shared_anon_folios() to be more sensible and to assume the
>   caller hold sufficient locks (checked with assert).
> * Added additional recipients based on recent MAINTAINERS changes.
> * Added missing reference to Jakub's efforts in this area a few years ago
>   to cover letter. Thanks Jakub!
> https://lore.kernel.org/all/cover.1746305604.git.lorenzo.stoakes@oracle.com/
> 
> RFC v2:
> * Added folio_mapcount() check on relocate anon to assert exclusively
>   mapped as per Jann.
> * Added check for anon_vma->num_children > nr_pages in
>   should_relocate_anon() as per Jann.
> * Separated out vma_had_uncowed_parents() into shared helper function and
>   added vma_had_uncowed_children() to implement the above.
> * Add comment clarifying why we do not require an rmap lock on the old VMA
>   due to fork requiring an mmap write lock which we hold.
> * Corrected error path on __anon_vma_prepare() in copy_vma() as per Jann.
> * Checked for folio pinning and abort if in place. We do so, because this
>   implies the folio is being used by the kernel for a time longer than the
>   time over which an mmap lock is held (which will not be held at the time
>   of us manipulating the folio, as we hold the mmap write lock). We are
>   manipulating mapping, index fields and being conservative (additionally
>   mirroring what UFFDIO_MOVE does), we cannot assume that whoever holds the
>   pin isn't somehow relying on these not being manipulated. As per David.
> * Propagated mapcount, maybe DMA pinned checks to large folio logic.
> * Added folio splitting - on second thoughts, it would be a bit silly to
>   simply disallow the request because of large folio misalignment, work
>   around this by splitting the folio in this instance.
> * Added very careful handling around rmap lock, making use of
>   folio_anon_vma(), to ensure we do not deadlock on anon_vma.
> * Prefer vm_normal_folio() to vm_normal_page() & page_folio().
> * Introduced has_shared_anon_vma() to de-duplicate shared anon_vma check.
> * Provided sys_mremap() helper in vm_util.[ch] to be shared among test
>   callers and de-duplicate. This must be a raw system call, as glibc will
>   otherwise filter the flags.
> * Expanded the mm CoW self-tests to explicitly test with
>   MREMAP_RELOCATE_ANON for partial THP pages. This is useful as it
>   exercises split_folio() code paths explicitly. Additionally some cases
>   cannot succeed, so we also exercise undo paths.
> * Added explicit lockdep handling to teach it that we are handling two
>   distinct anon_vma locks so it doesn't spuriously report a deadlock.
> * Updated anon_vma deadlock checks to check anon_vma->root. Shouldn't
>   strictly be necessary as we explicitly limit ourselves to unforked
>   anon_vma's, but it is more correct to do so, as this is where the lock is
>   located.
> * Expanded the split_huge_page_test.c test to also test using the
>   MREMAP_RELOCATE_ANON flag, this is useful as it exercises the undo path.
> https://lore.kernel.org/all/cover.1745307301.git.lorenzo.stoakes@oracle.com/
> 
> RFC v1:
> https://lore.kernel.org/all/cover.1742478846.git.lorenzo.stoakes@oracle.com/
> 
> Lorenzo Stoakes (11):
>   mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
>   mm/mremap: add MREMAP_MUST_RELOCATE_ANON
>   mm/mremap: add MREMAP[_MUST]_RELOCATE_ANON support for large folios
>   tools UAPI: Update copy of linux/mman.h from the kernel sources
>   tools/testing/selftests: add sys_mremap() helper to vm_util.h
>   tools/testing/selftests: add mremap() cases that merge normally
>   tools/testing/selftests: add MREMAP_RELOCATE_ANON merge test cases
>   tools/testing/selftests: expand mremap() tests for
>     MREMAP_RELOCATE_ANON
>   tools/testing/selftests: have CoW self test use MREMAP_RELOCATE_ANON
>   tools/testing/selftests: test relocate anon in split huge page test
>   tools/testing/selftests: add MREMAP_RELOCATE_ANON fork tests
> 
>  include/linux/rmap.h                          |    4 +
>  include/uapi/linux/mman.h                     |    8 +-
>  mm/internal.h                                 |    1 +
>  mm/mremap.c                                   |  719 ++++++-
>  mm/vma.c                                      |   77 +-
>  mm/vma.h                                      |   36 +-
>  tools/include/uapi/linux/mman.h               |    8 +-
>  tools/testing/selftests/mm/cow.c              |   23 +-
>  tools/testing/selftests/mm/merge.c            | 1690 ++++++++++++++++-
>  tools/testing/selftests/mm/mremap_test.c      |  262 ++-
>  .../selftests/mm/split_huge_page_test.c       |   25 +-
>  tools/testing/selftests/mm/vm_util.c          |    8 +
>  tools/testing/selftests/mm/vm_util.h          |    3 +
>  tools/testing/vma/vma.c                       |    5 +-
>  tools/testing/vma/vma_internal.h              |   38 +
>  15 files changed, 2732 insertions(+), 175 deletions(-)
> 
> --
> 2.49.0

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-16 20:58   ` David Hildenbrand
@ 2025-06-17  6:37     ` Harry Yoo
  2025-06-17  9:52       ` Lorenzo Stoakes
  2025-06-17 10:07     ` Lorenzo Stoakes
  1 sibling, 1 reply; 41+ messages in thread
From: Harry Yoo @ 2025-06-17  6:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Pedro Falcato, Rik van Riel, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel

On Mon, Jun 16, 2025 at 10:58:28PM +0200, David Hildenbrand wrote:
> On 09.06.25 15:26, Lorenzo Stoakes wrote:
> > When mremap() moves a mapping around in memory, it goes to great lengths to
> > avoid having to walk page tables as this is expensive and
> > time-consuming.
> > 
> > Rather, if the VMA was faulted (that is vma->anon_vma != NULL), the virtual
> > page offset stored in the VMA at vma->vm_pgoff will remain the same, as
> > will all the folio indexes pointed at the associated anon_vma object.
> > 
> > This means the VMA and page tables can simply be moved and this effects the
> > change (and if we can move page tables at a higher page table level, this
> > is even faster).
> > 
> > While this is efficient, it does lead to big problems with VMA merging - in
> > essence it causes faulted anonymous VMAs to not be mergeable under many
> > circumstances once moved.
> > 
> > This is limiting and leads to both a proliferation of unreclaimable,
> > unmovable kernel metadata (VMAs, anon_vma's, anon_vma_chain's) and has an
> > impact on further use of mremap(), which has a requirement that the VMA
> > moved (which can also be a partial range within a VMA) may span only a
> > single VMA.
> > 
> > This makes the mergeability or not of VMAs in effect a uAPI concern.
> > 
> > In some use cases, users may wish to accept the overhead of actually going
> > to the trouble of updating VMAs and folios to effect mremap() moves. Let's
> > provide them with the choice.
> > 
> > This patch adds a new MREMAP_RELOCATE_ANON flag to do just that, which
> > attempts to perform such an operation. If it is unable to do so, it cleanly
> > falls back to the usual method.
> > 
> > It carefully takes the rmap locks such that at no time will a racing rmap
> > user encounter incorrect or missing VMAs.
> > 
> > It is also designed to interact cleanly with the existing mremap() error
> > fallback mechanism (inverting the remap should the page table move fail).
> > 
> > Also, if we could merge cleanly without such a change, we do so, avoiding
> > the overhead of the operation if it is not required.
> > 
> > In the instance that no merge may occur when the move is performed, we
> > still perform the folio and VMA updates to ensure that future mremap() or
> > mprotect() calls will result in merges.
> > 
> > In this implementation, we simply give up if we encounter large folios. A
> > subsequent commit will extend the functionality to allow for these cases.
> > 
> > We restrict this flag to purely anonymous memory only.
> > 
> > We separate out the vma_had_uncowed_parents() helper function for checking
> > in should_relocate_anon() and introduce a new function
> > vma_maybe_has_shared_anon_folios() which combines a check against this and
> > any forked child anon_vma's.
> > 
> > We carefully check for pinned folios in case a caller who holds a pin might
> > make assumptions about index, mapping fields which we are about to
> > manipulate.
> 
> Some quick feedback, I did not yet digest everything.
> 
> > @@ -1134,6 +1380,67 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
> >   	pmc.new = new_vma;
> > +	if (relocate_anon) {
> > +		lock_new_anon_vma(new_vma);
> > +		pmc.relocate_locked = new_vma;
> > +
> > +		if (!relocate_anon_folios(&pmc, /* undo= */false)) {
> > +			unsigned long start = new_vma->vm_start;
> > +			unsigned long size = new_vma->vm_end - start;
> > +
> > +			/* Undo if fails. */
> > +			relocate_anon_folios(&pmc, /* undo= */true);
> 
> You'd assume this cannot fail, but I think it can: imagine concurrent
> GUP-fast ...

Oops, that sounds really bad.

> I really wish we could find a way to not require the fallback.

Maybe split the VMA at the point where it fails, instead of undo?
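
Something like the below, perhaps - only a rough sketch of the idea against
the hunk quoted above, where 'fail_addr' and split_new_vma() are made up
purely for illustration:

	if (!relocate_anon_folios(&pmc, /* undo= */false)) {
		/*
		 * Split new_vma at the first address that could not be
		 * relocated: the lower part keeps its updated
		 * vm_pgoff/folio->index (and so stays mergeable), while the
		 * upper part retains the old offsets - no undo walk needed.
		 */
		unsigned long fail_addr = pmc.fail_addr;	/* hypothetical */

		if (fail_addr == new_vma->vm_start ||
		    split_new_vma(&pmc, new_vma, fail_addr))	/* hypothetical */
			/* Nothing relocated, or the split failed: undo as before. */
			relocate_anon_folios(&pmc, /* undo= */true);
	}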

-- 
Cheers,
Harry / Hyeonggon

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-17  5:42 ` Lai, Yi
@ 2025-06-17  6:45   ` Harry Yoo
  2025-06-17  9:33     ` Lorenzo Stoakes
  0 siblings, 1 reply; 41+ messages in thread
From: Harry Yoo @ 2025-06-17  6:45 UTC (permalink / raw)
  To: Lai, Yi
  Cc: Lorenzo Stoakes, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	David Hildenbrand, Pedro Falcato, Rik van Riel, Zi Yan,
	Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Jakub Matena,
	Wei Yang, Barry Song, linux-mm, linux-kernel, yi1.lai

On Tue, Jun 17, 2025 at 01:42:20PM +0800, Lai, Yi wrote:
> On Mon, Jun 09, 2025 at 02:26:34PM +0100, Lorenzo Stoakes wrote:
> 
> Hi Lorenzo Stoakes,
> 
> Greetings!
> 
> I used Syzkaller and found that there is BUG: sleeping function called from invalid context in __relocate_anon_folios in linux-next next-20250616.
> 
> After bisection, the first bad commit is:
> "
> aaf5c23bf6a4 mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
> "
> 
> All detailed info can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250617_015846___relocate_anon_folios/bzImage_050f8ad7b58d9079455af171ac279c4b9b828c11
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/250617_015846___relocate_anon_folios/050f8ad7b58d9079455af171ac279c4b9b828c11_dmesg.log
> 
> "
> [   51.309319] BUG: sleeping function called from invalid context at ./include/linux/pagemap.h:1112
> [   51.309788] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 670, name: repro
> [   51.310130] preempt_count: 1, expected: 0
> [   51.310316] RCU nest depth: 1, expected: 0
> [   51.310502] 4 locks held by repro/670:
> [   51.310675]  #0: ffff88801360abe0 (&mm->mmap_lock){++++}-{4:4}, at: __do_sys_mremap+0x42e/0x1620
> [   51.311098]  #1: ffff888011a2f078 (&anon_vma->rwsem/1){+.+.}-{4:4}, at: copy_vma_and_data+0x541/0x1790
> [   51.311526]  #2: ffffffff8725c7c0 (rcu_read_lock){....}-{1:3}, at: ___pte_offset_map+0x3f/0x6c0
> [   51.311929]  #3: ffff888013e8adf8 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: __pte_offset_map_lock+0x1a2/0x3c0
> [   51.312375] Preemption disabled at:
> [   51.312377] [<ffffffff81e14222>] __pte_offset_map_lock+0x1a2/0x3c0
> [   51.312828] CPU: 0 UID: 0 PID: 670 Comm: repro Not tainted 6.16.0-rc2-next-20250616-050f8ad7b58d #1 PREEMPT(voluntary)
> [   51.312837] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 044
> [   51.312846] Call Trace:
> [   51.312850]  <TASK>
> [   51.312853]  dump_stack_lvl+0x121/0x150
> [   51.312878]  dump_stack+0x19/0x20
> [   51.312884]  __might_resched+0x37b/0x5a0
> [   51.312900]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> [   51.312917]  __might_sleep+0xa3/0x170
> [   51.312926]  ? vm_normal_folio+0x8c/0x170
> [   51.312938]  __relocate_anon_folios+0xf97/0x2960

Looks like it should call folio_trylock() instead of folio_lock()?
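
For context: folio_lock() can sleep - it is what trips the might_sleep() in
the trace above - while this walk runs under the PTE spinlock taken via
pte_offset_map_lock(), so only a non-sleeping attempt is safe there.
Roughly, as a sketch rather than the actual fix:

	/* PTE spinlock held - we must not sleep, so only try to lock: */
	if (!folio_trylock(folio))
		return false;	/* let the caller abort / fall back */
	/* ... update folio->index / folio->mapping as needed ... */
	folio_unlock(folio);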

-- 
Cheers,
Harry / Hyeonggon

> [   51.312953]  ? reacquire_held_locks+0xdd/0x1f0
> [   51.312970]  ? __pfx___relocate_anon_folios+0x10/0x10
> [   51.312982]  ? lock_set_class+0x17a/0x260
> [   51.312994]  copy_vma_and_data+0x606/0x1790
> [   51.313006]  ? percpu_counter_add_batch+0xd9/0x210
> [   51.313028]  ? __pfx_copy_vma_and_data+0x10/0x10
> [   51.313035]  ? vms_complete_munmap_vmas+0x525/0x810
> [   51.313051]  ? __pfx_do_vmi_align_munmap+0x10/0x10
> [   51.313064]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [   51.313074]  ? mtree_range_walk+0x728/0xb70
> [   51.313089]  ? __lock_acquire+0x412/0x22a0
> [   51.313098]  ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
> [   51.313107]  ? percpu_counter_add_batch+0xd9/0x210
> [   51.313114]  ? debug_smp_processor_id+0x20/0x30
> [   51.313131]  ? __this_cpu_preempt_check+0x21/0x30
> [   51.313139]  ? lock_is_held_type+0xef/0x150
> [   51.313149]  move_vma+0x689/0x1a60
> [   51.313161]  ? __pfx_move_vma+0x10/0x10
> [   51.313169]  ? cap_mmap_addr+0x58/0x140
> [   51.313182]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [   51.313191]  ? security_mmap_addr+0x63/0x1b0
> [   51.313203]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> [   51.313212]  ? __get_unmapped_area+0x1a4/0x440
> [   51.313223]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> [   51.313232]  ? vrm_set_new_addr+0x21d/0x2b0
> [   51.313241]  __do_sys_mremap+0xeb4/0x1620
> [   51.313251]  ? __pfx___do_sys_mremap+0x10/0x10
> [   51.313261]  ? __this_cpu_preempt_check+0x21/0x30
> [   51.313276]  ? __this_cpu_preempt_check+0x21/0x30
> [   51.313300]  __x64_sys_mremap+0xc7/0x150
> [   51.313307]  ? syscall_trace_enter+0x14d/0x280
> [   51.313320]  x64_sys_call+0x1933/0x2150
> [   51.313332]  do_syscall_64+0x6d/0x2e0
> [   51.313342]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   51.313349] RIP: 0033:0x7ff58583ee5d
> [   51.313361] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 898
> [   51.313367] RSP: 002b:00007ffd1f3a23e8 EFLAGS: 00000217 ORIG_RAX: 0000000000000019
> [   51.313376] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff58583ee5d
> [   51.313380] RDX: 0000000000002000 RSI: 0000000000002000 RDI: 000000002022e000
> [   51.313384] RBP: 00007ffd1f3a23f0 R08: 000000002038d000 R09: 0000000000000001
> [   51.313387] R10: 000000000000000f R11: 0000000000000217 R12: 00007ffd1f3a2508
> [   51.313391] R13: 0000000000401126 R14: 0000000000403e08 R15: 00007ff585bab000
> [   51.313403]  </TASK>
> "
> 
> Hope this could be insightful to you.
> 
> Regards,
> Yi Lai
> 
> ---
> 
> If you don't need the following environment to reproduce the problem or if you
> already have one reproduced environment, please ignore the following information.
> 
> How to reproduce:
> git clone https://gitlab.com/xupengfe/repro_vm_env.git
> cd repro_vm_env
> tar -xvf repro_vm_env.tar.gz
> cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
>   // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
>   // You could change the bzImage_xxx as you want
>   // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
> You can use the command below to log in; there is no password for root.
> ssh -p 10023 root@localhost
> 
> After logging in to the VM (virtual machine) successfully, you can transfer the
> reproducer binary to the VM as shown below, and reproduce the problem in the VM:
> gcc -pthread -o repro repro.c
> scp -P 10023 repro root@localhost:/root/
> 
> Get the bzImage for target kernel:
> Please use target kconfig and copy it to kernel_src/.config
> make olddefconfig
> make -jx bzImage           //x should equal or less than cpu num your pc has
> 
> Fill the bzImage file into above start3.sh to load the target kernel in vm.
> 
> 
> Tips:
> If you already have qemu-system-x86_64, please ignore below info.
> If you want to install qemu v7.1.0 version:
> git clone https://github.com/qemu/qemu.git
> cd qemu
> git checkout -f v7.1.0
> mkdir build
> cd build
> yum install -y ninja-build.x86_64
> yum -y install libslirp-devel.x86_64
> ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
> make
> make install 
> 
> > A long standing issue with VMA merging of anonymous VMAs is the requirement
> > to maintain both vma->vm_pgoff and anon_vma compatibility between merge
> > candidates.
> > 
> > For anonymous mappings, vma->vm_pgoff (and consequently, folio->index)
> > refer to virtual page offsets, that is, va >> PAGE_SHIFT.
> > 
> > However upon mremap() of an anonymous mapping that has been faulted (that
> > is, where vma->anon_vma != NULL), we would then need to walk page tables to
> > be able to access let alone manipulate folio->index, mapping fields to
> > permit an update of this virtual page offset.
> > 
> > Therefore in these instances, we do not do so, instead retaining the
> > virtual page offset the VMA was first faulted in at as it's vma->vm_pgoff
> > field, and of course consequently folio->index.
> > 
> > On each occasion we use linear_page_index() to determine the appropriate
> > offset, cleverly offset the vma->vm_pgoff field by the difference between
> > the virtual address and actual VMA start.
> > 
> > Doing so in effect fragments the virtual address space, meaning that we are
> > no longer able to merge these VMAs with adjacent ones that could, at least
> > theoretically, be merged.
> > 
> > This also creates a difference in behaviour, often surprising to users,
> > between mappings which are faulted and those which are not - as for the
> > latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.
> > 
> > This is problematic firstly because this proliferates kernel allocations
> > that are pure memory pressure - unreclaimable and unmovable -
> > i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.
> > 
> > Secondly, mremap() exhibits an implicit uAPI in that it does not permit
> > remaps which span multiple VMAs (though it does permit remaps that
> > constitute a part of a single VMA).
> > 
> > This means that a user must concern themselves with whether merges succeed
> > or not should they wish to use mremap() in such a way which causes multiple
> > mremap() calls to be performed upon mappings.
> > 
> > This series provides users with an option to accept the overhead of
> > actually updating the VMA and underlying folios via the
> > MREMAP_RELOCATE_ANON flag.
> > 
> > If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
> > the mremap() succeeding, then no attempt is made at relocation of folios as
> > this is not required.
> > 
> > Even if no merge is possible upon moving of the region, vma->vm_pgoff and
> > folio->index fields are appropriately updated in order that subsequent
> > mremap() or mprotect() calls will succeed in merging.
> > 
> > This flag falls back to the ordinary means of mremap() should the operation
> > not be feasible. It also transparently undoes the operation, carefully
> > holding rmap locks such that no racing rmap operation encounters incorrect
> > or missing VMAs.
> > 
> > In addition, the MREMAP_MUST_RELOCATE_ANON flag is supplied in case the
> > user needs to know whether or not the operation succeeded - this flag is
> > identical to MREMAP_RELOCATE_ANON, only if the operation cannot succeed,
> > the mremap() fails with -EFAULT.
> > 
> > Note that no-op mremap() operations (such as an unpopulated range, or a
> > merge that would trivially succeed already) will succeed under
> > MREMAP_MUST_RELOCATE_ANON.
> > 
> > mremap() already walks page tables, so it isn't an order of magnitude
> > increase in workload, but constitutes the need to walk to page table leaf
> > level and manipulate folios.
> > 
> > The operations all succeed under THP and in general are compatible with
> > underlying large folios of any size. In fact, the larger the folio, the
> > more efficient the operation is.
> > 
> > Performance testing indicates that the time taken using MREMAP_RELOCATE_ANON is
> > on the same order of magnitude as ordinary mremap() operations, with both
> > exhibiting time proportional to the fraction of the mapping which is populated.
> > 
> > Of course, mremap() operations that are entirely aligned are significantly
> > faster as they need only move a VMA and a smaller number of higher order
> > page tables, but this is unavoidable.
> > 
> > Previous efforts in this area
> > =============================
> > 
> > An approach addressing this issue was previously suggested by Jakub Matena
> > in a series posted a few years ago in [0] (and discussed in a masters
> > thesis).
> > 
> > However this was a more general effort which attempted to always make
> > anonymous mappings more mergeable, and therefore was not quite ready for
> > the upstream limelight. In addition, large folio work which has occurred
> > since requires us to carefully consider and account for this.
> > 
> > This series is more conservative and targeted (one must specify a flag to
> > get this behaviour) and additionally goes to great efforts to handle large
> > folios and account all of the nitty gritty locking concerns that might
> > arise in current kernel code.
> > 
> > Thanks goes out to Jakub for his efforts however, and hopefully this effort
> > to take a slightly different approach to the same problem is pleasing to
> > him regardless :)
> > 
> > [0]:https://lore.kernel.org/all/20220311174602.288010-1-matenajakub@gmail.com/
> > 
> > Use-cases
> > =========
> > 
> > * ZGC is a concurrent GC shipped with OpenJDK. A prototype is being worked
> >   upon which makes use of extensive mremap() operations to perform
> >   defragmentation of objects, taking advantage of the plentiful available
> >   virtual address space in a 64-bit system.
> > 
> >   In instances where one VMA is faulted in and another not, merging is not
> >   possible, which leads to significant, unreclaimable, kernel metadata
> >   overhead and contention on the vm.max_map_count limit.
> > 
> >   This series eliminates the issue entirely.
> > * It was indicated that Android similarly moves memory around and
> >   encounters the very same issues as ZGC.
> > * SUSE indicate they have encountered similar issues as pertains to an
> >   internal client.
> > 
> > Past approaches
> > ===============
> > 
> > In discussions at LSF/MM/BPF it was suggested that we could make this an
> > madvise() operation, however at this point it will be too late to correctly
> > perform the merge, requiring an unmap/remap which would be egregious.
> > 
> > It was further suggested that we simply defer the operation to the point at
> > which an mremap() is attempted on multiple immediately adjacent VMAs (that
> > is - to allow VMA fragmentation up until the point where it might cause
> > perceptible issues with uAPI).
> > 
> > This is problematic in that in the first instance - you accrue
> > fragmentation, and only if you were to try to move the fragmented objects
> > again would you resolve it.
> > 
> > Additionally you would not be able to handle the mprotect() case, and you'd
> > have the same issue as the madvise() approach in that you'd need to
> > essentially re-map each VMA.
> > 
> > Additionally it would become non-trivial to correctly merge the VMAs - if
> > there were more than 3, we would need to invent a new merging mechanism
> > specifically for this, hold locks carefully over each to avoid them
> > disappearing from beneath us and introduce a great deal of non-optional
> > complexity.
> > 
> > While imperfect, the mremap flag approach seems the least invasive, most
> > workable solution (until further rework of the anon_vma mechanism can be
> > achieved!)
> > 
> > Testing
> > =======
> > 
> > * Significantly expanded self-tests, all of which are passing.
> > * Explicit testing of forked cases including anon_vma reuse, all passing
> >   correctly.
> > * Ran all self tests with MREMAP_RELOCATE_ANON forced on for all anonymous
> >   mremap()'s.
> > * Ran heavy workloads with MREMAP_RELOCATE_ANON forced on on real hardware
> >   (kernel compilation, etc.)
> > * Ran stress-ng --mremap 32 for an hour with MREMAP_RELOCATE_ANON forced on
> >   on real hardware.
> > 
> > Series History
> > ==============
> > 
> > Non-RFC:
> > * Rebased on mm-new and fixed merge conflicts, re-confirmed building and
> >   all tests passing.
> > * Seems to have settled down with all feedback previously raised addressed,
> >   so un-RFC'd to propose the series for mainline, timed for the start of
> >   the 6.16 rc cycle (thus targeting 6.17).
> > 
> > RFC v3:
> > * Rebased on and fixed conflicts against mm-new.
> > * Removed invalid use of folio_test_large_maybe_mapped_shared() in
> >   __relocate_large_folio() - this has since been removed and inlined (see
> >   [0]) anyway but we should be using folio_maybe_mapped_shared() here at
> >   any rate.
> > * Moved unnecessary folio large, ksm checks in __relocate_large_folio() to
> >   relocate_large_folio() - we already check this in relocate_anon_pte() so
> >   this is duplicated in that case.
> > * Added new tests explicitly checking that MREMAP_MUST_RELOCATE_ANON fails
> >   for forked processes, both forked children with parents as indicated by
> >   avc, and forked parents with children.
> > * Added anon_vma_assert_locked() helper.
> > * Removed vma_had_uncowed_children() as it was incorrectly implemented (it
> >   didn't account for grandchildren and descendants not being
> >   self-parented), and replaced with a general
> >   vma_maybe_has_shared_anon_folios() function which checks both parent and
> >   child VMAs. Wei raised a concern in this area, this helps clarify and
> >   correct.
> > * Converted anon_vma vs. mmap lock check in
> >   vma_maybe_has_shared_anon_folios() to be more sensible and to assume the
> >   caller holds sufficient locks (checked with assert).
> > * Added additional recipients based on recent MAINTAINERS changes.
> > * Added missing reference to Jakub's efforts in this area a few years ago
> >   to cover letter. Thanks Jakub!
> > https://lore.kernel.org/all/cover.1746305604.git.lorenzo.stoakes@oracle.com/
> > 
> > RFC v2:
> > * Added folio_mapcount() check on relocate anon to assert exclusively
> >   mapped as per Jann.
> > * Added check for anon_vma->num_children > nr_pages in
> >   should_relocate_anon() as per Jann.
> > * Separated out vma_had_uncowed_parents() into shared helper function and
> >   added vma_had_uncowed_children() to implement the above.
> > * Add comment clarifying why we do not require an rmap lock on the old VMA
> >   due to fork requiring an mmap write lock which we hold.
> > * Corrected error path on __anon_vma_prepare() in copy_vma() as per Jann.
> > * Checked for folio pinning and abort if in place. We do so, because this
> >   implies the folio is being used by the kernel for a time longer than the
> >   time over which an mmap lock is held (which will not be held at the time
> >   of us manipulating the folio, as we hold the mmap write lock). We are
> >   manipulating mapping, index fields and being conservative (additionally
> >   mirroring what UFFDIO_MOVE does), we cannot assume that whoever holds the
> >   pin isn't somehow relying on these not being manipulated. As per David.
> > * Propagated mapcount, maybe DMA pinned checks to large folio logic.
> > * Added folio splitting - on second thoughts, it would be a bit silly to
> >   simply disallow the request because of large folio misalignment, work
> >   around this by splitting the folio in this instance.
> > * Added very careful handling around rmap lock, making use of
> >   folio_anon_vma(), to ensure we do not deadlock on anon_vma.
> > * Prefer vm_normal_folio() to vm_normal_page() & page_folio().
> > * Introduced has_shared_anon_vma() to de-duplicate shared anon_vma check.
> > * Provided sys_mremap() helper in vm_util.[ch] to be shared among test
> >   callers and de-duplicate. This must be a raw system call, as glibc will
> >   otherwise filter the flags.
> > * Expanded the mm CoW self-tests to explicitly test with
> >   MREMAP_RELOCATE_ANON for partial THP pages. This is useful as it
> >   exercises split_folio() code paths explicitly. Additionally some cases
> >   cannot succeed, so we also exercise undo paths.
> > * Added explicit lockdep handling to teach it that we are handling two
> >   distinct anon_vma locks so it doesn't spuriously report a deadlock.
> > * Updated anon_vma deadlock checks to check anon_vma->root. Shouldn't
> >   strictly be necessary as we explicitly limit ourselves to unforked
> >   anon_vma's, but it is more correct to do so, as this is where the lock is
> >   located.
> > * Expanded the split_huge_page_test.c test to also test using the
> >   MREMAP_RELOCATE_ANON flag, this is useful as it exercises the undo path.
> > https://lore.kernel.org/all/cover.1745307301.git.lorenzo.stoakes@oracle.com/
> > 
> > RFC v1:
> > https://lore.kernel.org/all/cover.1742478846.git.lorenzo.stoakes@oracle.com/
> > 
> > Lorenzo Stoakes (11):
> >   mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
> >   mm/mremap: add MREMAP_MUST_RELOCATE_ANON
> >   mm/mremap: add MREMAP[_MUST]_RELOCATE_ANON support for large folios
> >   tools UAPI: Update copy of linux/mman.h from the kernel sources
> >   tools/testing/selftests: add sys_mremap() helper to vm_util.h
> >   tools/testing/selftests: add mremap() cases that merge normally
> >   tools/testing/selftests: add MREMAP_RELOCATE_ANON merge test cases
> >   tools/testing/selftests: expand mremap() tests for
> >     MREMAP_RELOCATE_ANON
> >   tools/testing/selftests: have CoW self test use MREMAP_RELOCATE_ANON
> >   tools/testing/selftests: test relocate anon in split huge page test
> >   tools/testing/selftests: add MREMAP_RELOCATE_ANON fork tests
> > 
> >  include/linux/rmap.h                          |    4 +
> >  include/uapi/linux/mman.h                     |    8 +-
> >  mm/internal.h                                 |    1 +
> >  mm/mremap.c                                   |  719 ++++++-
> >  mm/vma.c                                      |   77 +-
> >  mm/vma.h                                      |   36 +-
> >  tools/include/uapi/linux/mman.h               |    8 +-
> >  tools/testing/selftests/mm/cow.c              |   23 +-
> >  tools/testing/selftests/mm/merge.c            | 1690 ++++++++++++++++-
> >  tools/testing/selftests/mm/mremap_test.c      |  262 ++-
> >  .../selftests/mm/split_huge_page_test.c       |   25 +-
> >  tools/testing/selftests/mm/vm_util.c          |    8 +
> >  tools/testing/selftests/mm/vm_util.h          |    3 +
> >  tools/testing/vma/vma.c                       |    5 +-
> >  tools/testing/vma/vma_internal.h              |   38 +
> >  15 files changed, 2732 insertions(+), 175 deletions(-)
> > 
> > --
> > 2.49.0

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-16 20:41   ` David Hildenbrand
@ 2025-06-17  8:34     ` Pedro Falcato
  2025-06-17  8:45       ` David Hildenbrand
  2025-06-17 10:20       ` Lorenzo Stoakes
  0 siblings, 2 replies; 41+ messages in thread
From: Pedro Falcato @ 2025-06-17  8:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Rik van Riel, Harry Yoo, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel

On Mon, Jun 16, 2025 at 10:41:20PM +0200, David Hildenbrand wrote:
> On 16.06.25 22:24, David Hildenbrand wrote:
> > Hi Lorenzo,
> > 
> > as discussed offline, there is a lot going on and this is rather ... a
> > lot of code+complexity for something that is more of a corner case. :)
> > 
> > Corner-case as in: only select user space will benefit from this, which
> > is really a shame.
> > 
> > After your presentation at LSF/MM, I thought about this further, and I
> > was wondering whether:
> > 
> > (a) We cannot make this semi-automatic, avoiding flags.
> > 
> > (b) We cannot simplify further by limiting it to the common+easy cases
> > first.
> > 
> > I think you already to some degree did b) as part of this non-RFC, which
> > is great.
> > 
> > 
> > So before digging into the details, let's discuss the high level problem
> > briefly.
> > 
> > I think there are three parts to it:
> > 
> > (1) Detecting whether it is safe to adjust the folio->index (small
> >       folios)
> > 
> > (2) Performance implications of doing so
> > 
> > (3) Detecting whether it is safe to adjust the folio->index (large PTE-
> >       mapped  folios)
> > 
> > 
> > Regarding (1), if we simply track whether a folio was ever used for
> > COW-sharing, it would be very easy: and not only for present folios, but
> > for any anon folios that are referenced by swap/migration entries.
> > Skimming over patch #1, I think you apply a similar logic, which is good.
> > 
> > Regarding (2), it would apply when we mremap() anon VMAs and they happen
> > to reside next to other anon VMAs. Which workloads are we concerned
> > about harming by implementing this optimization? I recall that the most
> > common use case for mremap() is actually for file mappings, but I might

realloc() for mmapped allocations commonly calls mremap(), FYI (at least for
glibc and musl; I haven't bothered to look at the rest).
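
A quick way to see it on a glibc system (the 64 MiB size here is just an
arbitrary value above the mmap threshold):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		size_t size = 64UL << 20;	/* large enough to be mmap()-backed */
		char *p = malloc(size);

		if (!p)
			return 1;
		memset(p, 0, size);		/* fault it in */

		/* Growing an mmap()-backed chunk is typically serviced via mremap(). */
		p = realloc(p, 2 * size);
		if (!p)
			return 1;
		printf("grown to %zu bytes at %p\n", 2 * size, (void *)p);
		free(p);
		return 0;
	}

Running that under 'strace -e trace=mmap,mremap' shows the mremap() call the
allocator issues for the grow.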

> > be wrong. In any case, we could just have a different way to enable this
> > optimization than for each and every mremap() invocation in a process.

/me thinks of prctl

:P


FWIW, with regards to the whole feature: While I do understand its purpose (
relocating anon might be too much for most workloads, but great for some), I'm
uncomfortable with the amount of internals we're exposing here. Who's to say
this is how mm rmap looks in 20 years? And we're stuck maintaining the userspace
ABI until then.

Personally, I would prefer if we just had a flag 'MREMAP_HARDER' that would
vaguely be documented as "mremap but harder, even if it has to do a little more
work". Then we could move things around without promising RELOCATE_ANON makes
conceptual sense, and userspace wouldn't have to think through the implications
of such a flag by reading Lorenzo's great book.
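
Whatever it ends up being called, usage would presumably look something like
the below - the flag value is made up here for illustration, and it has to be
a raw syscall since (as noted elsewhere in the thread) glibc filters unknown
mremap() flags:

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define MREMAP_HARDER	0x10	/* hypothetical value, illustration only */

	static void *mremap_harder(void *old, size_t old_len, size_t new_len)
	{
		/* Raw syscall: the glibc wrapper would reject the unknown flag. */
		return (void *)syscall(SYS_mremap, old, old_len, new_len,
				       MREMAP_MAYMOVE | MREMAP_HARDER, 0);
	}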

-- 
Pedro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-17  8:34     ` Pedro Falcato
@ 2025-06-17  8:45       ` David Hildenbrand
  2025-06-17 10:57         ` Lorenzo Stoakes
  2025-06-17 10:20       ` Lorenzo Stoakes
  1 sibling, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2025-06-17  8:45 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Lorenzo Stoakes, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Rik van Riel, Harry Yoo, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel

On 17.06.25 10:34, Pedro Falcato wrote:
> On Mon, Jun 16, 2025 at 10:41:20PM +0200, David Hildenbrand wrote:
>> On 16.06.25 22:24, David Hildenbrand wrote:
>>> Hi Lorenzo,
>>>
>>> as discussed offline, there is a lot going on and this is rather ... a
>>> lot of code+complexity for something that is more of a corner case. :)
>>>
>>> Corner-case as in: only select user space will benefit from this, which
>>> is really a shame.
>>>
>>> After your presentation at LSF/MM, I thought about this further, and I
>>> was wondering whether:
>>>
>>> (a) We cannot make this semi-automatic, avoiding flags.
>>>
>>> (b) We cannot simplify further by limiting it to the common+easy cases
>>> first.
>>>
>>> I think you already to some degree did b) as part of this non-RFC, which
>>> is great.
>>>
>>>
>>> So before digging into the details, let's discuss the high level problem
>>> briefly.
>>>
>>> I think there are three parts to it:
>>>
>>> (1) Detecting whether it is safe to adjust the folio->index (small
>>>        folios)
>>>
>>> (2) Performance implications of doing so
>>>
>>> (3) Detecting whether it is safe to adjust the folio->index (large PTE-
>>>        mapped  folios)
>>>
>>>
>>> Regarding (1), if we simply track whether a folio was ever used for
>>> COW-sharing, it would be very easy: and not only for present folios, but
>>> for any anon folios that are referenced by swap/migration entries.
>>> Skimming over patch #1, I think you apply a similar logic, which is good.
>>>
>>> Regarding (2), it would apply when we mremap() anon VMAs and they happen
>>> to reside next to other anon VMAs. Which workloads are we concerned
>>> about harming by implementing this optimization? I recall that the most
>>> common use case for mremap() is actually for file mappings, but I might
> 
> realloc() for mmapped allocations commonly calls mremap(), FYI (at least for
> glibc, and musl; can't bother to look at the rest).

Good point. Only for larger areas, I assume, where glibc would already 
fall back to expensive mmap()+munmap() instead of using the optimized 
sparse area.

> 
>>> be wrong. In any case, we could just have a different way to enable this
>>> optimization than for each and every mremap() invocation in a process.
> 
> /me thinks of prctl

I didn't want to spell that out :P I don't think this would have to be 
configurable per process ...

> 
> :P
> 
> 
> FWIW, with regards to the whole feature: While I do understand it's purpose (
> relocating anon might be too much for most workloads, but great for some), I'm
> uncomfortable with the amount of internals we're exposing here. Who's to say
> this is how mm rmap looks in 20 years? And we're stuck maintaining the userspace
> ABI until then.

Yes.

> 
> Personally, I would prefer if we just had a flag 'MREMAP_HARDER' that would
> vaguely be documented as "mremap but harder, even if have to do a little more
> work". Then we could move things around without promising RELOCATE_ANON makes
> conceptual sense, and userspace wouldn't have to think through the implications
> of such a flag by reading Lorenzo's great book.

Even such a flag is just weird.

Next time we do MREMAP_EVEN_HARDER

mremap() is already an expensive operation ... so I think we need a 
pretty convincing case to make this configurable by the user at all for 
each individual mremap() invocation.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-17  6:45   ` Harry Yoo
@ 2025-06-17  9:33     ` Lorenzo Stoakes
  0 siblings, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-17  9:33 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Lai, Yi, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	David Hildenbrand, Pedro Falcato, Rik van Riel, Zi Yan,
	Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Jakub Matena,
	Wei Yang, Barry Song, linux-mm, linux-kernel, yi1.lai

On Tue, Jun 17, 2025 at 03:45:41PM +0900, Harry Yoo wrote:
> On Tue, Jun 17, 2025 at 01:42:20PM +0800, Lai, Yi wrote:
> > On Mon, Jun 09, 2025 at 02:26:34PM +0100, Lorenzo Stoakes wrote:
> >
> > Hi Lorenzo Stoakes,
> >
> > Greetings!
> >
> > I used Syzkaller and found that there is BUG: sleeping function called from invalid context in __relocate_anon_folios in linux-next next-20250616.
> >
> > After bisection, the first bad commit is:
> > "
> > aaf5c23bf6a4 mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
> > "
> >
> > All detailed info can be found at:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios
> > Syzkaller repro code:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/repro.c
> > Syzkaller repro syscall steps:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/repro.prog
> > Syzkaller report:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/repro.report
> > Kconfig(make olddefconfig):
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/kconfig_origin
> > Bisect info:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/bisect_info.log
> > bzImage:
> > https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250617_015846___relocate_anon_folios/bzImage_050f8ad7b58d9079455af171ac279c4b9b828c11
> > Issue dmesg:
> > https://github.com/laifryiee/syzkaller_logs/blob/main/250617_015846___relocate_anon_folios/050f8ad7b58d9079455af171ac279c4b9b828c11_dmesg.log
> >
> > "
> > [   51.309319] BUG: sleeping function called from invalid context at ./include/linux/pagemap.h:1112
> > [   51.309788] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 670, name: repro
> > [   51.310130] preempt_count: 1, expected: 0
> > [   51.310316] RCU nest depth: 1, expected: 0
> > [   51.310502] 4 locks held by repro/670:a
> > [   51.310675]  #0: ffff88801360abe0 (&mm->mmap_lock){++++}-{4:4}, at: __do_sys_mremap+0x42e/0x1620
> > [   51.311098]  #1: ffff888011a2f078 (&anon_vma->rwsem/1){+.+.}-{4:4}, at: copy_vma_and_data+0x541/0x1790
> > [   51.311526]  #2: ffffffff8725c7c0 (rcu_read_lock){....}-{1:3}, at: ___pte_offset_map+0x3f/0x6c0
> > [   51.311929]  #3: ffff888013e8adf8 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: __pte_offset_map_lock+0x1a2/0x3c0
> > [   51.312375] Preemption disabled at:
> > [   51.312377] [<ffffffff81e14222>] __pte_offset_map_lock+0x1a2/0x3c0
> > [   51.312828] CPU: 0 UID: 0 PID: 670 Comm: repro Not tainted 6.16.0-rc2-next-20250616-050f8ad7b58d #1 PREEMPT(voluntary)
> > [   51.312837] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 044
> > [   51.312846] Call Trace:
> > [   51.312850]  <TASK>
> > [   51.312853]  dump_stack_lvl+0x121/0x150
> > [   51.312878]  dump_stack+0x19/0x20
> > [   51.312884]  __might_resched+0x37b/0x5a0
> > [   51.312900]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> > [   51.312917]  __might_sleep+0xa3/0x170
> > [   51.312926]  ? vm_normal_folio+0x8c/0x170
> > [   51.312938]  __relocate_anon_folios+0xf97/0x2960
>
> Looks like it should call folio_trylock() instead of folio_lock()?

I guess it's because we hold the PTE spinlock here... ugh.

We shouldn't be seeing contention on the folio here tbh, and if we do then
that's indicative that perhaps we shouldn't proceed...

We could either try using the lockless PTE mapping stuff, or trylock and
abort if we can't grab the lock.

Will come up with something...
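
FWIW a trylock-and-abort variant would look roughly like the below - a sketch
only, with the relocate_anon_pte() signature assumed and error/NULL handling
elided:

	pte_t *start_pte, *ptep;
	spinlock_t *ptl;
	bool success = true;

	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	for (ptep = start_pte; addr < end; addr += PAGE_SIZE, ptep++) {
		struct folio *folio = vm_normal_folio(vma, addr, ptep_get(ptep));

		if (!folio)
			continue;
		/* The PTL is a spinlock, so we must not sleep in folio_lock(). */
		if (!folio_trylock(folio)) {
			success = false;	/* contended: abort, fall back/undo */
			break;
		}
		success = relocate_anon_pte(&pmc, ptep, addr, folio);	/* assumed */
		folio_unlock(folio);
		if (!success)
			break;
	}
	pte_unmap_unlock(start_pte, ptl);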

>
> --
> Cheers,
> Harry / Hyeonggon
>
> > [   51.312953]  ? reacquire_held_locks+0xdd/0x1f0
> > [   51.312970]  ? __pfx___relocate_anon_folios+0x10/0x10
> > [   51.312982]  ? lock_set_class+0x17a/0x260
> > [   51.312994]  copy_vma_and_data+0x606/0x1790
> > [   51.313006]  ? percpu_counter_add_batch+0xd9/0x210
> > [   51.313028]  ? __pfx_copy_vma_and_data+0x10/0x10
> > [   51.313035]  ? vms_complete_munmap_vmas+0x525/0x810
> > [   51.313051]  ? __pfx_do_vmi_align_munmap+0x10/0x10
> > [   51.313064]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> > [   51.313074]  ? mtree_range_walk+0x728/0xb70
> > [   51.313089]  ? __lock_acquire+0x412/0x22a0
> > [   51.313098]  ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
> > [   51.313107]  ? percpu_counter_add_batch+0xd9/0x210
> > [   51.313114]  ? debug_smp_processor_id+0x20/0x30
> > [   51.313131]  ? __this_cpu_preempt_check+0x21/0x30
> > [   51.313139]  ? lock_is_held_type+0xef/0x150
> > [   51.313149]  move_vma+0x689/0x1a60
> > [   51.313161]  ? __pfx_move_vma+0x10/0x10
> > [   51.313169]  ? cap_mmap_addr+0x58/0x140
> > [   51.313182]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> > [   51.313191]  ? security_mmap_addr+0x63/0x1b0
> > [   51.313203]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> > [   51.313212]  ? __get_unmapped_area+0x1a4/0x440
> > [   51.313223]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> > [   51.313232]  ? vrm_set_new_addr+0x21d/0x2b0
> > [   51.313241]  __do_sys_mremap+0xeb4/0x1620
> > [   51.313251]  ? __pfx___do_sys_mremap+0x10/0x10
> > [   51.313261]  ? __this_cpu_preempt_check+0x21/0x30
> > [   51.313276]  ? __this_cpu_preempt_check+0x21/0x30
> > [   51.313300]  __x64_sys_mremap+0xc7/0x150
> > [   51.313307]  ? syscall_trace_enter+0x14d/0x280
> > [   51.313320]  x64_sys_call+0x1933/0x2150
> > [   51.313332]  do_syscall_64+0x6d/0x2e0
> > [   51.313342]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [   51.313349] RIP: 0033:0x7ff58583ee5d
> > [   51.313361] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 898
> > [   51.313367] RSP: 002b:00007ffd1f3a23e8 EFLAGS: 00000217 ORIG_RAX: 0000000000000019
> > [   51.313376] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff58583ee5d
> > [   51.313380] RDX: 0000000000002000 RSI: 0000000000002000 RDI: 000000002022e000
> > [   51.313384] RBP: 00007ffd1f3a23f0 R08: 000000002038d000 R09: 0000000000000001
> > [   51.313387] R10: 000000000000000f R11: 0000000000000217 R12: 00007ffd1f3a2508
> > [   51.313391] R13: 0000000000401126 R14: 0000000000403e08 R15: 00007ff585bab000
> > [   51.313403]  </TASK>
> > "
> >
> > Hope this could be insightful to you.
> >
> > Regards,
> > Yi Lai
> >
> > ---
> >
> > If you don't need the following environment to reproduce the problem or if you
> > already have one reproduced environment, please ignore the following information.
> >
> > How to reproduce:
> > git clone https://gitlab.com/xupengfe/repro_vm_env.git
> > cd repro_vm_env
> > tar -xvf repro_vm_env.tar.gz
> > cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
> >   // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
> >   // You could change the bzImage_xxx as you want
> >   // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
> > You can use the command below to log in; there is no password for root.
> > ssh -p 10023 root@localhost
> >
> > After logging in to the VM (virtual machine) successfully, you can transfer the
> > reproducer binary to the VM as shown below, and reproduce the problem in the VM:
> > gcc -pthread -o repro repro.c
> > scp -P 10023 repro root@localhost:/root/
> >
> > Get the bzImage for target kernel:
> > Please use target kconfig and copy it to kernel_src/.config
> > make olddefconfig
> > make -jx bzImage           //x should equal or less than cpu num your pc has
> >
> > Fill the bzImage file into above start3.sh to load the target kernel in vm.
> >
> >
> > Tips:
> > If you already have qemu-system-x86_64, please ignore below info.
> > If you want to install qemu v7.1.0 version:
> > git clone https://github.com/qemu/qemu.git
> > cd qemu
> > git checkout -f v7.1.0
> > mkdir build
> > cd build
> > yum install -y ninja-build.x86_64
> > yum -y install libslirp-devel.x86_64
> > ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
> > make
> > make install
> >
> > > A long standing issue with VMA merging of anonymous VMAs is the requirement
> > > to maintain both vma->vm_pgoff and anon_vma compatibility between merge
> > > candidates.
> > >
> > > For anonymous mappings, vma->vm_pgoff (and consequently, folio->index)
> > > refer to virtual page offsets, that is, va >> PAGE_SHIFT.
> > >
> > > However upon mremap() of an anonymous mapping that has been faulted (that
> > > is, where vma->anon_vma != NULL), we would then need to walk page tables to
> > > be able to access let alone manipulate folio->index, mapping fields to
> > > permit an update of this virtual page offset.
> > >
> > > Therefore in these instances, we do not do so, instead retaining the
> > > virtual page offset the VMA was first faulted in at as it's vma->vm_pgoff
> > > field, and of course consequently folio->index.
> > >
> > > On each occasion we use linear_page_index() to determine the appropriate
> > > offset, cleverly offset the vma->vm_pgoff field by the difference between
> > > the virtual address and actual VMA start.
> > >
> > > Doing so in effect fragments the virtual address space, meaning that we are
> > > no longer able to merge these VMAs with adjacent ones that could, at least
> > > theoretically, be merged.
> > >
> > > This also creates a difference in behaviour, often surprising to users,
> > > between mappings which are faulted and those which are not - as for the
> > > latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.
> > >
> > > This is problematic firstly because this proliferates kernel allocations
> > > that are pure memory pressure - unreclaimable and unmovable -
> > > i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.
> > >
> > > Secondly, mremap() exhibits an implicit uAPI in that it does not permit
> > > remaps which span multiple VMAs (though it does permit remaps that
> > > constitute a part of a single VMA).
> > >
> > > This means that a user must concern themselves with whether merges succeed
> > > or not should they wish to use mremap() in such a way which causes multiple
> > > mremap() calls to be performed upon mappings.
> > >
> > > This series provides users with an option to accept the overhead of
> > > actually updating the VMA and underlying folios via the
> > > MREMAP_RELOCATE_ANON flag.
> > >
> > > If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
> > > the mremap() succeeding, then no attempt is made at relocation of folios as
> > > this is not required.
> > >
> > > Even if no merge is possible upon moving of the region, vma->vm_pgoff and
> > > folio->index fields are appropriately updated in order that subsequent
> > > mremap() or mprotect() calls will succeed in merging.
> > >
> > > This flag falls back to the ordinary means of mremap() should the operation
> > > not be feasible. It also transparently undoes the operation, carefully
> > > holding rmap locks such that no racing rmap operation encounters incorrect
> > > or missing VMAs.
> > >
> > > In addition, the MREMAP_MUST_RELOCATE_ANON flag is supplied in case the
> > > user needs to know whether or not the operation succeeded - this flag is
> > > identical to MREMAP_RELOCATE_ANON, only if the operation cannot succeed,
> > > the mremap() fails with -EFAULT.
> > >
> > > Note that no-op mremap() operations (such as an unpopulated range, or a
> > > merge that would trivially succeed already) will succeed under
> > > MREMAP_MUST_RELOCATE_ANON.
> > >
> > > mremap() already walks page tables, so it isn't an order of magnitude
> > > increase in workload, but constitutes the need to walk to page table leaf
> > > level and manipulate folios.
> > >
> > > The operations all succeed under THP and in general are compatible with
> > > underlying large folios of any size. In fact, the larger the folio, the
> > > more efficient the operation is.
> > >
> > > Performance testing indicates that the time taken using MREMAP_RELOCATE_ANON is
> > > on the same order of magnitude as ordinary mremap() operations, with both
> > > exhibiting time proportional to the fraction of the mapping which is populated.
> > >
> > > Of course, mremap() operations that are entirely aligned are significantly
> > > faster as they need only move a VMA and a smaller number of higher order
> > > page tables, but this is unavoidable.
> > >
> > > Previous efforts in this area
> > > =============================
> > >
> > > An approach addressing this issue was previously suggested by Jakub Matena
> > > in a series posted a few years ago in [0] (and discussed in a masters
> > > thesis).
> > >
> > > However this was a more general effort which attempted to always make
> > > anonymous mappings more mergeable, and therefore was not quite ready for
> > > the upstream limelight. In addition, large folio work which has occurred
> > > since requires us to carefully consider and account for this.
> > >
> > > This series is more conservative and targeted (one must specify a flag to
> > > get this behaviour) and additionally goes to great efforts to handle large
> > > folios and account all of the nitty gritty locking concerns that might
> > > arise in current kernel code.
> > >
> > > Thanks goes out to Jakub for his efforts however, and hopefully this effort
> > > to take a slightly different approach to the same problem is pleasing to
> > > him regardless :)
> > >
> > > [0]:https://lore.kernel.org/all/20220311174602.288010-1-matenajakub@gmail.com/
> > >
> > > Use-cases
> > > =========
> > >
> > > * ZGC is a concurrent GC shipped with OpenJDK. A prototype is being worked
> > >   upon which makes use of extensive mremap() operations to perform
> > >   defragmentation of objects, taking advantage of the plentiful available
> > >   virtual address space in a 64-bit system.
> > >
> > >   In instances where one VMA is faulted in and another not, merging is not
> > >   possible, which leads to significant, unreclaimable, kernel metadata
> > >   overhead and contention on the vm.max_map_count limit.
> > >
> > >   This series eliminates the issue entirely.
> > > * It was indicated that Android similarly moves memory around and
> > >   encounters the very same issues as ZGC.
> > > * SUSE indicate they have encountered similar issues as pertains to an
> > >   internal client.
> > >
> > > Past approaches
> > > ===============
> > >
> > > In discussions at LSF/MM/BPF it was suggested that we could make this an
> > > madvise() operation, however at this point it will be too late to correctly
> > > perform the merge, requiring an unmap/remap which would be egregious.
> > >
> > > It was further suggested that we simply defer the operation to the point at
> > > which an mremap() is attempted on multiple immediately adjacent VMAs (that
> > > is - to allow VMA fragmentation up until the point where it might cause
> > > perceptible issues with uAPI).
> > >
> > > This is problematic in that in the first instance - you accrue
> > > fragmentation, and only if you were to try to move the fragmented objects
> > > again would you resolve it.
> > >
> > > Additionally you would not be able to handle the mprotect() case, and you'd
> > > have the same issue as the madvise() approach in that you'd need to
> > > essentially re-map each VMA.
> > >
> > > Additionally it would become non-trivial to correctly merge the VMAs - if
> > > there were more than 3, we would need to invent a new merging mechanism
> > > specifically for this, hold locks carefully over each to avoid them
> > > disappearing from beneath us and introduce a great deal of non-optional
> > > complexity.
> > >
> > > While imperfect, the mremap flag approach seems the least invasive, most
> > > workable solution (until further rework of the anon_vma mechanism can be
> > > achieved!)
> > >
> > > Testing
> > > =======
> > >
> > > * Significantly expanded self-tests, all of which are passing.
> > > * Explicit testing of forked cases including anon_vma reuse, all passing
> > >   correctly.
> > > * Ran all self tests with MREMAP_RELOCATE_ANON forced on for all anonymous
> > >   mremap()'s.
> > > * Ran heavy workloads with MREMAP_RELOCATE_ANON forced on on real hardware
> > >   (kernel compilation, etc.)
> > > * Ran stress-ng --mremap 32 for an hour with MREMAP_RELOCATE_ANON forced on
> > >   on real hardware.
> > >
> > > Series History
> > > ==============
> > >
> > > Non-RFC:
> > > * Rebased on mm-new and fixed merge conflicts, re-confirmed building and
> > >   all tests passing.
> > > * Seems to have settled down with all feedback previously raised addressed,
> > >   so un-RFC'd to propose the series for mainline, timed for the start of
> > >   the 6.16 rc cycle (thus targeting 6.17).
> > >
> > > RFC v3:
> > > * Rebased on and fixed conflicts against mm-new.
> > > * Removed invalid use of folio_test_large_maybe_mapped_shared() in
> > >   __relocate_large_folio() - this has since been removed and inlined (see
> > >   [0]) anyway but we should be using folio_maybe_mapped_shared() here at
> > >   any rate.
> > > * Moved unnecessary folio large, ksm checks in __relocate_large_folio() to
> > >   relocate_large_folio() - we already check this in relocate_anon_pte() so
> > >   this is duplicated in that case.
> > > * Added new tests explicitly checking that MREMAP_MUST_RELOCATE_ANON fails
> > >   for forked processes, both forked children with parents as indicated by
> > >   avc, and forked parents with children.
> > > * Added anon_vma_assert_locked() helper.
> > > * Removed vma_had_uncowed_children() as it was incorrectly implemented (it
> > >   didn't account for grandchildren and descendants not being
> > >   self-parented), and replaced with a general
> > >   vma_maybe_has_shared_anon_folios() function which checks both parent and
> > >   child VMAs. Wei raised a concern in this area, this helps clarify and
> > >   correct.
> > > * Converted anon_vma vs. mmap lock check in
> > >   vma_maybe_has_shared_anon_folios() to be more sensible and to assume the
> > >   caller holds sufficient locks (checked with assert).
> > > * Added additional recipients based on recent MAINTAINERS changes.
> > > * Added missing reference to Jakub's efforts in this area a few years ago
> > >   to cover letter. Thanks Jakub!
> > > https://lore.kernel.org/all/cover.1746305604.git.lorenzo.stoakes@oracle.com/
> > >
> > > RFC v2:
> > > * Added folio_mapcount() check on relocate anon to assert exclusively
> > >   mapped as per Jann.
> > > * Added check for anon_vma->num_children > nr_pages in
> > >   should_relocate_anon() as per Jann.
> > > * Separated out vma_had_uncowed_parents() into shared helper function and
> > >   added vma_had_uncowed_children() to implement the above.
> > > * Add comment clarifying why we do not require an rmap lock on the old VMA
> > >   due to fork requiring an mmap write lock which we hold.
> > > * Corrected error path on __anon_vma_prepare() in copy_vma() as per Jann.
> > > * Checked for folio pinning and abort if in place. We do so, because this
> > >   implies the folio is being used by the kernel for a time longer than the
> > >   time over which an mmap lock is held (which will not be held at the time
> > >   of us manipulating the folio, as we hold the mmap write lock). We are
> > >   manipulating mapping, index fields and being conservative (additionally
> > >   mirroring what UFFDIO_MOVE does), we cannot assume that whoever holds the
> > >   pin isn't somehow relying on these not being manipulated. As per David.
> > > * Propagated mapcount, maybe DMA pinned checks to large folio logic.
> > > * Added folio splitting - on second thoughts, it would be a bit silly to
> > >   simply disallow the request because of large folio misalignment, work
> > >   around this by splitting the folio in this instance.
> > > * Added very careful handling around rmap lock, making use of
> > >   folio_anon_vma(), to ensure we do not deadlock on anon_vma.
> > > * Prefer vm_normal_folio() to vm_normal_page() & page_folio().
> > > * Introduced has_shared_anon_vma() to de-duplicate shared anon_vma check.
> > > * Provided sys_mremap() helper in vm_util.[ch] to be shared among test
> > >   callers and de-duplicate. This must be a raw system call, as glibc will
> > >   otherwise filter the flags.
> > > * Expanded the mm CoW self-tests to explicitly test with
> > >   MREMAP_RELOCATE_ANON for partial THP pages. This is useful as it
> > >   exercises split_folio() code paths explicitly. Additionally some cases
> > >   cannot succeed, so we also exercise undo paths.
> > > * Added explicit lockdep handling to teach it that we are handling two
> > >   distinct anon_vma locks so it doesn't spuriously report a deadlock.
> > > * Updated anon_vma deadlock checks to check anon_vma->root. Shouldn't
> > >   strictly be necessary as we explicitly limit ourselves to unforked
> > >   anon_vma's, but it is more correct to do so, as this is where the lock is
> > >   located.
> > > * Expanded the split_huge_page_test.c test to also test using the
> > >   MREMAP_RELOCATE_ANON flag, this is useful as it exercises the undo path.
> > > https://lore.kernel.org/all/cover.1745307301.git.lorenzo.stoakes@oracle.com/
> > >
> > > RFC v1:
> > > https://lore.kernel.org/all/cover.1742478846.git.lorenzo.stoakes@oracle.com/
> > >
> > > Lorenzo Stoakes (11):
> > >   mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
> > >   mm/mremap: add MREMAP_MUST_RELOCATE_ANON
> > >   mm/mremap: add MREMAP[_MUST]_RELOCATE_ANON support for large folios
> > >   tools UAPI: Update copy of linux/mman.h from the kernel sources
> > >   tools/testing/selftests: add sys_mremap() helper to vm_util.h
> > >   tools/testing/selftests: add mremap() cases that merge normally
> > >   tools/testing/selftests: add MREMAP_RELOCATE_ANON merge test cases
> > >   tools/testing/selftests: expand mremap() tests for
> > >     MREMAP_RELOCATE_ANON
> > >   tools/testing/selftests: have CoW self test use MREMAP_RELOCATE_ANON
> > >   tools/testing/selftests: test relocate anon in split huge page test
> > >   tools/testing/selftests: add MREMAP_RELOCATE_ANON fork tests
> > >
> > >  include/linux/rmap.h                          |    4 +
> > >  include/uapi/linux/mman.h                     |    8 +-
> > >  mm/internal.h                                 |    1 +
> > >  mm/mremap.c                                   |  719 ++++++-
> > >  mm/vma.c                                      |   77 +-
> > >  mm/vma.h                                      |   36 +-
> > >  tools/include/uapi/linux/mman.h               |    8 +-
> > >  tools/testing/selftests/mm/cow.c              |   23 +-
> > >  tools/testing/selftests/mm/merge.c            | 1690 ++++++++++++++++-
> > >  tools/testing/selftests/mm/mremap_test.c      |  262 ++-
> > >  .../selftests/mm/split_huge_page_test.c       |   25 +-
> > >  tools/testing/selftests/mm/vm_util.c          |    8 +
> > >  tools/testing/selftests/mm/vm_util.h          |    3 +
> > >  tools/testing/vma/vma.c                       |    5 +-
> > >  tools/testing/vma/vma_internal.h              |   38 +
> > >  15 files changed, 2732 insertions(+), 175 deletions(-)
> > >
> > > --
> > > 2.49.0


* Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-17  6:37     ` Harry Yoo
@ 2025-06-17  9:52       ` Lorenzo Stoakes
  2025-06-17 10:01         ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-17  9:52 UTC (permalink / raw)
  To: Harry Yoo
  Cc: David Hildenbrand, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Pedro Falcato, Rik van Riel, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel

On Tue, Jun 17, 2025 at 03:37:52PM +0900, Harry Yoo wrote:
> On Mon, Jun 16, 2025 at 10:58:28PM +0200, David Hildenbrand wrote:
> > On 09.06.25 15:26, Lorenzo Stoakes wrote:
> > > When mremap() moves a mapping around in memory, it goes to great lengths to
> > > avoid having to walk page tables as this is expensive and
> > > time-consuming.
> > >
> > > Rather, if the VMA was faulted (that is vma->anon_vma != NULL), the virtual
> > > page offset stored in the VMA at vma->vm_pgoff will remain the same, as
> > > well as all the folio indexes pointed at the associated anon_vma object.
> > >
> > > This means the VMA and page tables can simply be moved and this effects the
> > > change (and if we can move page tables at a higher page table level, this
> > > is even faster).
> > >
> > > While this is efficient, it does lead to big problems with VMA merging - in
> > > essence it causes faulted anonymous VMAs to not be mergeable under many
> > > circumstances once moved.
> > >
> > > This is limiting and leads to both a proliferation of unreclaimable,
> > > unmovable kernel metadata (VMAs, anon_vma's, anon_vma_chain's) and has an
> > > impact on further use of mremap(), which has a requirement that the VMA
> > > moved (which can also be a partial range within a VMA) may span only a
> > > single VMA.
> > >
> > > This makes the mergeability or not of VMAs in effect a uAPI concern.
> > >
> > > In some use cases, users may wish to accept the overhead of actually going
> > > to the trouble of updating VMAs and folios to effect mremap() moves. Let's
> > > provide them with the choice.
> > >
> > > This patch adds a new MREMAP_RELOCATE_ANON flag to do just that, which
> > > attempts to perform such an operation. If it is unable to do so, it cleanly
> > > falls back to the usual method.
> > >
> > > It carefully takes the rmap locks such that at no time will a racing rmap
> > > user encounter incorrect or missing VMAs.
> > >
> > > It is also designed to interact cleanly with the existing mremap() error
> > > fallback mechanism (inverting the remap should the page table move fail).
> > >
> > > Also, if we could merge cleanly without such a change, we do so, avoiding
> > > the overhead of the operation if it is not required.
> > >
> > > In the instance that no merge may occur when the move is performed, we
> > > still perform the folio and VMA updates to ensure that future mremap() or
> > > mprotect() calls will result in merges.
> > >
> > > In this implementation, we simply give up if we encounter large folios. A
> > > subsequent commit will extend the functionality to allow for these cases.
> > >
> > > We restrict this flag to purely anonymous memory only.
> > >
> > > We separate out the vma_had_uncowed_parents() helper function for checking
> > > in should_relocate_anon() and introduce a new function
> > > vma_maybe_has_shared_anon_folios() which combines a check against this and
> > > any forked child anon_vma's.
> > >
> > > We carefully check for pinned folios in case a caller who holds a pin might
> > > make assumptions about index, mapping fields which we are about to
> > > manipulate.
> >
> > Some quick feedback - I have not yet digested everything.
> >
> > > @@ -1134,6 +1380,67 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
> > >   	pmc.new = new_vma;
> > > +	if (relocate_anon) {
> > > +		lock_new_anon_vma(new_vma);
> > > +		pmc.relocate_locked = new_vma;
> > > +
> > > +		if (!relocate_anon_folios(&pmc, /* undo= */false)) {
> > > +			unsigned long start = new_vma->vm_start;
> > > +			unsigned long size = new_vma->vm_end - start;
> > > +
> > > +			/* Undo if fails. */
> > > +			relocate_anon_folios(&pmc, /* undo= */true);
> >
> > You'd assume this cannot fail, but I think it can: imagine concurrent
> > GUP-fast ...
>
> Oops, that sounds really bad.

I don't think it's quite as bad as it sounds. Let's reserve judgment until we've
fully analysed this and considered different approaches :)

>
> > I really wish we can find a way to not require the fallback.
>
> Maybe split the VMA at the point where it fails, instead of undo?

I don't think this is actually possible without major rework, as we've separated
the VMA part of the operation from the folio and page table parts.

Let me put thoughts on this in reply to David so we don't split conversation
(pun intended ;) I think we have other options also.

>
> --
> Cheers,
> Harry / Hyeonggon


* Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-17  9:52       ` Lorenzo Stoakes
@ 2025-06-17 10:01         ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2025-06-17 10:01 UTC (permalink / raw)
  To: Lorenzo Stoakes, Harry Yoo
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, Liam R . Howlett,
	Suren Baghdasaryan, Matthew Wilcox, Pedro Falcato, Rik van Riel,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Jakub Matena, Wei Yang, Barry Song, linux-mm, linux-kernel

>>> I really wish we can find a way to not require the fallback.
>>
>> Maybe split the VMA at the point where it fails, instead of undo?
> 
> I don't think this is actually possible without major rework as we've separated
> the VMA and folio, page table parts of the operation.
> 
> Let me put thoughts on this in reply to David so we don't split conversation
> (pun intended ;) I think we have other options also.

:)

I'm still trying to wrap my head around alternatives, and possible 
simplifications ... oh my is this all complicated.

-- 
Cheers,

David / dhildenb



* Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-16 20:58   ` David Hildenbrand
  2025-06-17  6:37     ` Harry Yoo
@ 2025-06-17 10:07     ` Lorenzo Stoakes
  2025-06-17 12:07       ` David Hildenbrand
  1 sibling, 1 reply; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-17 10:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, Liam R . Howlett,
	Suren Baghdasaryan, Matthew Wilcox, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

On Mon, Jun 16, 2025 at 10:58:28PM +0200, David Hildenbrand wrote:
> On 09.06.25 15:26, Lorenzo Stoakes wrote:
> > When mremap() moves a mapping around in memory, it goes to great lengths to
> > avoid having to walk page tables as this is expensive and
> > time-consuming.
> >
> > Rather, if the VMA was faulted (that is vma->anon_vma != NULL), the virtual
> > page offset stored in the VMA at vma->vm_pgoff will remain the same, as
> > well as all the folio indexes pointed at the associated anon_vma object.
> >
> > This means the VMA and page tables can simply be moved and this effects the
> > change (and if we can move page tables at a higher page table level, this
> > is even faster).
> >
> > While this is efficient, it does lead to big problems with VMA merging - in
> > essence it causes faulted anonymous VMAs to not be mergeable under many
> > circumstances once moved.
> >
> > This is limiting and leads to both a proliferation of unreclaimable,
> > unmovable kernel metadata (VMAs, anon_vma's, anon_vma_chain's) and has an
> > impact on further use of mremap(), which has a requirement that the VMA
> > moved (which can also be a partial range within a VMA) may span only a
> > single VMA.
> >
> > This makes the mergeability or not of VMAs in effect a uAPI concern.
> >
> > In some use cases, users may wish to accept the overhead of actually going
> > to the trouble of updating VMAs and folios to effect mremap() moves. Let's
> > provide them with the choice.
> >
> > This patch adds a new MREMAP_RELOCATE_ANON flag to do just that, which
> > attempts to perform such an operation. If it is unable to do so, it cleanly
> > falls back to the usual method.
> >
> > It carefully takes the rmap locks such that at no time will a racing rmap
> > user encounter incorrect or missing VMAs.
> >
> > It is also designed to interact cleanly with the existing mremap() error
> > fallback mechanism (inverting the remap should the page table move fail).
> >
> > Also, if we could merge cleanly without such a change, we do so, avoiding
> > the overhead of the operation if it is not required.
> >
> > In the instance that no merge may occur when the move is performed, we
> > still perform the folio and VMA updates to ensure that future mremap() or
> > mprotect() calls will result in merges.
> >
> > In this implementation, we simply give up if we encounter large folios. A
> > subsequent commit will extend the functionality to allow for these cases.
> >
> > We restrict this flag to purely anonymous memory only.
> >
> > We separate out the vma_had_uncowed_parents() helper function for checking
> > in should_relocate_anon() and introduce a new function
> > vma_maybe_has_shared_anon_folios() which combines a check against this and
> > any forked child anon_vma's.
> >
> > We carefully check for pinned folios in case a caller who holds a pin might
> > make assumptions about index, mapping fields which we are about to
> > manipulate.
>
> Some quick feedback - I have not yet digested everything.

Thanks for taking a look! Appreciated :)

>
> [...]
>
> > +/*
> > + * If the folio mapped at the specified pte entry can have its index and mapping
> > + * relocated, then do so.
> > + *
> > + * Returns the number of pages we have traversed, or 0 if the operation failed.
> > + */
> > +static unsigned long relocate_anon_pte(struct pagetable_move_control *pmc,
> > +		struct pte_state *state, bool undo)
> > +{
> > +	struct folio *folio;
> > +	struct vm_area_struct *old, *new;
> > +	pgoff_t new_index;
> > +	pte_t pte;
> > +	unsigned long ret = 1;
> > +	unsigned long old_addr = state->old_addr;
> > +	unsigned long new_addr = state->new_addr;
> > +
> > +	old = pmc->old;
> > +	new = pmc->new;
> > +
> > +	pte = ptep_get(state->ptep);
> > +
> > +	/* Ensure we have truly got an anon folio. */
> > +	folio = vm_normal_folio(old, old_addr, pte);
> > +	if (!folio)
> > +		return ret;
> > +
> > +	folio_lock(folio);
> > +
> > +	/* No-op. */
> > +	if (!folio_test_anon(folio) || folio_test_ksm(folio))
> > +		goto out;
> > +
>
> So these cases are all "pass".

Yeah, this is maybe not entirely clear. But it's more like 'we don't need to do
anything with these'.

Really we shouldn't be encountering non-anon folios here given we've checked the
VMA, but if we do, somehow, then there's nothing to do.

>
> > +	/*
> > +	 * This should never be the case as we have already checked to ensure
> > +	 * that the anon_vma is not forked, and we have just asserted that it is
> > +	 * anonymous.
> > +	 */
> > +	if (WARN_ON_ONCE(folio_maybe_mapped_shared(folio)))
> > +		goto out;
>
> Good a warning, so we should be able to handle that early.

:)

>
> > +	/* The above check should imply these. */
> > +	VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
> > +	VM_WARN_ON_ONCE(!PageAnonExclusive(folio_page(folio, 0)));
>
> This can trigger in one nasty case, where we can lose the PAE bit during
> swapin (refault from the swapcache while the folio is under writeback, and
> the device does not allow for modifying the data while under writeback).

Ugh god wasn't aware of that. So maybe drop this second one?

>
> > +
> > +	/*
> > +	 * A pinned folio implies that it will be used for a duration longer
> > +	 * than that over which the mmap_lock is held, meaning that another part
> > +	 * of the kernel may be making use of this folio.
> > +	 *
> > +	 * Since we are about to manipulate index & mapping fields, we cannot
> > +	 * safely proceed because whatever has pinned this folio may then
> > +	 * incorrectly assume these do not change.
> > +	 */
> > +	if (folio_maybe_dma_pinned(folio))
> > +		goto out;
>
> As discussed, this can race with GUP-fast. SO *maybe* we can just allow for
> moving these.

I'm guessing you mean as discussed below? :P Or in the cover letter I've not
read yet? :P

Yeah, to be honest you shouldn't be fiddling with index, mapping anyway except
via rmap logic.

I will audit access of these fields just to be safe.

>
> (after all we still have ordinary GUP that would also not be covered by this
> check)
>
> > +
> > +	/*
> > +	 * This should not happen as we explicitly disallow this, but check
> > +	 * anyway.
> > +	 */
> > +	if (folio_test_large(folio)) {
> > +		ret = 0;
> > +		goto out;
> > +	}
>
> That is the only real problem for rollback so far I assume.

Well, this becomes ultimately irrelevant in a later patch where we indeed
support large folios.

>
> > +
> > +	if (!undo)
> > +		new_index = linear_page_index(new, new_addr);
> > +	else
> > +		new_index = linear_page_index(old, old_addr);
> > +
> > +	/*
> > +	 * The PTL should keep us safe from unmapping, and the fact the folio is
> > +	 * a PTE keeps the folio referenced.
> > +	 *
> > +	 * The mmap/VMA locks should keep us safe from fork and other processes.
> > +	 *
> > +	 * The rmap locks should keep us safe from anything happening to the
> > +	 * VMA/anon_vma.
> > +	 *
> > +	 * The folio lock should keep us safe from reclaim, migration, etc.
> > +	 */
> > +	folio_move_anon_rmap(folio, undo ? old : new);
> > +	WRITE_ONCE(folio->index, new_index);
> > +
> > +out:
> > +	folio_unlock(folio);
> > +	return ret;
> > +}
> > +
> > +static bool pte_done(struct pte_state *state)
> > +{
> > +	return state->old_addr >= state->old_end;
> > +}
> > +
> > +static void pte_next(struct pte_state *state, unsigned long nr_pages)
> > +{
> > +	state->old_addr += nr_pages * PAGE_SIZE;
> > +	state->new_addr += nr_pages * PAGE_SIZE;
> > +	state->ptep += nr_pages;
> > +}
> > +
> > +static bool relocate_anon_ptes(struct pagetable_move_control *pmc,
> > +		unsigned long extent, pmd_t *pmdp, bool undo)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +	struct pte_state state = {
> > +		.old_addr = pmc->old_addr,
> > +		.new_addr = pmc->new_addr,
> > +		.old_end = pmc->old_addr + extent,
> > +	};
> > +	pte_t *ptep_start;
> > +	bool ret;
> > +	unsigned long nr_pages;
> > +
> > +	ptep_start = pte_offset_map_lock(mm, pmdp, pmc->old_addr, &state.ptl);
> > +	/*
> > +	 * We prevent faults with mmap write lock, hold the rmap lock and should
> > +	 * not fail to obtain this lock. Just give up if we can't.
> > +	 */
> > +	if (!ptep_start)
> > +		return false;
> > +
> > +	state.ptep = ptep_start;
> > +	for (; !pte_done(&state); pte_next(&state, nr_pages)) {
> > +		pte_t pte = ptep_get(state.ptep);
> > +
> > +		if (pte_none(pte) || !pte_present(pte)) {
> > +			nr_pages = 1;
>
> What if we have
>
> (a) A migration entry (possibly we might fail migration and simply remap the
> original folio)
>
> (b) A swap entry with a folio in the swapcache that we can refault.
>
> I don't think we can simply skip these ...

Good point... will investigate these cases.

>
> > +			continue;
> > +		}
> > +
> > +		nr_pages = relocate_anon_pte(pmc, &state, undo);
> > +		if (!nr_pages) {
> > +			ret = false;
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	ret = true;
> > +out:
> > +	pte_unmap_unlock(ptep_start, state.ptl);
> > +	return ret;
> > +}
> > +
> > +static bool __relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
> > +{
> > +	pud_t *pudp;
> > +	pmd_t *pmdp;
> > +	unsigned long extent;
> > +	struct mm_struct *mm = current->mm;
> > +
> > +	if (!pmc->len_in)
> > +		return true;
> > +
> > +	for (; !pmc_done(pmc); pmc_next(pmc, extent)) {
> > +		pmd_t pmd;
> > +		pud_t pud;
> > +
> > +		extent = get_extent(NORMAL_PUD, pmc);
> > +
> > +		pudp = get_old_pud(mm, pmc->old_addr);
> > +		if (!pudp)
> > +			continue;
> > +		pud = pudp_get(pudp);
> > +
> > +		if (pud_trans_huge(pud) || pud_devmap(pud))
> > +			return false;
>
> We don't support PUD-size THP, why to we have to fail here?

This is just to be in line with other 'magical future where we have PUD THP'
stuff in mremap.c.

A later commit that permits huge folio support actually lets us support these...

>
> > +
> > +		extent = get_extent(NORMAL_PMD, pmc);
> > +		pmdp = get_old_pmd(mm, pmc->old_addr);
> > +		if (!pmdp)
> > +			continue;
> > +		pmd = pmdp_get(pmdp);
> > +
> > +		if (is_swap_pmd(pmd) || pmd_trans_huge(pmd) ||
> > +		    pmd_devmap(pmd))
> > +			return false;
>
> Okay, this case could likely be handled later (present anon folio or
> migration entry; everything else, we can skip).

Hmm, but how? the PMD cannot be traversed in this case?

'Present' migration entry? Migration entries are non-present right? :) Or is it
different at PMD?

>
> > +
> > +		if (pmd_none(pmd))
> > +			continue;
> > +
> > +		if (!relocate_anon_ptes(pmc, extent, pmdp, undo))
> > +			return false;
> > +	}
> > +
> > +	return true;
> > +}
> > +
> > +static bool relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
> > +{
> > +	unsigned long old_addr = pmc->old_addr;
> > +	unsigned long new_addr = pmc->new_addr;
> > +	bool ret;
> > +
> > +	ret = __relocate_anon_folios(pmc, undo);
> > +
> > +	/* Reset state ready for retry. */
> > +	pmc->old_addr = old_addr;
> > +	pmc->new_addr = new_addr;
> > +
> > +	return ret;
> > +}
> > +
> >   unsigned long move_page_tables(struct pagetable_move_control *pmc)
> >   {
> >   	unsigned long extent;
> > @@ -1134,6 +1380,67 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
> >   	}
> >   }
> > +/*
> > + * Should we attempt to relocate anonymous folios to the location that the VMA
> > + * is being moved to by updating index and mapping fields accordingly?
> > + */
> > +static bool should_relocate_anon(struct vma_remap_struct *vrm,
> > +	struct pagetable_move_control *pmc)
> > +{
> > +	struct vm_area_struct *old = vrm->vma;
> > +
> > +	/* Currently we only do this if requested. */
> > +	if (!(vrm->flags & MREMAP_RELOCATE_ANON))
> > +		return false;
> > +
> > +	/* We can't deal with special or hugetlb mappings. */
> > +	if (old->vm_flags & (VM_SPECIAL | VM_HUGETLB))
> > +		return false;
> > +
> > +	/* We only support anonymous mappings. */
> > +	if (!vma_is_anonymous(old))
> > +		return false;
>
> I suspect MAP_PRIVATE file mappings should be easy to extend?

Yeah, but perhaps best to be conservative at first.

>
> [...]
>
> >   	pmc.new = new_vma;
> > +	if (relocate_anon) {
> > +		lock_new_anon_vma(new_vma);
> > +		pmc.relocate_locked = new_vma;
> > +
> > +		if (!relocate_anon_folios(&pmc, /* undo= */false)) {
> > +			unsigned long start = new_vma->vm_start;
> > +			unsigned long size = new_vma->vm_end - start;
> > +
> > +			/* Undo if fails. */
> > +			relocate_anon_folios(&pmc, /* undo= */true);
>
> You'd assume this cannot fail, but I think it can: imagine concurrent
> GUP-fast ...

Well if we change the racy code to ignore DMA pinning we should be ok, right?

>
> I really wish we can find a way to not require the fallback.

Yeah the fallback is horrible but we really do need it. See the page table move
fallback code for nightmares also :)

We could also alternatively:

- Have some kind of anon_vma fragmentation where some folios in range reference
  a different anon_vma that we link to the original VMA (quite possibly very
  broken though).

- Keep a track of folios somehow and separate them from the page table walk (but
  then we risk races)

- Have some way of telling the kernel that such a situation exists with a new
  object that can be pointed to by folio->mapping, which the rmap code recognises,
  like essentially an 'anon_vma migration entry' which can fail.

I already considered combining this operation with the page table move
operation, but the locking gets horrible and the undo is categorically much
worse and I'm not sure it's actually workable.

>
> --
> Cheers,
>
> David / dhildenb
>


* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-17  8:34     ` Pedro Falcato
  2025-06-17  8:45       ` David Hildenbrand
@ 2025-06-17 10:20       ` Lorenzo Stoakes
  1 sibling, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-17 10:20 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: David Hildenbrand, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Rik van Riel, Harry Yoo, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel

Replying to the parent but quickly replying here also... out of order, like Pulp
Fiction...

On Tue, Jun 17, 2025 at 09:34:16AM +0100, Pedro Falcato wrote:
> On Mon, Jun 16, 2025 at 10:41:20PM +0200, David Hildenbrand wrote:
> > On 16.06.25 22:24, David Hildenbrand wrote:
> > > Hi Lorenzo,
> > >
> > > as discussed offline, there is a lot going on and this is rather ... a
> > > lot of code+complexity for something that is more of a corner case. :)
> > >
> > > Corner-case as in: only select user space will benefit from this, which
> > > is really a shame.
> > >
> > > After your presentation at LSF/MM, I thought about this further, and I
> > > was wondering whether:
> > >
> > > (a) We cannot make this semi-automatic, avoiding flags.
> > >
> > > (b) We cannot simplify further by limiting it to the common+easy cases
> > > first.
> > >
> > > I think you already to some degree did b) as part of this non-RFC, which
> > > is great.
> > >
> > >
> > > So before digging into the details, let's discuss the high level problem
> > > briefly.
> > >
> > > I think there are three parts to it:
> > >
> > > (1) Detecting whether it is safe to adjust the folio->index (small
> > >       folios)
> > >
> > > (2) Performance implications of doing so
> > >
> > > (3) Detecting whether it is safe to adjust the folio->index (large PTE-
> > >       mapped  folios)
> > >
> > >
> > > Regarding (1), if we simply track whether a folio was ever used for
> > > COW-sharing, it would be very easy: and not only for present folios, but
> > > for any anon folios that are referenced by swap/migration entries.
> > > Skimming over patch #1, I think you apply a similar logic, which is good.
> > >
> > > Regarding (2), it would apply when we mremap() anon VMAs and they happen
> > > to reside next to other anon VMAs. Which workloads are we concerned
> > > about harming by implementing this optimization? I recall that the most
> > > common use case for mremap() is actually for file mappings, but I might
>
> realloc() for mmapped allocations commonly calls mremap(), FYI (at least for
> glibc and musl; I can't be bothered to look at the rest).
>
> > > be wrong. In any case, we could just have a different way to enable this
> > > optimization than for each and every mremap() invocation in a process.
>
> /me thinks of prctl
>
> :P

God please :P

>
>
> FWIW, with regards to the whole feature: while I do understand its purpose
> (relocating anon might be too much for most workloads, but great for some), I'm
> uncomfortable with the amount of internals we're exposing here. Who's to say
> this is how mm rmap looks in 20 years? And we're stuck maintaining the userspace
> ABI until then.

I'm not sure what internals exactly we're exposing... if we have a future (I am
working on it...) where anon rmap works better, then these flags become no-ops,
right?

Unless you just mean the name?

I am open to changing the name but I think I already changed it based on what I
thought David might say :P

>
> Personally, I would prefer if we just had a flag 'MREMAP_HARDER' that would
> vaguely be documented as "mremap but harder, even if it has to do a little more
> work". Then we could move things around without promising RELOCATE_ANON makes
> conceptual sense, and userspace wouldn't have to think through the implications
> of such a flag by reading Lorenzo's great book.

Well thanks for book plug ;)

I think that's far too vague though. I don't think users would have any idea
what that's supposed to help with. And then how does the MREMAP_MUST_RELOCATE_ANON
flag work here?

Naming is hard, basically.

>
> --
> Pedro
>


* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-16 20:24 ` [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON David Hildenbrand
  2025-06-16 20:41   ` David Hildenbrand
@ 2025-06-17 10:50   ` Lorenzo Stoakes
  1 sibling, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-17 10:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, Liam R . Howlett,
	Suren Baghdasaryan, Matthew Wilcox, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

On Mon, Jun 16, 2025 at 10:24:05PM +0200, David Hildenbrand wrote:
> Hi Lorenzo,
>
> as discussed offline, there is a lot going on and this is rather ... a lot of
> code+complexity for something that is more of a corner case. :)
>
> Corner-case as in: only select user space will benefit from this, which is
> really a shame.

Right, but this is why there's a flag for it. If you don't want to use it, you
don't have to.

I mean, one can argue many things in the kernel are attacking corner cases; I
don't think that's an argument against a feature. mremap() _itself_ is a corner
case :)

On a longer-term aside: I'd like to address this in a far broader fashion - in
fact, I literally am now co-maintaining rmap with you largely because I want to
do this :P

So believe me, this is something that will be at least _tried_. But in the
meantime, the idea is we provide a means to work around a very major limitation
of anon remap.

>
> After your presentation at LSF/MM, I thought about this further, and I was
> wondering whether:
>
> (a) We cannot make this semi-automatic, avoiding flags.

I've addressed the suggestions from LSF/MM in the cover letter below.

I don't think this is possible, largely because of the issues around how we
figure out the anon_vma to attach to.

>
> (b) We cannot simplify further by limiting it to the common+easy cases
> first.
>
> I think you already to some degree did b) as part of this non-RFC, which is
> great.

Main simplifications are: we never touch anything CoW'd, and we only allow
'true' anon.

Well, we focus on 'true' anon first (i.e. no MAP_PRIVATE file mappings), so we
simplify that way. Otherwise it's pretty complete.

>
>
> So before digging into the details, let's discuss the high level problem
> briefly.
>
> I think there are three parts to it:
>
> (1) Detecting whether it is safe to adjust the folio->index (small
>     folios)
>
> (2) Performance implications of doing so
>
> (3) Detecting whether it is safe to adjust the folio->index (large PTE-
>     mapped  folios)

I think you're forgetting folio->mapping also.

This is where a lot of the complexity is - it's rather chicken-and-egg. You need
to:

a. Know that you cannot currently merge with another anon VMA (and thus avoid
   having to do any of this).
b. Have a new VMA with an anon_vma to which you can relocate the folio.
c. Have that anon_vma locked... (rough ordering sketched below)
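
Roughly, the ordering ends up looking like this - a simplified sketch of the
move path in this series (names as in the patch, error handling elided, and the
final fallback step paraphrased):

	if (relocate_anon) {
		/* b. new_vma came from copy_vma() with an anon_vma prepared. */
		/* c. Hold its rmap lock across the whole folio walk. */
		lock_new_anon_vma(new_vma);
		pmc.relocate_locked = new_vma;

		if (!relocate_anon_folios(&pmc, /* undo= */false)) {
			/* Put folio->index/mapping back as they were... */
			relocate_anon_folios(&pmc, /* undo= */true);
			/* ...and fall back to the ordinary move. */
			relocate_anon = false;
		}
	}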

>
>
> Regarding (1), if we simply track whether a folio was ever used for
> COW-sharing, it would be very easy: and not only for present folios, but for
> any anon folios that are referenced by swap/migration entries. Skimming over
> patch #1, I think you apply a similar logic, which is good.

Right.

>
> Regarding (2), it would apply when we mremap() anon VMAs and they happen to
> reside next to other anon VMAs. Which workloads are we concerned about
> harming by implementing this optimization? I recall that the most common use
> case for mremap() is actually for file mappings, but I might be wrong. In
> any case, we could just have a different way to enable this optimization
> than for each and every mremap() invocation in a process.

Yeah we're getting into prctl, mctl hellscape here if we go down that road. And
I want to be conservative here. Having it as an mremap() flag doesn't prevent us
from later doing something policy-ish.

>
> Regarding (3), if we were to split large folios that cross VMA boundaries
> during mremap(), it would be simpler.

The code does that.

>
> How is it handled in this series if we large folio crosses VMA boundaries?
> (a) try splitting or (b) fail (not transparent to the user :( ).

a.

This was a painful thing to work on...

>
>
> > This also creates a difference in behaviour, often surprising to users,
> > between mappings which are faulted and those which are not - as for the
> > latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.
> >
> > This is problematic firstly because this proliferates kernel allocations
> > that are pure memory pressure - unreclaimable and unmovable -
> > i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.
> >
> > Secondly, mremap() exhibits an implicit uAPI in that it does not permit
> > remaps which span multiple VMAs (though it does permit remaps that
> > constitute a part of a single VMA).
>
> If I mremap() to create a hole and mremap() it back, I would assume to
> automatically get the hole closed again, without special flags. Well, we
> both know this is not the case :)

This is a profoundly confusing thing for users, sadly.

>
> > This means that a user must concern themselves with whether merges succeed
> > or not should they wish to use mremap() in such a way which causes multiple
> > mremap() calls to be performed upon mappings.
>
> Right.
>
> >
> > This series provides users with an option to accept the overhead of
> > actually updating the VMA and underlying folios via the
> > MREMAP_RELOCATE_ANON flag.
>
> Okay. I wish we could avoid this flag ...

Me too... hey I've run kernels with this flag just turned on by default and they
seemed fine ;)

>
> >
> > If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
> > the mremap() succeeding, then no attempt is made at relocation of folios as
> > this is not required.
>
> Makes sense. This is the existing behavior then.

Yes, so we have a sane fallback.

>
> >
> > Even if no merge is possible upon moving of the region, vma->vm_pgoff and
> > folio->index fields are appropriately updated in order that subsequent
> > mremap() or mprotect() calls will succeed in merging.
>
> By looking at the surrounding VMAs or simply by trying to always keep
> folio->index corresponding to the address in the VMA? (just as if mremap()
> never happened, I assume?)

This actually addresses future mprotect() merges, for instance (e.g. an
immediately adjacent non-compatible VMA gets mprotect()'d to something
compatible), or cases where other VMAs are mapped adjacent to the moved VMA, etc.

It just means, if you set this flag, and the operation succeeds, we will still
change vma->vm_pgoff and folio->index such that the VMA is mergeable with
immediately adjacent, compatible VMAs.
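
For illustration, a sketch (not part of the series) of how a caller would
request this - a raw syscall is needed since glibc filters unknown mremap()
flags, which is also why the selftests grow a sys_mremap() helper:

	#include <linux/mman.h>		/* MREMAP_* flags, incl. the new one */
	#include <stddef.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Move 'old' to 'target', asking for vm_pgoff/folio->index rewriting. */
	static void *mremap_relocate(void *old, size_t len, void *target)
	{
		return (void *)syscall(__NR_mremap, old, len, len,
				       MREMAP_MAYMOVE | MREMAP_FIXED |
				       MREMAP_RELOCATE_ANON,
				       target);
	}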

>
> >
> > This flag falls back to the ordinary means of mremap() should the operation
> > not be feasible. It also transparently undoes the operation, carefully
> > holding rmap locks such that no racing rmap operation encounters incorrect
> > or missing VMAs.
>
> I absolutely dislike this undo operation, really. :(

Yes me too. It's a complete horror show.

>
> I hope we can find a way to just detect early whether this optimization
> would work.

Well, the problem is if we encounter something at the folio level, right? If
something is unexpected, what then? No matter what, we have to clean up our
mess.

We do try our best to ensure that things will succeed.

>
> Which are the exact error cases you can run into for un-doing?
>
> I assume:
>
> (a) cow-shared anon folio (can detect early)

Yes we should.

>
> (b) large folios crossing VMAs (TBD)

Addressed see later patches in series.

>
> (c) KSM folios? Probably we could move them, I *think* we would have to
> update the ksm_rmap_item. Alternatively, we could indicate if a VMA had any
> KSM folios and give up early in the first version.
>
> (d) GUP pins: I think we could allow that ... folio_maybe_dma_pinned() is
> racy either way (GUP-fast!). To deal with GUP-fast we would have to play
> different games ...
>
> Anything else?

Well, given the bug report in the thread, we also now have a failure to
obtain the folio lock (because we hold the PTE lock) as a thing.

We could address that with lockless PTE traversal though.

Or we could do what we do in the folio_test_large() handling in
relocate_anon_pte() where we drop/reacquire...
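
Something along these lines, roughly (a sketch of the general pattern only, not
the exact code - it assumes we take a folio reference across the PTL drop and
re-validate the PTE afterwards):

	folio_get(folio);
	pte_unmap_unlock(state->ptep, state->ptl);

	folio_lock(folio);	/* may sleep now that the PTL is dropped */

	state->ptep = pte_offset_map_lock(mm, pmdp, state->old_addr, &state->ptl);
	if (!state->ptep || !pte_same(ptep_get(state->ptep), pte)) {
		/* Raced with something - unlock, drop the reference, bail. */
		folio_unlock(folio);
		folio_put(folio);
		return 0;
	}
	folio_put(folio);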

We also have the case where, upon trying to split, we encounter a folio
which already has the currently locked anon_vma set. I can investigate
further how this can happen to determine if we can detect it ahead of time.

Finally the folio split can fail...

I feel like we're on thin ice if we try to make an assumption that a
relocate can always succeed.

>
> >
> > In addition, the MREMAP_MUST_RELOCATE_ANON flag is supplied in case the
> > user needs to know whether or not the operation succeeded - this flag is
> > identical to MREMAP_RELOCATE_ANON, only if the operation cannot succeed,
> > the mremap() fails with -EFAULT.
>
> How would an APP deal with these errors? Do you have a user in mind that
> could do something sensible based on this error?

Well, it's the only way to know if what you wanted actually happened or
not. I guarantee you that people will complain if the issues they use this
to fix aren't always resolved by it.

They could also potentially use it for some retry logic.
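
Something like this from the caller's side (a sketch only - mremap_raw() stands
in for a raw-syscall wrapper, and per the cover letter the 'must' variant fails
with -EFAULT when relocation cannot be done):

	void *p;

	p = mremap_raw(old, len, len,
		       MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_MUST_RELOCATE_ANON,
		       target);
	if (p == MAP_FAILED && errno == EFAULT) {
		/* Relocation not possible right now - e.g. retry later, or
		 * fall back to the best-effort flag. */
		p = mremap_raw(old, len, len,
			       MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_RELOCATE_ANON,
			       target);
	}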

>
> I'm having a hard time imagining that :)

It's useful for testing at the very least - very useful indeed - so on that
basis it's worth having, and it doesn't add too much complexity.

>
> >
> > Note that no-op mremap() operations (such as an unpopulated range, or a
> > merge that would trivially succeed already) will succeed under
> > MREMAP_MUST_RELOCATE_ANON.
> >
> > mremap() already walks page tables, so it isn't an order of magnitude
> > increase in workload, but constitutes the need to walk to page table leaf
> > level and manipulate folios.
>
> Only for anon VMAs, though. Do you have some numbers on how bad it is? I mean,
> mremap() is already a pretty invasive/expensive operation ... :) ... which
> is why people started using uffdio_move instead, to avoid the heavy-weight
> locks.

I got a whole bunch of numbers - things were always within the same order of
magnitude. However, the operation is obviously much slower in cases where the
existing logic could simply have moved a higher-order page table entry rather
than having to traverse folios.

I do feel that mremap() perf shouldn't be a consideration given how
heavy-handed it is already, as you say. But I'm not sure everybody will
share that view...

>
> >
> > The operations all succeed under THP and in general are compatible with
> > underlying large folios of any size. In fact, the larger the folio, the
> > more efficient the operation is.
>
> Yes.
>
> >
> > Performance testing indicates that time taken using MREMAP_RELOCATE_ANON is
> > on the same order of magnitude as ordinary mremap() operations, with both
> > exhibiting time proportional to how much of the mapping is populated.
> >
> > Of course, mremap() operations that are entirely aligned are significantly
> > faster as they need only move a VMA and a smaller number of higher order
> > page tables, but this is unavoidable.
> >
> > Previous efforts in this area
> > =============================
> >
> > An approach addressing this issue was previously suggested by Jakub Matena
> > in a series posted a few years ago in [0] (and discussed in a masters
> > thesis).
> >
> > However this was a more general effort which attempted to always make
> > anonymous mappings more mergeable, and therefore was not quite ready for
> > the upstream limelight. In addition, large folio work which has occurred
> > since requires us to carefully consider and account for this.
> >
> > This series is more conservative and targeted (one must specify a flag to
> > get this behaviour) and additionally goes to great efforts to handle large
> > folios and account for all of the nitty-gritty locking concerns that might
> > arise in current kernel code.
> >
> > Thanks goes out to Jakub for his efforts however, and hopefully this effort
> > to take a slightly different approach to the same problem is pleasing to
> > him regardless :)
> >
> > [0]:https://lore.kernel.org/all/20220311174602.288010-1-matenajakub@gmail.com/
> >
> > Use-cases
> > =========
> >
> > * ZGC is a concurrent GC shipped with OpenJDK. A prototype is being worked
> >    upon which makes use of extensive mremap() operations to perform
> >    defragmentation of objects, taking advantage of the plentiful available
> >    virtual address space in a 64-bit system.
> >
> >    In instances where one VMA is faulted in and another not, merging is not
> >    possible, which leads to significant, unreclaimable, kernel metadata
> >    overhead and contention on the vm.max_map_count limit.
> >
> >    This series eliminates the issue entirely.
> > * It was indicated that Android similarly moves memory around and
> >    encounters the very same issues as ZGC.
>
> Isn't Android using uffdio_move?

I stated this only based on what I was told; I didn't dig deep.

>
> > * SUSE indicate they have encountered similar issues as pertains to an
> >    internal client.
> >
> > Past approaches
> > ===============
> >
> > In discussions at LSF/MM/BPF it was suggested that we could make this an
> > madvise() operation, however at this point it would be too late to correctly
> > perform the merge, requiring an unmap/remap which would be egregious.
> >
> > It was further suggested that we simply defer the operation to the point at
> > which an mremap() is attempted on multiple immediately adjacent VMAs (that
> > is - to allow VMA fragmentation up until the point where it might cause
> > perceptible issues with uAPI).
> >
> > This is problematic in that in the first instance - you accrue
> > fragmentation, and only if you were to try to move the fragmented objects
> > again would you resolve it.
> >
> > Additionally you would not be able to handle the mprotect() case, and you'd
> > have the same issue as the madvise() approach in that you'd need to
> > essentially re-map each VMA.
> >
> > Additionally it would become non-trivial to correctly merge the VMAs - if
> > there were more than 3, we would need to invent a new merging mechanism
> > specifically for this, hold locks carefully over each to avoid them
> > disappearing from beneath us and introduce a great deal of non-optional
> > complexity.
> >
> > While imperfect, the mremap flag approach seems the least invasive, most
> > workable solution (until further rework of the anon_vma mechanism can be
> > achieved!)
>
> Well, at that point we already have these new flags ... :(
>
> >
> >   include/linux/rmap.h                          |    4 +
> >   include/uapi/linux/mman.h                     |    8 +-
> >   mm/internal.h                                 |    1 +
> >   mm/mremap.c                                   |  719 ++++++-
> >   mm/vma.c                                      |   77 +-
> >   mm/vma.h                                      |   36 +-
>
> ~ +40% on LOC on mm/mremap.c :(

SLOC is a terrible measure :) I'd suggest counting how many of those are
comments... :)

The mremap() refactor added a bunch of SLOC but a lot of that was comments,
and breaking out very confusing logic into logical parts etc. It also added
more lines than that...

Unfortunately though trying to do anything like this involves added
complexity. I did try to keep it as minimal as possible...

>
> --
> Cheers,
>
> David / dhildenb
>


* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-17  8:45       ` David Hildenbrand
@ 2025-06-17 10:57         ` Lorenzo Stoakes
  2025-06-17 11:58           ` David Hildenbrand
  2025-06-20 18:59           ` Pedro Falcato
  0 siblings, 2 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-17 10:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pedro Falcato, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Rik van Riel, Harry Yoo, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel

On Tue, Jun 17, 2025 at 10:45:53AM +0200, David Hildenbrand wrote:
> mremap() is already an expensive operation ... so I think we need a pretty
> convincing case to make this configurable by the user at all for each
> individual mremap() invocation.

My measurements suggest that, unless you hit a very unfortunate case (a huge
faulted-in range, all mapped at PTE level), the work involved is not all that
much more substantial, in terms of order of magnitude, than a normal mremap()
operation.

>
> --
> Cheers,
>
> David / dhildenb
>


* Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-09 13:26 ` [PATCH 01/11] " Lorenzo Stoakes
  2025-06-16 20:58   ` David Hildenbrand
@ 2025-06-17 11:15   ` Harry Yoo
  2025-06-17 11:24     ` Lorenzo Stoakes
  2025-06-17 20:09   ` Lorenzo Stoakes
  2 siblings, 1 reply; 41+ messages in thread
From: Harry Yoo @ 2025-06-17 11:15 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, Liam R . Howlett,
	Suren Baghdasaryan, Matthew Wilcox, David Hildenbrand,
	Pedro Falcato, Rik van Riel, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel

On Mon, Jun 09, 2025 at 02:26:35PM +0100, Lorenzo Stoakes wrote:
> When mremap() moves a mapping around in memory, it goes to great lengths to
> avoid having to walk page tables as this is expensive and
> time-consuming.
> 
> Rather, if the VMA was faulted (that is vma->anon_vma != NULL), the virtual
> page offset stored in the VMA at vma->vm_pgoff will remain the same, as
> well as all the folio indexes pointed at the associated anon_vma object.
> 
> This means the VMA and page tables can simply be moved and this effects the
> change (and if we can move page tables at a higher page table level, this
> is even faster).
> 
> While this is efficient, it does lead to big problems with VMA merging - in
> essence it causes faulted anonymous VMAs to not be mergeable under many
> circumstances once moved.
> 
> This is limiting and leads to both a proliferation of unreclaimable,
> unmovable kernel metadata (VMAs, anon_vma's, anon_vma_chain's) and has an
> impact on further use of mremap(), which has a requirement that the VMA
> moved (which can also be a partial range within a VMA) may span only a
> single VMA.
> 
> This makes the mergeability or not of VMAs in effect a uAPI concern.
> 
> In some use cases, users may wish to accept the overhead of actually going
> to the trouble of updating VMAs and folios to effect mremap() moves. Let's
> provide them with the choice.
> 
> This patch adds a new MREMAP_RELOCATE_ANON flag to do just that, which
> attempts to perform such an operation. If it is unable to do so, it cleanly
> falls back to the usual method.
> 
> It carefully takes the rmap locks such that at no time will a racing rmap
> user encounter incorrect or missing VMAs.
> 
> It is also designed to interact cleanly with the existing mremap() error
> fallback mechanism (inverting the remap should the page table move fail).
> 
> Also, if we could merge cleanly without such a change, we do so, avoiding
> the overhead of the operation if it is not required.
> 
> In the instance that no merge may occur when the move is performed, we
> still perform the folio and VMA updates to ensure that future mremap() or
> mprotect() calls will result in merges.
> 
> In this implementation, we simply give up if we encounter large folios. A
> subsequent commit will extend the functionality to allow for these cases.
> 
> We restrict this flag to purely anonymous memory only.
> 
> We separate out the vma_had_uncowed_parents() helper function for checking
> in should_relocate_anon() and introduce a new function
> vma_maybe_has_shared_anon_folios() which combines a check against this and
> any forked child anon_vma's.
> 
> We carefully check for pinned folios in case a caller who holds a pin might
> make assumptions about index, mapping fields which we are about to
> manipulate.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  include/linux/rmap.h             |   4 +
>  include/uapi/linux/mman.h        |   1 +
>  mm/internal.h                    |   1 +
>  mm/mremap.c                      | 403 +++++++++++++++++++++++++++++--
>  mm/vma.c                         |  77 ++++--
>  mm/vma.h                         |  36 ++-
>  tools/testing/vma/vma.c          |   5 +-
>  tools/testing/vma/vma_internal.h |  38 +++
>  8 files changed, 520 insertions(+), 45 deletions(-)

[...snip...]

> @@ -754,6 +797,209 @@ static unsigned long pmc_progress(struct pagetable_move_control *pmc)
>  	return old_addr < orig_old_addr ? 0 : old_addr - orig_old_addr;
>  }
>  
> +/*
> + * If the folio mapped at the specified pte entry can have its index and mapping
> + * relocated, then do so.
> + *
> + * Returns the number of pages we have traversed, or 0 if the operation failed.
> + */
> +static unsigned long relocate_anon_pte(struct pagetable_move_control *pmc,
> +		struct pte_state *state, bool undo)
> +{
> +	struct folio *folio;
> +	struct vm_area_struct *old, *new;
> +	pgoff_t new_index;
> +	pte_t pte;
> +	unsigned long ret = 1;
> +	unsigned long old_addr = state->old_addr;
> +	unsigned long new_addr = state->new_addr;
> +
> +	old = pmc->old;
> +	new = pmc->new;
> +
> +	pte = ptep_get(state->ptep);
> +
> +	/* Ensure we have truly got an anon folio. */
> +	folio = vm_normal_folio(old, old_addr, pte);
> +	if (!folio)
> +		return ret;
> +
> +	folio_lock(folio);
> +
> +	/* No-op. */
> +	if (!folio_test_anon(folio) || folio_test_ksm(folio))
> +		goto out;

I think the kernel should not observe any KSM pages during mremap
because it breaks KSM pages in prep_move_vma()?

-- 
Cheers,
Harry / Hyeonggon


* Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-17 11:15   ` Harry Yoo
@ 2025-06-17 11:24     ` Lorenzo Stoakes
  2025-06-17 11:49       ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-17 11:24 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, Liam R . Howlett,
	Suren Baghdasaryan, Matthew Wilcox, David Hildenbrand,
	Pedro Falcato, Rik van Riel, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel

On Tue, Jun 17, 2025 at 08:15:52PM +0900, Harry Yoo wrote:
> On Mon, Jun 09, 2025 at 02:26:35PM +0100, Lorenzo Stoakes wrote:
> > When mremap() moves a mapping around in memory, it goes to great lengths to
> > avoid having to walk page tables as this is expensive and
> > time-consuming.
> >
> > Rather, if the VMA was faulted (that is vma->anon_vma != NULL), the virtual
> > page offset stored in the VMA at vma->vm_pgoff will remain the same, as
> > well as all the folio indexes pointed at the associated anon_vma object.
> >
> > This means the VMA and page tables can simply be moved and this effects the
> > change (and if we can move page tables at a higher page table level, this
> > is even faster).
> >
> > While this is efficient, it does lead to big problems with VMA merging - in
> > essence it causes faulted anonymous VMAs to not be mergeable under many
> > circumstances once moved.
> >
> > This is limiting and leads to both a proliferation of unreclaimable,
> > unmovable kernel metadata (VMAs, anon_vma's, anon_vma_chain's) and has an
> > impact on further use of mremap(), which has a requirement that the VMA
> > moved (which can also be a partial range within a VMA) may span only a
> > single VMA.
> >
> > This makes the mergeability or not of VMAs in effect a uAPI concern.
> >
> > In some use cases, users may wish to accept the overhead of actually going
> > to the trouble of updating VMAs and folios to effect mremap() moves. Let's
> > provide them with the choice.
> >
> > This patch adds a new MREMAP_RELOCATE_ANON flag to do just that, which
> > attempts to perform such an operation. If it is unable to do so, it cleanly
> > falls back to the usual method.
> >
> > It carefully takes the rmap locks such that at no time will a racing rmap
> > user encounter incorrect or missing VMAs.
> >
> > It is also designed to interact cleanly with the existing mremap() error
> > fallback mechanism (inverting the remap should the page table move fail).
> >
> > Also, if we could merge cleanly without such a change, we do so, avoiding
> > the overhead of the operation if it is not required.
> >
> > In the instance that no merge may occur when the move is performed, we
> > still perform the folio and VMA updates to ensure that future mremap() or
> > mprotect() calls will result in merges.
> >
> > In this implementation, we simply give up if we encounter large folios. A
> > subsequent commit will extend the functionality to allow for these cases.
> >
> > We restrict this flag to purely anonymous memory only.
> >
> > We separate out the vma_had_uncowed_parents() helper function for checking
> > in should_relocate_anon() and introduce a new function
> > vma_maybe_has_shared_anon_folios() which combines a check against this and
> > any forked child anon_vma's.
> >
> > We carefully check for pinned folios in case a caller who holds a pin might
> > make assumptions about index, mapping fields which we are about to
> > manipulate.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  include/linux/rmap.h             |   4 +
> >  include/uapi/linux/mman.h        |   1 +
> >  mm/internal.h                    |   1 +
> >  mm/mremap.c                      | 403 +++++++++++++++++++++++++++++--
> >  mm/vma.c                         |  77 ++++--
> >  mm/vma.h                         |  36 ++-
> >  tools/testing/vma/vma.c          |   5 +-
> >  tools/testing/vma/vma_internal.h |  38 +++
> >  8 files changed, 520 insertions(+), 45 deletions(-)
>
> [...snip...]
>
> > @@ -754,6 +797,209 @@ static unsigned long pmc_progress(struct pagetable_move_control *pmc)
> >  	return old_addr < orig_old_addr ? 0 : old_addr - orig_old_addr;
> >  }
> >
> > +/*
> > + * If the folio mapped at the specified pte entry can have its index and mapping
> > + * relocated, then do so.
> > + *
> > + * Returns the number of pages we have traversed, or 0 if the operation failed.
> > + */
> > +static unsigned long relocate_anon_pte(struct pagetable_move_control *pmc,
> > +		struct pte_state *state, bool undo)
> > +{
> > +	struct folio *folio;
> > +	struct vm_area_struct *old, *new;
> > +	pgoff_t new_index;
> > +	pte_t pte;
> > +	unsigned long ret = 1;
> > +	unsigned long old_addr = state->old_addr;
> > +	unsigned long new_addr = state->new_addr;
> > +
> > +	old = pmc->old;
> > +	new = pmc->new;
> > +
> > +	pte = ptep_get(state->ptep);
> > +
> > +	/* Ensure we have truly got an anon folio. */
> > +	folio = vm_normal_folio(old, old_addr, pte);
> > +	if (!folio)
> > +		return ret;
> > +
> > +	folio_lock(folio);
> > +
> > +	/* No-op. */
> > +	if (!folio_test_anon(folio) || folio_test_ksm(folio))
> > +		goto out;
>
> I think the kernel should not observe any KSM pages during mremap
> because it breaks KSM pages in prep_move_vma()?

Right, nor should we observe !anon pages here since we already checked for
that...

This is belt + braces. Maybe we should replace these with VM_WARN_ON_ONCE()s...?
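
i.e. something like this (sketch - by this point the VMA checks and the KSM
break in prep_move_vma() should have made both conditions impossible):

	VM_WARN_ON_ONCE(!folio_test_anon(folio));
	VM_WARN_ON_ONCE(folio_test_ksm(folio));
	if (!folio_test_anon(folio) || folio_test_ksm(folio))
		goto out;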

>
> --
> Cheers,
> Harry / Hyeonggon


* Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-17 11:24     ` Lorenzo Stoakes
@ 2025-06-17 11:49       ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2025-06-17 11:49 UTC (permalink / raw)
  To: Lorenzo Stoakes, Harry Yoo
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, Liam R . Howlett,
	Suren Baghdasaryan, Matthew Wilcox, Pedro Falcato, Rik van Riel,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Jakub Matena, Wei Yang, Barry Song, linux-mm, linux-kernel

On 17.06.25 13:24, Lorenzo Stoakes wrote:
> On Tue, Jun 17, 2025 at 08:15:52PM +0900, Harry Yoo wrote:
>> On Mon, Jun 09, 2025 at 02:26:35PM +0100, Lorenzo Stoakes wrote:
>>> When mremap() moves a mapping around in memory, it goes to great lengths to
>>> avoid having to walk page tables as this is expensive and
>>> time-consuming.
>>>
>>> Rather, if the VMA was faulted (that is vma->anon_vma != NULL), the virtual
>>> page offset stored in the VMA at vma->vm_pgoff will remain the same, as
>>> well as all the folio indexes pointed at the associated anon_vma object.
>>>
>>> This means the VMA and page tables can simply be moved and this effects the
>>> change (and if we can move page tables at a higher page table level, this
>>> is even faster).
>>>
>>> While this is efficient, it does lead to big problems with VMA merging - in
>>> essence it causes faulted anonymous VMAs to not be mergeable under many
>>> circumstances once moved.
>>>
>>> This is limiting and leads to both a proliferation of unreclaimable,
>>> unmovable kernel metadata (VMAs, anon_vma's, anon_vma_chain's) and has an
>>> impact on further use of mremap(), which has a requirement that the VMA
>>> moved (which can also be a partial range within a VMA) may span only a
>>> single VMA.
>>>
>>> This makes the mergeability or not of VMAs in effect a uAPI concern.
>>>
>>> In some use cases, users may wish to accept the overhead of actually going
>>> to the trouble of updating VMAs and folios to effect mremap() moves. Let's
>>> provide them with the choice.
>>>
>>> This patch adds a new MREMAP_RELOCATE_ANON flag to do just that, which
>>> attempts to perform such an operation. If it is unable to do so, it cleanly
>>> falls back to the usual method.
>>>
>>> It carefully takes the rmap locks such that at no time will a racing rmap
>>> user encounter incorrect or missing VMAs.
>>>
>>> It is also designed to interact cleanly with the existing mremap() error
>>> fallback mechanism (inverting the remap should the page table move fail).
>>>
>>> Also, if we could merge cleanly without such a change, we do so, avoiding
>>> the overhead of the operation if it is not required.
>>>
>>> In the instance that no merge may occur when the move is performed, we
>>> still perform the folio and VMA updates to ensure that future mremap() or
>>> mprotect() calls will result in merges.
>>>
>>> In this implementation, we simply give up if we encounter large folios. A
>>> subsequent commit will extend the functionality to allow for these cases.
>>>
>>> We restrict this flag to purely anonymous memory only.
>>>
>>> We separate out the vma_had_uncowed_parents() helper function for checking
>>> in should_relocate_anon() and introduce a new function
>>> vma_maybe_has_shared_anon_folios() which combines a check against this and
>>> any forked child anon_vma's.
>>>
>>> We carefully check for pinned folios in case a caller who holds a pin might
>>> make assumptions about index, mapping fields which we are about to
>>> manipulate.
>>>
>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> ---
>>>   include/linux/rmap.h             |   4 +
>>>   include/uapi/linux/mman.h        |   1 +
>>>   mm/internal.h                    |   1 +
>>>   mm/mremap.c                      | 403 +++++++++++++++++++++++++++++--
>>>   mm/vma.c                         |  77 ++++--
>>>   mm/vma.h                         |  36 ++-
>>>   tools/testing/vma/vma.c          |   5 +-
>>>   tools/testing/vma/vma_internal.h |  38 +++
>>>   8 files changed, 520 insertions(+), 45 deletions(-)
>>
>> [...snip...]
>>
>>> @@ -754,6 +797,209 @@ static unsigned long pmc_progress(struct pagetable_move_control *pmc)
>>>   	return old_addr < orig_old_addr ? 0 : old_addr - orig_old_addr;
>>>   }
>>>
>>> +/*
>>> + * If the folio mapped at the specified pte entry can have its index and mapping
>>> + * relocated, then do so.
>>> + *
>>> + * Returns the number of pages we have traversed, or 0 if the operation failed.
>>> + */
>>> +static unsigned long relocate_anon_pte(struct pagetable_move_control *pmc,
>>> +		struct pte_state *state, bool undo)
>>> +{
>>> +	struct folio *folio;
>>> +	struct vm_area_struct *old, *new;
>>> +	pgoff_t new_index;
>>> +	pte_t pte;
>>> +	unsigned long ret = 1;
>>> +	unsigned long old_addr = state->old_addr;
>>> +	unsigned long new_addr = state->new_addr;
>>> +
>>> +	old = pmc->old;
>>> +	new = pmc->new;
>>> +
>>> +	pte = ptep_get(state->ptep);
>>> +
>>> +	/* Ensure we have truly got an anon folio. */
>>> +	folio = vm_normal_folio(old, old_addr, pte);
>>> +	if (!folio)
>>> +		return ret;
>>> +
>>> +	folio_lock(folio);
>>> +
>>> +	/* No-op. */
>>> +	if (!folio_test_anon(folio) || folio_test_ksm(folio))
>>> +		goto out;
>>
>> I think the kernel should not observe any KSM pages during mremap
>> because it breaks KSM pages in prep_move_vma()?

Ah, that's the magic bit, thanks!

> 
> Right, nor should we observe !anon pages here since we already checked for
> that...
> 
> This is belt + braces. Maybe we should replace with VM_WARN_ON_ONCE()'s...?

Sure. Anything you can throw out probably reduces the overhead :)
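
(Roughly, i.e. something like the below in relocate_anon_pte() - assuming those
cases really are unreachable after prep_move_vma(); just a sketch, not code from
the series:)

	/*
	 * prep_move_vma() has already broken any KSM pages and we only get
	 * here for purely anonymous VMAs, so these should be unreachable --
	 * warn instead of silently bailing out.
	 */
	VM_WARN_ON_ONCE(!folio_test_anon(folio));
	VM_WARN_ON_ONCE(folio_test_ksm(folio));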

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-17 10:57         ` Lorenzo Stoakes
@ 2025-06-17 11:58           ` David Hildenbrand
  2025-06-17 12:47             ` Lorenzo Stoakes
  2025-06-20 18:59           ` Pedro Falcato
  1 sibling, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2025-06-17 11:58 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Pedro Falcato, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Rik van Riel, Harry Yoo, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel

On 17.06.25 12:57, Lorenzo Stoakes wrote:
> On Tue, Jun 17, 2025 at 10:45:53AM +0200, David Hildenbrand wrote:
>> mremap() is already an expensive operation ... so I think we need a pretty
>> convincing case to make this configurable by the user at all for each
>> individual mremap() invocation.
> 
> My measurements suggest, unless you hit a very unfortunate case of -huge
> faulted in range all mapped PTE- that the work involved is not all that
> much more substantial in terms of order of magnitude than a normal mremap()
> operation.

Which means we could at least try without such a flag.

Regarding MREMAP_MUST_RELOCATE_ANON, I absolutely hate it.

I'll reply to your comment to my other mail once I get to it.

Users that really care (testing) could figure out if merging worked by 
looking at /proc/. Other users ... no idea what they are even supposed 
to do in that case. Not mremap()? But what is the use case ...
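
(As an illustration of the "look at /proc/" approach - a test could simply count
how many VMAs in /proc/self/maps overlap the moved range; the helper below is
made up for the example, it is not from the selftests in this series:)

	#include <stdio.h>

	/* Count VMAs in /proc/self/maps overlapping [start, start + len). */
	static int count_vmas(unsigned long start, unsigned long len)
	{
		FILE *f = fopen("/proc/self/maps", "r");
		unsigned long vm_start, vm_end;
		char line[1024];
		int count = 0;

		if (!f)
			return -1;
		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, "%lx-%lx", &vm_start, &vm_end) != 2)
				continue;
			if (vm_start < start + len && vm_end > start)
				count++;
		}
		fclose(f);
		return count;	/* 1 => the moved range merged into a single VMA */
	}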

If the "not merged" case would be relevant, a workaround would be ... 
mremapping it simply back?

So if we can, let's just try without any of these flags first.

MREMAP_MUST_RELOCATE_ANON could always be added later on top, once the 
use case for it is clear. Removing it from this series would not make 
this series any less valuable I think.

(and of course, doing it without any flags will make this series much 
more valuable :P )

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-17 10:07     ` Lorenzo Stoakes
@ 2025-06-17 12:07       ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2025-06-17 12:07 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, Liam R . Howlett,
	Suren Baghdasaryan, Matthew Wilcox, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

> 
>>
>>> +	/* The above check should imply these. */
>>> +	VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
>>> +	VM_WARN_ON_ONCE(!PageAnonExclusive(folio_page(folio, 0)));
>>
>> This can trigger in one nasty case, where we can lose the PAE bit during
>> swapin (refault from the swapcache while the folio is under writeback, and
>> the device does not allow for modifying the data while under writeback).
> 
> Ugh god wasn't aware of that. So maybe drop this second one?

Yes.

> 
>>
>>> +
>>> +	/*
>>> +	 * A pinned folio implies that it will be used for a duration longer
>>> +	 * than that over which the mmap_lock is held, meaning that another part
>>> +	 * of the kernel may be making use of this folio.
>>> +	 *
>>> +	 * Since we are about to manipulate index & mapping fields, we cannot
>>> +	 * safely proceed because whatever has pinned this folio may then
>>> +	 * incorrectly assume these do not change.
>>> +	 */
>>> +	if (folio_maybe_dma_pinned(folio))
>>> +		goto out;
>>
>> As discussed, this can race with GUP-fast. So *maybe* we can just allow for
>> moving these.
> 
> I'm guessing you mean as discussed below? :P Or in the cover letter I've not
> read yet? :P

The latter .. IIRC :P It was late ...

> 
> Yeah, to be honest you shouldn't be fiddling with index, mapping anyway except
> via rmap logic.
> 
> I will audit access of these fields just to be safe.
> 

[...]

>>> +
>>> +	state.ptep = ptep_start;
>>> +	for (; !pte_done(&state); pte_next(&state, nr_pages)) {
>>> +		pte_t pte = ptep_get(state.ptep);
>>> +
>>> +		if (pte_none(pte) || !pte_present(pte)) {
>>> +			nr_pages = 1;
>>
>> What if we have
>>
>> (a) A migration entry (possibly we might fail migration and simply remap the
>> original folio)
>>
>> (b) A swap entry with a folio in the swapcache that we can refault.
>>
>> I don't think we can simply skip these ...
> 
> Good point... will investigate these cases.

migration entries are really nasty ... probably have to wait for the 
migration entry to become a present pte again.

swap entries ... we could lookup any folio in the swapcache and adjust that.

> 
>>
>>> +			continue;
>>> +		}
>>> +
>>> +		nr_pages = relocate_anon_pte(pmc, &state, undo);
>>> +		if (!nr_pages) {
>>> +			ret = false;
>>> +			goto out;
>>> +		}
>>> +	}
>>> +
>>> +	ret = true;
>>> +out:
>>> +	pte_unmap_unlock(ptep_start, state.ptl);
>>> +	return ret;
>>> +}
>>> +
>>> +static bool __relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
>>> +{
>>> +	pud_t *pudp;
>>> +	pmd_t *pmdp;
>>> +	unsigned long extent;
>>> +	struct mm_struct *mm = current->mm;
>>> +
>>> +	if (!pmc->len_in)
>>> +		return true;
>>> +
>>> +	for (; !pmc_done(pmc); pmc_next(pmc, extent)) {
>>> +		pmd_t pmd;
>>> +		pud_t pud;
>>> +
>>> +		extent = get_extent(NORMAL_PUD, pmc);
>>> +
>>> +		pudp = get_old_pud(mm, pmc->old_addr);
>>> +		if (!pudp)
>>> +			continue;
>>> +		pud = pudp_get(pudp);
>>> +
>>> +		if (pud_trans_huge(pud) || pud_devmap(pud))
>>> +			return false;
>>
>> We don't support PUD-size THP, why do we have to fail here?
> 
> This is just to be in line with other 'magical future where we have PUD THP'
> stuff in mremap.c.
> 
> A later commit that permits huge folio support actually lets us support these...
> 
>>
>>> +
>>> +		extent = get_extent(NORMAL_PMD, pmc);
>>> +		pmdp = get_old_pmd(mm, pmc->old_addr);
>>> +		if (!pmdp)
>>> +			continue;
>>> +		pmd = pmdp_get(pmdp);
>>> +
>>> +		if (is_swap_pmd(pmd) || pmd_trans_huge(pmd) ||
>>> +		    pmd_devmap(pmd))
>>> +			return false;
>>
>> Okay, this case could likely be handled later (present anon folio or
>> migration entry; everything else, we can skip).
> 
> Hmm, but how? The PMD cannot be traversed in this case?
> 
> 'Present' migration entry? Migration entries are non-present right? :) Or is it
> different at PMD?

"present anon folio" or "migration entry" :)

So the latter meant a PMD migration entry (that is, non-present).

[...]

>>>    	pmc.new = new_vma;
>>> +	if (relocate_anon) {
>>> +		lock_new_anon_vma(new_vma);
>>> +		pmc.relocate_locked = new_vma;
>>> +
>>> +		if (!relocate_anon_folios(&pmc, /* undo= */false)) {
>>> +			unsigned long start = new_vma->vm_start;
>>> +			unsigned long size = new_vma->vm_end - start;
>>> +
>>> +			/* Undo if fails. */
>>> +			relocate_anon_folios(&pmc, /* undo= */true);
>>
>> You'd assume this cannot fail, but I think it can: imagine concurrent
>> GUP-fast ...
> 
> Well if we change the racy code to ignore DMA-pinned folios we should be ok, right?

We completely block migration/swapout, or could they happen 
concurrently? I assume you'd block them already using the rmap locks in 
write mode.

> 
>>
>> I really wish we can find a way to not require the fallback.
> 
> Yeah the fallback is horrible but we really do need it. See the page table move
> fallback code for nightmares also :)
> 
> We could also alternatively:
> 
> - Have some kind of anon_vma fragmentation where some folios in range reference
>    a different anon_vma that we link to the original VMA (quite possibly very
>    broken though).
> 
> - Keep a track of folios somehow and separate them from the page table walk (but
>    then we risk races)
> 
> - Have some way of telling the kernel that such a situation exists with a new
>    object that can be pointed to by folio->mapping, that the rmap code recognise,
>    like essentially an 'anon_vma migration entry' which can fail.
> 
> I already considered combining this operation with the page table move
> operation, but the locking gets horrible and the undo is categorically much
> worse and I'm not sure it's actually workable.

Yeah, I have to further think about that. :(

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-17 11:58           ` David Hildenbrand
@ 2025-06-17 12:47             ` Lorenzo Stoakes
  0 siblings, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-17 12:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pedro Falcato, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Rik van Riel, Harry Yoo, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel

On Tue, Jun 17, 2025 at 01:58:59PM +0200, David Hildenbrand wrote:
> On 17.06.25 12:57, Lorenzo Stoakes wrote:
> > On Tue, Jun 17, 2025 at 10:45:53AM +0200, David Hildenbrand wrote:
> > > mremap() is already an expensive operation ... so I think we need a pretty
> > > convincing case to make this configurable by the user at all for each
> > > individual mremap() invocation.
> >
> > My measurements suggest, unless you hit a very unfortunate case of -huge
> > faulted in range all mapped PTE- that the work involved is not all that
> > much more substantial in terms of order of magnitude than a normal mremap()
> > operation.
>
> Which means we could at least try without such a flag.
>
> Regarding MREMAP_MUST_RELOCATE_ANON, I absolutely hate it.

Ack - this isn't life or death, we can drop it.

>
> I'll reply to your comment to my other mail once I get to it.

Thanks!

I know this is a gnarly series so appreciate you (and Harry of course!) looking.

>
> Users that really care (testing) could figure out if merging worked by
> looking at /proc/. Other users ... no idea what they are even supposed to do
> in that case. Not mremap()? But what is the use case ...
>
> If the "not merged" case would be relevant, a workaround would be ...
> mremapping it simply back?

True true.

>
> So if we can, let's just try without any of these flags first.

Ack.

>
> MREMAP_MUST_RELOCATE_ANON could always be added later on top, once the use
> case for it is clear. Removing it from this series would not make this
> series any less valuable I think.

Sure.

>
> (and of course, doing it without any flags will make this series much more
> valuable :P )

Yeah, I'd prefer this overall if possible...

>
> --
> Cheers,
>
> David / dhildenb
>
>

Thinking of undo, here's an idea:

1. increment refcount on folio
2. isolate from LRU
3. Add to linked list

Do this for all folios.

Then if we encounter a problem we can simply walk the list + fixup, drop
refcount, re-add to LRU.

It's ugly, but it guarantees we can undo what we set up?
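
Something like this, very roughly (all names made up, error handling elided -
the point is just that once everything is isolated and on the list, the undo
path can't fail):

	struct relocate_entry {
		struct list_head list;
		struct folio *folio;
		pgoff_t old_index;
	};

	/* Per folio, before we touch anything: pin, isolate, remember. */
	static bool relocate_prepare_folio(struct folio *folio,
					   struct list_head *folios)
	{
		struct relocate_entry *entry = kmalloc(sizeof(*entry), GFP_KERNEL);

		if (!entry)
			return false;

		folio_get(folio);			/* 1. take a reference */
		if (!folio_isolate_lru(folio)) {	/* 2. isolate from LRU */
			folio_put(folio);
			kfree(entry);
			return false;
		}
		entry->folio = folio;
		entry->old_index = folio->index;
		list_add_tail(&entry->list, folios);	/* 3. add to linked list */
		return true;
	}

	/* If anything goes wrong later: walk the list and put it all back. */
	static void relocate_undo_folios(struct list_head *folios)
	{
		struct relocate_entry *entry, *tmp;

		list_for_each_entry_safe(entry, tmp, folios, list) {
			/* Fixup (real code would restore folio->mapping too). */
			entry->folio->index = entry->old_index;
			folio_putback_lru(entry->folio);	/* re-add to LRU */
			folio_put(entry->folio);		/* drop our reference */
			list_del(&entry->list);
			kfree(entry);
		}
	}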

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-09 13:26 ` [PATCH 01/11] " Lorenzo Stoakes
  2025-06-16 20:58   ` David Hildenbrand
  2025-06-17 11:15   ` Harry Yoo
@ 2025-06-17 20:09   ` Lorenzo Stoakes
  2 siblings, 0 replies; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-17 20:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

Hi Andrew, I enclose a fixpatch to address a couple issues here.

Obviously there's a lot of ongoing review, but it's important to address known
problems as we go.

This address the two syzbot reports - one around folio locking in non-sleep
context due to PTE spinlock held [0] and the other around a lock misbalance due
to a coding error [1].

[0]: https://lore.kernel.org/all/aFEAPOozHsR1/PLI@ly-workstation/
[1]: https://lore.kernel.org/all/68512333.a70a0220.395abc.0205.GAE@google.com/

I will (almost certainly) find a better way to address [0]; I have an idea
already, but will put it in a respin at that point.

----8<----
From 1c0b878afb3c6f9cd8d8518df038182c560f4cc4 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Tue, 17 Jun 2025 20:56:25 +0100
Subject: [PATCH] fix syzbot reports

Use folio_trylock() to resolve
https://lore.kernel.org/all/aFEAPOozHsR1/PLI@ly-workstation/ and balance
lock/unlock in move_pgt_entry() to fix
https://lore.kernel.org/all/68512333.a70a0220.395abc.0205.GAE@google.com/

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mremap.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 2da064f8c898..a4ec69959fc7 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -601,12 +601,12 @@ static bool move_pgt_entry(struct pagetable_move_control *pmc,

 	if (!pmc->need_rmap_locks && should_take_rmap_locks(entry)) {
 		override_locks = true;
-
 		pmc->need_rmap_locks = true;
-		/* See comment in move_ptes() */
-		maybe_take_rmap_locks(pmc);
 	}

+	/* See comment in move_ptes() */
+	maybe_take_rmap_locks(pmc);
+
 	switch (entry) {
 	case NORMAL_PMD:
 		moved = move_normal_pmd(pmc, old_entry, new_entry);
@@ -824,7 +824,8 @@ static unsigned long relocate_anon_pte(struct pagetable_move_control *pmc,
 	if (!folio)
 		return ret;

-	folio_lock(folio);
+	if (!folio_trylock(folio))
+		return 0;

 	/* No-op. */
 	if (!folio_test_anon(folio) || folio_test_ksm(folio))
--
2.49.0

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-17 10:57         ` Lorenzo Stoakes
  2025-06-17 11:58           ` David Hildenbrand
@ 2025-06-20 18:59           ` Pedro Falcato
  2025-06-20 19:28             ` Lorenzo Stoakes
  1 sibling, 1 reply; 41+ messages in thread
From: Pedro Falcato @ 2025-06-20 18:59 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Rik van Riel, Harry Yoo, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel

On Tue, Jun 17, 2025 at 11:57:11AM +0100, Lorenzo Stoakes wrote:
> On Tue, Jun 17, 2025 at 10:45:53AM +0200, David Hildenbrand wrote:
> > mremap() is already an expensive operation ... so I think we need a pretty
> > convincing case to make this configurable by the user at all for each
> > individual mremap() invocation.
> 
> My measurements suggest, unless you hit a very unfortunate case of -huge
> faulted in range all mapped PTE- that the work involved is not all that
> much more substantial in terms of order of magnitude than a normal mremap()
> operation.
> 

Could you share your measurements and/or post them on the cover letter for the
next version?

If indeed it makes no practical difference, maybe we could try to enable it by
default and see what happens...

Or: a separate and maybe awful idea, but if the problem is the number of VMAs
maybe we could try harder based on the map count? i.e. if
map_count > (max_map_count / 2), try to relocate anon.
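
(i.e. something like this, purely illustrative - the threshold and helper name
are made up, not from the series:)

	/* Opt in to relocation once a process is burning through its VMA budget. */
	static bool should_try_relocate_anon(struct mm_struct *mm)
	{
		return mm->map_count > sysctl_max_map_count / 2;
	}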

-- 
Pedro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-20 18:59           ` Pedro Falcato
@ 2025-06-20 19:28             ` Lorenzo Stoakes
  2025-06-24  9:38               ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 19:28 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: David Hildenbrand, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Rik van Riel, Harry Yoo, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel

On Fri, Jun 20, 2025 at 07:59:17PM +0100, Pedro Falcato wrote:
> On Tue, Jun 17, 2025 at 11:57:11AM +0100, Lorenzo Stoakes wrote:
> > On Tue, Jun 17, 2025 at 10:45:53AM +0200, David Hildenbrand wrote:
> > > mremap() is already an expensive operation ... so I think we need a pretty
> > > convincing case to make this configurable by the user at all for each
> > > individual mremap() invocation.
> >
> > My measurements suggest, unless you hit a very unfortunate case of -huge
> > faulted in range all mapped PTE- that the work involved is not all that
> > much more substantial in terms of order of magnitude than a normal mremap()
> > operation.
> >
>
> Could you share your measurements and/or post them on the cover letter for the
> next version?

Yeah, am going to experiment and gather some data for the next respin and see
what might be possible.

I will present this kind of data then.

>
> If indeed it makes no practical difference, maybe we could try to enable it by
> default and see what happens...

Well it makes a difference, but the question is how much it matters (we have to
traverse every single PTE for faulted-in memory vs. if we move page tables we
can potentially move at PMD granularity saving 512 traversals, but if the folios
are large then we're not really slower...).
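
(To put a number on that, on x86-64 with 4 KiB pages:)

	/* One PMD-level page table move covers this many PTEs... */
	unsigned long ptes_per_pmd = PMD_SIZE / PAGE_SIZE;	/* 2 MiB / 4 KiB = 512 */
	/* ...each of which the relocation path has to visit individually. */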

I have some ideas... :)

>
> Or: separate but maybe awful idea, but if the problem is the number of VMAs
> maybe we could try harder based on the map count? i.e if
> map_count > (max_map_count / 2), try to relocate anon.

Interesting, though that'd make some things randomly merge and other stuff not,
and you really have to consistently do this stuff to make things mergeable.

Potentially deciding whether to do it based on heuristics isn't out of the realm
of possibility though.

Generally speaking I'm going to experiment and come back with something...

>
> --
> Pedro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-20 19:28             ` Lorenzo Stoakes
@ 2025-06-24  9:38               ` David Hildenbrand
  2025-06-24 10:19                 ` Lorenzo Stoakes
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2025-06-24  9:38 UTC (permalink / raw)
  To: Lorenzo Stoakes, Pedro Falcato
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, Liam R . Howlett,
	Suren Baghdasaryan, Matthew Wilcox, Rik van Riel, Harry Yoo,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Jakub Matena, Wei Yang, Barry Song, linux-mm, linux-kernel

On 20.06.25 21:28, Lorenzo Stoakes wrote:
> On Fri, Jun 20, 2025 at 07:59:17PM +0100, Pedro Falcato wrote:
>> On Tue, Jun 17, 2025 at 11:57:11AM +0100, Lorenzo Stoakes wrote:
>>> On Tue, Jun 17, 2025 at 10:45:53AM +0200, David Hildenbrand wrote:
>>>> mremap() is already an expensive operation ... so I think we need a pretty
>>>> convincing case to make this configurable by the user at all for each
>>>> individual mremap() invocation.
>>>
>>> My measurements suggest, unless you hit a very unfortunate case of -huge
>>> faulted in range all mapped PTE- that the work involved is not all that
>>> much more substantial in terms of order of magnitude than a normal mremap()
>>> operation.
>>>
>>
>> Could you share your measurements and/or post them on the cover letter for the
>> next version?
> 
> Yeah am going to experiment nad gather some data for the next respin and see
> what might be possible.
> 
> I will present this kind of data then.
> 
>>
>> If indeed it makes no practical difference, maybe we could try to enable it by
>> default and see what happens...
> 
> Well it makes a difference, but the question is how much it matters (we have to
> traverse every single PTE for faulted-in memory vs. if we move page tables we
> can potentially move at PMD granularity saving 512 traversals, but if the folios
> are large then we're not really slower...).
> 
> I have some ideas... :)

As a first step, we could have some global way to enable/disable the 
optimization system-wide. We could then learn if there is really any 
workload that notices the change, while still having a way to revert to 
the old behavior on affected systems easily.
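
(E.g. a simple vm sysctl - the name and placement are purely illustrative, not
a proposal for the actual interface:)

	/* 0 == current behaviour, 1 == always attempt anon relocation on move. */
	static int sysctl_mremap_relocate_anon __read_mostly;

	static const struct ctl_table mremap_sysctl_table[] = {
		{
			.procname	= "mremap_relocate_anon",
			.data		= &sysctl_mremap_relocate_anon,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec_minmax,
			.extra1		= SYSCTL_ZERO,
			.extra2		= SYSCTL_ONE,
		},
	};

	static int __init mremap_sysctl_init(void)
	{
		register_sysctl_init("vm", mremap_sysctl_table);
		return 0;
	}
	subsys_initcall(mremap_sysctl_init);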

Just a thought, I still hope we can avoid all that. Again, mremap() is 
not really known for being a very efficient operation.

> 
>>
>> Or: separate but maybe awful idea, but if the problem is the number of VMAs
>> maybe we could try harder based on the map count? i.e if
>> map_count > (max_map_count / 2), try to relocate anon.
> 
> Interesting, though that'd make some things randomly merge and other stuff not,
> and you really have to consistently do this stuff to make things mergeable.

Yes, I'd prefer if we can make it more predictable.

(Of course, the VMA region size could also be used as an input to a 
policy. e.g., small move -> much fragmentation -> merge, large move -> 
less fragmentation -> don't care. Knowing about the use cases that use 
mremap() of anon memory and how they might be affected could be very 
valuable. Maybe it's mostly moving a handful of pages where we most care 
about this optimization?).


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-24  9:38               ` David Hildenbrand
@ 2025-06-24 10:19                 ` Lorenzo Stoakes
  2025-06-24 12:05                   ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-24 10:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pedro Falcato, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Rik van Riel, Harry Yoo, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel

On Tue, Jun 24, 2025 at 11:38:59AM +0200, David Hildenbrand wrote:
> On 20.06.25 21:28, Lorenzo Stoakes wrote:
> > I have some ideas... :)

Note that I've been working hard on a respin, figuring out ways to
basically make it so we can't fail to set up folios (afaict) so we get
predictable undo.

Of course we make life very very hard for ourselves in mm :)

>
> As a first step, we could have some global way to enable/disable the
> optimization system-wide. We could then learn if there is really any
> workload that notices the change, while still having a way to revert to the
> old behavior on affected systems easily.

Yeah I was wondering if we could do something like this... I mean we could
hide it in /sys/kernel/mm worst case.

>
> Just a thought, I still hope we can avoid all that. Again, mremap() is not
> really known for being a very efficient operation.

Agreed, and I don't think we should microbenchmark it so much. I think as long
as it's roughly the same order of magnitude time taken then it should be fine?

>
> >
> > >
> > > Or: separate but maybe awful idea, but if the problem is the number of VMAs
> > > maybe we could try harder based on the map count? i.e if
> > > map_count > (max_map_count / 2), try to relocate anon.
> >
> > Interesting, though that'd make some things randomly merge and other stuff not,
> > and you really have to consistently do this stuff to make things mergeable.
>
> Yes, I'd prefer if we can make it more predictable.
>
> (Of course, the VMA region size could also be used as an input to a policy.
> e.g., small move -> much fragmentation -> merge, large move -> less
> fragmentation -> don't care. Knowing about the use cases that use mremap()
> of anon memory and how they might be affected could be very valuable. Maybe
> it's mostly moving a handful of pages where we most care about this
> optimization?).

I think fundamentally there are two problems:

1. Unexpected VMA fragmentation leading to later mremap() failure.
2. Unnecessary VMA proliferation.

So we could fix 1 with an 'allow multiple VMAs to be moved if no resize'
patch. And of course the relocate anon stuff is about 2.

In theory we could combine it, but things could become complicated as then
it's multiple VMA/anon_vma merges.

>
>
> --
> Cheers,
>
> David / dhildenb
>

Anyway, let me polish up the respin and we can see how that goes :)
stress-ng is helping...

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-24 10:19                 ` Lorenzo Stoakes
@ 2025-06-24 12:05                   ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2025-06-24 12:05 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Pedro Falcato, Andrew Morton, Vlastimil Babka, Jann Horn,
	Liam R . Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Rik van Riel, Harry Yoo, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Jakub Matena, Wei Yang, Barry Song,
	linux-mm, linux-kernel


>>>>
>>>> Or: separate but maybe awful idea, but if the problem is the number of VMAs
>>>> maybe we could try harder based on the map count? i.e if
>>>> map_count > (max_map_count / 2), try to relocate anon.
>>>
>>> Interesting, though that'd make some things randomly merge and other stuff not,
>>> and you really have to consistently do this stuff to make things mergeable.
>>
>> Yes, I'd prefer if we can make it more predictable.
>>
>> (Of course, the VMA region size could also be used as an input to a policy.
>> e.g., small move -> much fragmentation -> merge, large move -> less
>> fragmentation -> don't care. Knowing about the use cases that use mremap()
>> of anon memory and how they might be affected could be very valuable. Maybe
>> it's mostly moving a handful of pages where we most care about this
>> optimization?).
> 
> I think fundamentally there are two problems:
> 
> 1. Unexpected VMA fragmentation leading to later mremap() failure.
> 2. Unnecessary VMA proliferation.
> 
> So we could fix 1 with a 'allow multiple VMAs to be moved if no resize'
> patch.

Yes. Which might end up easier (well, okay, different level of 
complexity, at least not messing with folio->)

> And of course the relocate anon stuff is about 2.
> 
> In theory we could combine it, but things could become complicated as then
> it's mulitple VMA/anon_vma merges.

Yes.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
                   ` (12 preceding siblings ...)
  2025-06-17  5:42 ` Lai, Yi
@ 2025-06-25 15:44 ` Lorenzo Stoakes
  2025-06-25 15:58   ` Andrew Morton
  13 siblings, 1 reply; 41+ messages in thread
From: Lorenzo Stoakes @ 2025-06-25 15:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

Hi Andrew,

I'm doing a significant respin of this based on feedback, I was hoping to get it
out to you earlier, but it's turning out to be stubbornly tricky (rmap was a
mistake ;).

Dan's found a couple of bugs that are addressed in the respin, so for the sake of
stability can we drop this series for now and I'll get the reworked one out
asap?

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  2025-06-25 15:44 ` Lorenzo Stoakes
@ 2025-06-25 15:58   ` Andrew Morton
  0 siblings, 0 replies; 41+ messages in thread
From: Andrew Morton @ 2025-06-25 15:58 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Vlastimil Babka, Jann Horn, Liam R . Howlett, Suren Baghdasaryan,
	Matthew Wilcox, David Hildenbrand, Pedro Falcato, Rik van Riel,
	Harry Yoo, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Jakub Matena, Wei Yang, Barry Song, linux-mm,
	linux-kernel

On Wed, 25 Jun 2025 16:44:32 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> Hi Andrew,
> 
> I'm doing a significant respin of this based on feedback, I was hoping to get it
> out to you earlier, but it's turning out to be stubbornly tricky (rmap was a
> mistake ;).
> 
> Dan's found a couple bugs that are addressed in the respin, so for the sake of
> stability can we drop this series for now and I"ll get the reworked one out
> asap?
> 

Gone.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2025-06-25 15:58 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-09 13:26 [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON Lorenzo Stoakes
2025-06-09 13:26 ` [PATCH 01/11] " Lorenzo Stoakes
2025-06-16 20:58   ` David Hildenbrand
2025-06-17  6:37     ` Harry Yoo
2025-06-17  9:52       ` Lorenzo Stoakes
2025-06-17 10:01         ` David Hildenbrand
2025-06-17 10:07     ` Lorenzo Stoakes
2025-06-17 12:07       ` David Hildenbrand
2025-06-17 11:15   ` Harry Yoo
2025-06-17 11:24     ` Lorenzo Stoakes
2025-06-17 11:49       ` David Hildenbrand
2025-06-17 20:09   ` Lorenzo Stoakes
2025-06-09 13:26 ` [PATCH 02/11] mm/mremap: add MREMAP_MUST_RELOCATE_ANON Lorenzo Stoakes
2025-06-09 13:26 ` [PATCH 03/11] mm/mremap: add MREMAP[_MUST]_RELOCATE_ANON support for large folios Lorenzo Stoakes
2025-06-09 13:26 ` [PATCH 04/11] tools UAPI: Update copy of linux/mman.h from the kernel sources Lorenzo Stoakes
2025-06-09 13:26 ` [PATCH 05/11] tools/testing/selftests: add sys_mremap() helper to vm_util.h Lorenzo Stoakes
2025-06-09 13:26 ` [PATCH 06/11] tools/testing/selftests: add mremap() cases that merge normally Lorenzo Stoakes
2025-06-09 13:26 ` [PATCH 07/11] tools/testing/selftests: add MREMAP_RELOCATE_ANON merge test cases Lorenzo Stoakes
2025-06-09 13:26 ` [PATCH 08/11] tools/testing/selftests: expand mremap() tests for MREMAP_RELOCATE_ANON Lorenzo Stoakes
2025-06-09 13:26 ` [PATCH 09/11] tools/testing/selftests: have CoW self test use MREMAP_RELOCATE_ANON Lorenzo Stoakes
2025-06-09 13:26 ` [PATCH 10/11] tools/testing/selftests: test relocate anon in split huge page test Lorenzo Stoakes
2025-06-09 13:26 ` [PATCH 11/11] tools/testing/selftests: add MREMAP_RELOCATE_ANON fork tests Lorenzo Stoakes
2025-06-16 20:24 ` [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON David Hildenbrand
2025-06-16 20:41   ` David Hildenbrand
2025-06-17  8:34     ` Pedro Falcato
2025-06-17  8:45       ` David Hildenbrand
2025-06-17 10:57         ` Lorenzo Stoakes
2025-06-17 11:58           ` David Hildenbrand
2025-06-17 12:47             ` Lorenzo Stoakes
2025-06-20 18:59           ` Pedro Falcato
2025-06-20 19:28             ` Lorenzo Stoakes
2025-06-24  9:38               ` David Hildenbrand
2025-06-24 10:19                 ` Lorenzo Stoakes
2025-06-24 12:05                   ` David Hildenbrand
2025-06-17 10:20       ` Lorenzo Stoakes
2025-06-17 10:50   ` Lorenzo Stoakes
2025-06-17  5:42 ` Lai, Yi
2025-06-17  6:45   ` Harry Yoo
2025-06-17  9:33     ` Lorenzo Stoakes
2025-06-25 15:44 ` Lorenzo Stoakes
2025-06-25 15:58   ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).