linux-mm.kvack.org archive mirror
* [RFC PATCH 0/6] Remove XA_ZERO from error recovery of
@ 2025-08-15 19:10 Liam R. Howlett
  2025-08-15 19:10 ` [RFC PATCH 1/6] mm/mmap: Move exit_mmap() trace point Liam R. Howlett
                   ` (7 more replies)
  0 siblings, 8 replies; 34+ messages in thread
From: Liam R. Howlett @ 2025-08-15 19:10 UTC (permalink / raw)
  To: David Hildenbrand, Lorenzo Stoakes
  Cc: maple-tree, linux-mm, linux-kernel, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Andrew Morton,
	Jann Horn, Pedro Falcato, Charan Teja Kalla, shikemeng, kasong,
	nphamcs, bhe, baohua, chrisl, Matthew Wilcox, Liam R. Howlett

Before you read on, please take a moment to acknowledge that David
Hildenbrand asked for this, so I'm blaming mostly him :)

It is possible that the dup_mmap() call fails on allocating or setting
up a vma after the maple tree of the oldmm is copied.  Today, that
failure point is marked by inserting an XA_ZERO entry over the failure
point so that the exact location does not need to be communicated
through to exit_mmap().

However, a race exists in the tear down process because the dup_mmap()
drops the mmap lock before exit_mmap() can remove the partially set up
vma tree.  This means that other tasks may get to the mm tree and find
the invalid vma pointer (since it's an XA_ZERO entry), even though the
mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.

To remove the race fully, the tree must be cleaned up before dropping
the lock.  This is accomplished by extracting the vma cleanup in
exit_mmap() and changing the required functions to pass through the vma
search limit.

This does run the risk of increasing the possibility of finding no vmas
(which is already possible!) in code that isn't careful.

The passing of so many limits and variables was such a mess when the
dup_mmap() was introduced that it was avoided in favour of the XA_ZERO
entry marker, but since the swap case was the second time we've hit
cases of walking an almost-dead mm, here's the alternative to checking
MMF_UNSTABLE before wandering into other mm structs.

[1].  https://lore.kernel.org/all/2e8df53b-d953-43fb-9c69-7d7d60e95c9a@redhat.com/
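
To make the change of approach concrete, here is a rough before/after of
the dup_mmap() failure handling, condensed from patch 6/6 (error paths
and locking elided):

	/* today: mark the failure point and let exit_mmap() find it */
	if (mpnt) {
		mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
		mas_store(&vmi.mas, XA_ZERO_ENTRY);
		set_bit(MMF_OOM_SKIP, &mm->flags);
	}
	/* mmap lock is dropped; other walkers can still reach the tree */

	/* with this series: tear everything down before the lock is dropped */
	set_bit(MMF_OOM_SKIP, &mm->flags);
	if (max) {
		vma_iter_set(&vmi, 0);
		tmp = vma_next(&vmi);
		flush_cache_mm(mm);
		unmap_region(&vmi.mas, tmp, 0, max, max, NULL, NULL);
		vm_unacct_memory(tear_down_vmas(mm, &vmi, tmp, max));
	}
	__mt_destroy(&mm->mm_mt);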

Liam R. Howlett (6):
  mm/mmap: Move exit_mmap() trace point
  mm/mmap: Abstract vma clean up from exit_mmap()
  mm/vma: Add limits to unmap_region() for vmas
  mm/memory: Add tree limit to free_pgtables()
  mm/vma: Add page table limit to unmap_region()
  mm: Change dup_mmap() recovery

 mm/internal.h |  4 ++-
 mm/memory.c   | 13 ++++-----
 mm/mmap.c     | 80 ++++++++++++++++++++++++++++++++++-----------------
 mm/vma.c      | 10 +++++--
 mm/vma.h      |  1 +
 5 files changed, 70 insertions(+), 38 deletions(-)

-- 
2.47.2



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH 1/6] mm/mmap: Move exit_mmap() trace point
  2025-08-15 19:10 [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Liam R. Howlett
@ 2025-08-15 19:10 ` Liam R. Howlett
  2025-08-19 18:27   ` Lorenzo Stoakes
  2025-08-21 21:12   ` Chris Li
  2025-08-15 19:10 ` [RFC PATCH 2/6] mm/mmap: Abstract vma clean up from exit_mmap() Liam R. Howlett
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Liam R. Howlett @ 2025-08-15 19:10 UTC (permalink / raw)
  To: David Hildenbrand, Lorenzo Stoakes
  Cc: maple-tree, linux-mm, linux-kernel, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Andrew Morton,
	Jann Horn, Pedro Falcato, Charan Teja Kalla, shikemeng, kasong,
	nphamcs, bhe, baohua, chrisl, Matthew Wilcox, Liam R. Howlett

Move the trace point later in the function so that it is not skipped in
the event of a failed fork.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/mmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 7306253cc3b57..c4c315b480af7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1310,9 +1310,9 @@ void exit_mmap(struct mm_struct *mm)
 
 	BUG_ON(count != mm->map_count);
 
-	trace_exit_mmap(mm);
 destroy:
 	__mt_destroy(&mm->mm_mt);
+	trace_exit_mmap(mm);
 	mmap_write_unlock(mm);
 	vm_unacct_memory(nr_accounted);
 }
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 2/6] mm/mmap: Abstract vma clean up from exit_mmap()
  2025-08-15 19:10 [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Liam R. Howlett
  2025-08-15 19:10 ` [RFC PATCH 1/6] mm/mmap: Move exit_mmap() trace point Liam R. Howlett
@ 2025-08-15 19:10 ` Liam R. Howlett
  2025-08-19 18:38   ` Lorenzo Stoakes
  2025-08-15 19:10 ` [RFC PATCH 3/6] mm/vma: Add limits to unmap_region() for vmas Liam R. Howlett
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Liam R. Howlett @ 2025-08-15 19:10 UTC (permalink / raw)
  To: David Hildenbrand, Lorenzo Stoakes
  Cc: maple-tree, linux-mm, linux-kernel, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Andrew Morton,
	Jann Horn, Pedro Falcato, Charan Teja Kalla, shikemeng, kasong,
	nphamcs, bhe, baohua, chrisl, Matthew Wilcox, Liam R. Howlett

Create the new function tear_down_vmas() to remove a range of vmas.
exit_mmap() will be removing all the vmas.

This is necessary for future patches.

No functional changes intended.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/mmap.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index c4c315b480af7..0995a48b46d59 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1250,6 +1250,29 @@ int vm_brk_flags(unsigned long addr, unsigned long request, vm_flags_t vm_flags)
 }
 EXPORT_SYMBOL(vm_brk_flags);
 
+static inline
+unsigned long tear_down_vmas(struct mm_struct *mm, struct vma_iterator *vmi,
+		struct vm_area_struct *vma, unsigned long max)
+{
+	unsigned long nr_accounted = 0;
+	int count = 0;
+
+	mmap_assert_write_locked(mm);
+	vma_iter_set(vmi, vma->vm_end);
+	do {
+		if (vma->vm_flags & VM_ACCOUNT)
+			nr_accounted += vma_pages(vma);
+		vma_mark_detached(vma);
+		remove_vma(vma);
+		count++;
+		cond_resched();
+		vma = vma_next(vmi);
+	} while (vma && vma->vm_end <= max);
+
+	BUG_ON(count != mm->map_count);
+	return nr_accounted;
+}
+
 /* Release all mmaps. */
 void exit_mmap(struct mm_struct *mm)
 {
@@ -1257,7 +1280,6 @@ void exit_mmap(struct mm_struct *mm)
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
 	VMA_ITERATOR(vmi, mm, 0);
-	int count = 0;
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
@@ -1297,18 +1319,7 @@ void exit_mmap(struct mm_struct *mm)
 	 * enabled, without holding any MM locks besides the unreachable
 	 * mmap_write_lock.
 	 */
-	vma_iter_set(&vmi, vma->vm_end);
-	do {
-		if (vma->vm_flags & VM_ACCOUNT)
-			nr_accounted += vma_pages(vma);
-		vma_mark_detached(vma);
-		remove_vma(vma);
-		count++;
-		cond_resched();
-		vma = vma_next(&vmi);
-	} while (vma && likely(!xa_is_zero(vma)));
-
-	BUG_ON(count != mm->map_count);
+	nr_accounted = tear_down_vmas(mm, &vmi, vma, ULONG_MAX);
 
 destroy:
 	__mt_destroy(&mm->mm_mt);
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 3/6] mm/vma: Add limits to unmap_region() for vmas
  2025-08-15 19:10 [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Liam R. Howlett
  2025-08-15 19:10 ` [RFC PATCH 1/6] mm/mmap: Move exit_mmap() trace point Liam R. Howlett
  2025-08-15 19:10 ` [RFC PATCH 2/6] mm/mmap: Abstract vma clean up from exit_mmap() Liam R. Howlett
@ 2025-08-15 19:10 ` Liam R. Howlett
  2025-08-19 18:48   ` Lorenzo Stoakes
  2025-08-15 19:10 ` [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables() Liam R. Howlett
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Liam R. Howlett @ 2025-08-15 19:10 UTC (permalink / raw)
  To: David Hildenbrand, Lorenzo Stoakes
  Cc: maple-tree, linux-mm, linux-kernel, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Andrew Morton,
	Jann Horn, Pedro Falcato, Charan Teja Kalla, shikemeng, kasong,
	nphamcs, bhe, baohua, chrisl, Matthew Wilcox, Liam R. Howlett

Add a limit to the vma search instead of using the start and end of the
one passed in.

No functional changes intended.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/vma.c | 6 ++++--
 mm/vma.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index 3b12c7579831b..fd270345c25d3 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -474,6 +474,7 @@ void remove_vma(struct vm_area_struct *vma)
  * Called with the mm semaphore held.
  */
 void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
+		unsigned long vma_min, unsigned long vma_max,
 		struct vm_area_struct *prev, struct vm_area_struct *next)
 {
 	struct mm_struct *mm = vma->vm_mm;
@@ -481,7 +482,7 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
 
 	tlb_gather_mmu(&tlb, mm);
 	update_hiwater_rss(mm);
-	unmap_vmas(&tlb, mas, vma, vma->vm_start, vma->vm_end, vma->vm_end,
+	unmap_vmas(&tlb, mas, vma, vma_min, vma_max, vma_max,
 		   /* mm_wr_locked = */ true);
 	mas_set(mas, vma->vm_end);
 	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
@@ -2417,7 +2418,8 @@ static int __mmap_new_file_vma(struct mmap_state *map,
 
 		vma_iter_set(vmi, vma->vm_end);
 		/* Undo any partial mapping done by a device driver. */
-		unmap_region(&vmi->mas, vma, map->prev, map->next);
+		unmap_region(&vmi->mas, vma, vma->vm_start, vma->vm_end,
+			     map->prev, map->next);
 
 		return error;
 	}
diff --git a/mm/vma.h b/mm/vma.h
index b123a9cdedb0d..336dae295853e 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -281,6 +281,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 void remove_vma(struct vm_area_struct *vma);
 
 void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
+		unsigned long min, unsigned long max,
 		struct vm_area_struct *prev, struct vm_area_struct *next);
 
 /* We are about to modify the VMA's flags. */
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables()
  2025-08-15 19:10 [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Liam R. Howlett
                   ` (2 preceding siblings ...)
  2025-08-15 19:10 ` [RFC PATCH 3/6] mm/vma: Add limits to unmap_region() for vmas Liam R. Howlett
@ 2025-08-15 19:10 ` Liam R. Howlett
  2025-08-18 15:36   ` Lorenzo Stoakes
  2025-08-19 19:14   ` Lorenzo Stoakes
  2025-08-15 19:10 ` [RFC PATCH 5/6] mm/vma: Add page table limit to unmap_region() Liam R. Howlett
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Liam R. Howlett @ 2025-08-15 19:10 UTC (permalink / raw)
  To: David Hildenbrand, Lorenzo Stoakes
  Cc: maple-tree, linux-mm, linux-kernel, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Andrew Morton,
	Jann Horn, Pedro Falcato, Charan Teja Kalla, shikemeng, kasong,
	nphamcs, bhe, baohua, chrisl, Matthew Wilcox, Liam R. Howlett

The ceiling and the tree search limit need to be separate arguments for
a future change to the failed fork recovery path.

No functional changes intended.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/internal.h | 4 +++-
 mm/memory.c   | 7 ++++---
 mm/mmap.c     | 2 +-
 mm/vma.c      | 3 ++-
 4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 45b725c3dc030..f9a278ac76d83 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -444,7 +444,9 @@ void folio_activate(struct folio *folio);
 
 void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		   struct vm_area_struct *start_vma, unsigned long floor,
-		   unsigned long ceiling, bool mm_wr_locked);
+		   unsigned long ceiling, unsigned long tree_max,
+		   bool mm_wr_locked);
+
 void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 
 struct zap_details;
diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b718471..3346514562bba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -371,7 +371,8 @@ void free_pgd_range(struct mmu_gather *tlb,
 
 void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		   struct vm_area_struct *vma, unsigned long floor,
-		   unsigned long ceiling, bool mm_wr_locked)
+		   unsigned long ceiling, unsigned long tree_max,
+		   bool mm_wr_locked)
 {
 	struct unlink_vma_file_batch vb;
 
@@ -385,7 +386,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		 * Note: USER_PGTABLES_CEILING may be passed as ceiling and may
 		 * be 0.  This will underflow and is okay.
 		 */
-		next = mas_find(mas, ceiling - 1);
+		next = mas_find(mas, tree_max - 1);
 		if (unlikely(xa_is_zero(next)))
 			next = NULL;
 
@@ -405,7 +406,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		 */
 		while (next && next->vm_start <= vma->vm_end + PMD_SIZE) {
 			vma = next;
-			next = mas_find(mas, ceiling - 1);
+			next = mas_find(mas, tree_max - 1);
 			if (unlikely(xa_is_zero(next)))
 				next = NULL;
 			if (mm_wr_locked)
diff --git a/mm/mmap.c b/mm/mmap.c
index 0995a48b46d59..eba2bc81bc749 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1311,7 +1311,7 @@ void exit_mmap(struct mm_struct *mm)
 	mt_clear_in_rcu(&mm->mm_mt);
 	vma_iter_set(&vmi, vma->vm_end);
 	free_pgtables(&tlb, &vmi.mas, vma, FIRST_USER_ADDRESS,
-		      USER_PGTABLES_CEILING, true);
+		      USER_PGTABLES_CEILING, USER_PGTABLES_CEILING, true);
 	tlb_finish_mmu(&tlb);
 
 	/*
diff --git a/mm/vma.c b/mm/vma.c
index fd270345c25d3..aa75ca8618609 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -486,6 +486,7 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
 		   /* mm_wr_locked = */ true);
 	mas_set(mas, vma->vm_end);
 	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
+		      next ? next->vm_start : USER_PGTABLES_CEILING,
 		      next ? next->vm_start : USER_PGTABLES_CEILING,
 		      /* mm_wr_locked = */ true);
 	tlb_finish_mmu(&tlb);
@@ -1232,7 +1233,7 @@ static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
 	mas_set(mas_detach, 1);
 	/* start and end may be different if there is no prev or next vma. */
 	free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start,
-		      vms->unmap_end, mm_wr_locked);
+		      vms->unmap_end, vms->unmap_end, mm_wr_locked);
 	tlb_finish_mmu(&tlb);
 	vms->clear_ptes = false;
 }
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 5/6] mm/vma: Add page table limit to unmap_region()
  2025-08-15 19:10 [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Liam R. Howlett
                   ` (3 preceding siblings ...)
  2025-08-15 19:10 ` [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables() Liam R. Howlett
@ 2025-08-15 19:10 ` Liam R. Howlett
  2025-08-19 19:27   ` Lorenzo Stoakes
  2025-08-15 19:10 ` [RFC PATCH 6/6] mm: Change dup_mmap() recovery Liam R. Howlett
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Liam R. Howlett @ 2025-08-15 19:10 UTC (permalink / raw)
  To: David Hildenbrand, Lorenzo Stoakes
  Cc: maple-tree, linux-mm, linux-kernel, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Andrew Morton,
	Jann Horn, Pedro Falcato, Charan Teja Kalla, shikemeng, kasong,
	nphamcs, bhe, baohua, chrisl, Matthew Wilcox, Liam R. Howlett

The unmap_region() calls need to pass through the page table limit for a
future patch.

No functional changes intended.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/vma.c | 5 +++--
 mm/vma.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index aa75ca8618609..39f3b55a020b2 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -474,7 +474,7 @@ void remove_vma(struct vm_area_struct *vma)
  * Called with the mm semaphore held.
  */
 void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
-		unsigned long vma_min, unsigned long vma_max,
+		unsigned long vma_min, unsigned long vma_max, unsigned long pg_max,
 		struct vm_area_struct *prev, struct vm_area_struct *next)
 {
 	struct mm_struct *mm = vma->vm_mm;
@@ -487,7 +487,7 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
 	mas_set(mas, vma->vm_end);
 	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
 		      next ? next->vm_start : USER_PGTABLES_CEILING,
-		      next ? next->vm_start : USER_PGTABLES_CEILING,
+		      pg_max,
 		      /* mm_wr_locked = */ true);
 	tlb_finish_mmu(&tlb);
 }
@@ -2420,6 +2420,7 @@ static int __mmap_new_file_vma(struct mmap_state *map,
 		vma_iter_set(vmi, vma->vm_end);
 		/* Undo any partial mapping done by a device driver. */
 		unmap_region(&vmi->mas, vma, vma->vm_start, vma->vm_end,
+			     map->next ? map->next->vm_start : USER_PGTABLES_CEILING,
 			     map->prev, map->next);
 
 		return error;
diff --git a/mm/vma.h b/mm/vma.h
index 336dae295853e..ba203c0c1d89d 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -281,7 +281,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 void remove_vma(struct vm_area_struct *vma);
 
 void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
-		unsigned long min, unsigned long max,
+		unsigned long min, unsigned long max, unsigned long pg_max,
 		struct vm_area_struct *prev, struct vm_area_struct *next);
 
 /* We are about to modify the VMA's flags. */
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 6/6] mm: Change dup_mmap() recovery
  2025-08-15 19:10 [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Liam R. Howlett
                   ` (4 preceding siblings ...)
  2025-08-15 19:10 ` [RFC PATCH 5/6] mm/vma: Add page table limit to unmap_region() Liam R. Howlett
@ 2025-08-15 19:10 ` Liam R. Howlett
  2025-08-18 15:12   ` Lorenzo Stoakes
  2025-08-19 20:33   ` Lorenzo Stoakes
  2025-08-15 19:49 ` [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Jann Horn
  2025-08-18  9:44 ` David Hildenbrand
  7 siblings, 2 replies; 34+ messages in thread
From: Liam R. Howlett @ 2025-08-15 19:10 UTC (permalink / raw)
  To: David Hildenbrand, Lorenzo Stoakes
  Cc: maple-tree, linux-mm, linux-kernel, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Andrew Morton,
	Jann Horn, Pedro Falcato, Charan Teja Kalla, shikemeng, kasong,
	nphamcs, bhe, baohua, chrisl, Matthew Wilcox, Liam R. Howlett

When the dup_mmap() fails during the vma duplication or setup, don't
write the XA_ZERO entry in the vma tree.  Instead, destroy the tree and
free the new resources, leaving an empty vma tree.

Using XA_ZERO introduced races where the vma could be found between
dup_mmap() dropping all locks and exit_mmap() taking the locks.  The
race can occur because the mm can be reached through the other trees
via successfully copied vmas and other methods such as the swapoff code.

XA_ZERO was marking the location to stop vma removal and pagetable
freeing.  The newly created arguments to the unmap_vmas() and
free_pgtables() serve this function.

Replacing the XA_ZERO entry use with the new argument list also means
the checks for xa_is_zero() are no longer necessary so these are also
removed.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/memory.c |  6 +-----
 mm/mmap.c   | 39 ++++++++++++++++++++++++++++-----------
 2 files changed, 29 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3346514562bba..8cd58f171bae4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -387,8 +387,6 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		 * be 0.  This will underflow and is okay.
 		 */
 		next = mas_find(mas, tree_max - 1);
-		if (unlikely(xa_is_zero(next)))
-			next = NULL;
 
 		/*
 		 * Hide vma from rmap and truncate_pagecache before freeing
@@ -407,8 +405,6 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		while (next && next->vm_start <= vma->vm_end + PMD_SIZE) {
 			vma = next;
 			next = mas_find(mas, tree_max - 1);
-			if (unlikely(xa_is_zero(next)))
-				next = NULL;
 			if (mm_wr_locked)
 				vma_start_write(vma);
 			unlink_anon_vmas(vma);
@@ -1978,7 +1974,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
 				 mm_wr_locked);
 		hugetlb_zap_end(vma, &details);
 		vma = mas_find(mas, tree_end - 1);
-	} while (vma && likely(!xa_is_zero(vma)));
+	} while (vma);
 	mmu_notifier_invalidate_range_end(&range);
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index eba2bc81bc749..5acc0b5f8c14a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1288,7 +1288,7 @@ void exit_mmap(struct mm_struct *mm)
 	arch_exit_mmap(mm);
 
 	vma = vma_next(&vmi);
-	if (!vma || unlikely(xa_is_zero(vma))) {
+	if (!vma) {
 		/* Can happen if dup_mmap() received an OOM */
 		mmap_read_unlock(mm);
 		mmap_write_lock(mm);
@@ -1858,20 +1858,37 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		ksm_fork(mm, oldmm);
 		khugepaged_fork(mm, oldmm);
 	} else {
+		unsigned long max;
 
 		/*
-		 * The entire maple tree has already been duplicated. If the
-		 * mmap duplication fails, mark the failure point with
-		 * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
-		 * stop releasing VMAs that have not been duplicated after this
-		 * point.
+		 * The entire maple tree has already been duplicated, but
+		 * replacing the vmas failed at mpnt (which could be NULL if
+		 * all were allocated but the last vma was not fully set up). Use
+		 * the start address of the failure point to clean up the half
+		 * initialized tree.
 		 */
-		if (mpnt) {
-			mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
-			mas_store(&vmi.mas, XA_ZERO_ENTRY);
-			/* Avoid OOM iterating a broken tree */
-			set_bit(MMF_OOM_SKIP, &mm->flags);
+		if (!mm->map_count) {
+			/* zero vmas were written to the new tree. */
+			max = 0;
+		} else if (mpnt) {
+			/* mid-tree failure */
+			max = mpnt->vm_start;
+		} else {
+			/* All vmas were written to the new tree */
+			max = ULONG_MAX;
 		}
+
+		/* Hide mm from oom killer because the memory is being freed */
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+		if (max) {
+			vma_iter_set(&vmi, 0);
+			tmp = vma_next(&vmi);
+			flush_cache_mm(mm);
+			unmap_region(&vmi.mas, tmp, 0, max, max, NULL, NULL);
+			charge = tear_down_vmas(mm, &vmi, tmp, max);
+			vm_unacct_memory(charge);
+		}
+		__mt_destroy(&mm->mm_mt);
 		/*
 		 * The mm_struct is going to exit, but the locks will be dropped
 		 * first.  Set the mm_struct as unstable is advisable as it is
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 0/6] Remove XA_ZERO from error recovery of
  2025-08-15 19:10 [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Liam R. Howlett
                   ` (5 preceding siblings ...)
  2025-08-15 19:10 ` [RFC PATCH 6/6] mm: Change dup_mmap() recovery Liam R. Howlett
@ 2025-08-15 19:49 ` Jann Horn
  2025-08-18 15:48   ` Liam R. Howlett
  2025-08-18  9:44 ` David Hildenbrand
  7 siblings, 1 reply; 34+ messages in thread
From: Jann Horn @ 2025-08-15 19:49 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: David Hildenbrand, Lorenzo Stoakes, maple-tree, linux-mm,
	linux-kernel, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Andrew Morton, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

On Fri, Aug 15, 2025 at 9:10 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> Before you read on, please take a moment to acknowledge that David
> Hildenbrand asked for this, so I'm blaming mostly him :)
>
> It is possible that the dup_mmap() call fails on allocating or setting
> up a vma after the maple tree of the oldmm is copied.  Today, that
> failure point is marked by inserting an XA_ZERO entry over the failure
> point so that the exact location does not need to be communicated
> through to exit_mmap().

Overall: Yes please, I'm in favor of getting rid of that XA_ZERO special case.

> However, a race exists in the tear down process because the dup_mmap()
> drops the mmap lock before exit_mmap() can remove the partially set up
> vma tree.  This means that other tasks may get to the mm tree and find
> the invalid vma pointer (since it's an XA_ZERO entry), even though the
> mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
>
> To remove the race fully, the tree must be cleaned up before dropping
> the lock.  This is accomplished by extracting the vma cleanup in
> exit_mmap() and changing the required functions to pass through the vma
> search limit.

It really seems to me like, instead of tearing down the whole tree on
this failure path, we should be able to remove those entries in the
cloned vma tree that haven't been copied yet and then proceed as
normal. I understand that this is complicated because of maple tree
weirdness; but can't we somehow fix the wr_rebalance case to not
allocate more memory when reducing the number of tree nodes?
Surely there's some way to do that. A really stupid suggestion: As
long as wr_rebalance is guaranteed to not increase the number of
nodes, we could make do with a global-mutex-protected system-global
preallocation of significantly less than 64 maple tree nodes, right?
We could even use that in RCU mode, as long as we are willing to take
a synchronize_rcu() penalty on this "we really want to wipe some tree
elements" slowpath.

It feels like we're adding more and more weird contortions caused by
the kinda bizarre "you can't reliably remove tree elements" property
of maple trees, and I really feel like that complexity should be
pushed down into the maple tree implementation instead.
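
(To sketch the shape of that suggestion with entirely made-up names, not
working code:

	/* hypothetical: a small, mutex-protected reserve of nodes */
	static DEFINE_MUTEX(mt_reserve_lock);
	static struct maple_node *mt_reserve[MT_RESERVE_NR];	/* far fewer than 64 */

	mutex_lock(&mt_reserve_lock);
	/* wipe the not-yet-copied range, drawing nodes only from mt_reserve[] */
	mutex_unlock(&mt_reserve_lock);

with the reserve refilled outside the slowpath, and a synchronize_rcu()
penalty before node reuse when the tree is in RCU mode.)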


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 0/6] Remove XA_ZERO from error recovery of
  2025-08-15 19:10 [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Liam R. Howlett
                   ` (6 preceding siblings ...)
  2025-08-15 19:49 ` [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Jann Horn
@ 2025-08-18  9:44 ` David Hildenbrand
  2025-08-18 14:26   ` Charan Teja Kalla
  2025-08-18 15:47   ` Lorenzo Stoakes
  7 siblings, 2 replies; 34+ messages in thread
From: David Hildenbrand @ 2025-08-18  9:44 UTC (permalink / raw)
  To: Liam R. Howlett, Lorenzo Stoakes
  Cc: maple-tree, linux-mm, linux-kernel, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Andrew Morton,
	Jann Horn, Pedro Falcato, Charan Teja Kalla, shikemeng, kasong,
	nphamcs, bhe, baohua, chrisl, Matthew Wilcox

On 15.08.25 21:10, Liam R. Howlett wrote:
> Before you read on, please take a moment to acknowledge that David
> Hildenbrand asked for this, so I'm blaming mostly him :)

:)

> 
> It is possible that the dup_mmap() call fails on allocating or setting
> up a vma after the maple tree of the oldmm is copied.  Today, that
> failure point is marked by inserting an XA_ZERO entry over the failure
> point so that the exact location does not need to be communicated
> through to exit_mmap().
> 
> However, a race exists in the tear down process because the dup_mmap()
> drops the mmap lock before exit_mmap() can remove the partially set up
> vma tree.  This means that other tasks may get to the mm tree and find
> the invalid vma pointer (since it's an XA_ZERO entry), even though the
> mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
> 
> To remove the race fully, the tree must be cleaned up before dropping
> the lock.  This is accomplished by extracting the vma cleanup in
> exit_mmap() and changing the required functions to pass through the vma
> search limit.
> 
> This does run the risk of increasing the possibility of finding no vmas
> (which is already possible!) in code that isn't careful.

Right, it would also happen if __mt_dup() fails I guess.

> 
> The passing of so many limits and variables was such a mess when the
> dup_mmap() was introduced that it was avoided in favour of the XA_ZERO
> entry marker, but since the swap case was the second time we've hit
> cases of walking an almost-dead mm, here's the alternative to checking
> MMF_UNSTABLE before wandering into other mm structs.

Changes look fairly small and reasonable, so I really like this.

I agree with Jann that doing a partial teardown might be even better, 
but code-wise I suspect it might end up with a lot more churn and weird 
allocation-corner-cases to handle.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 0/6] Remove XA_ZERO from error recovery of
  2025-08-18  9:44 ` David Hildenbrand
@ 2025-08-18 14:26   ` Charan Teja Kalla
  2025-08-18 14:54     ` Liam R. Howlett
  2025-08-18 15:47   ` Lorenzo Stoakes
  1 sibling, 1 reply; 34+ messages in thread
From: Charan Teja Kalla @ 2025-08-18 14:26 UTC (permalink / raw)
  To: David Hildenbrand, Liam R. Howlett, Lorenzo Stoakes
  Cc: maple-tree, linux-mm, linux-kernel, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Andrew Morton,
	Jann Horn, Pedro Falcato, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, Matthew Wilcox


On 8/18/2025 3:14 PM, David Hildenbrand wrote:
>> Before you read on, please take a moment to acknowledge that David
>> Hildenbrand asked for this, so I'm blaming mostly him 🙂
> 

Just curious if we can go with the suggested MMF_UNSTABLE check just
for 6.12, as it is LTS.  I can't see this issue on branches earlier
than it.  Also, this patchset does not apply cleanly on 6.12.

> 🙂



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 0/6] Remove XA_ZERO from error recovery of
  2025-08-18 14:26   ` Charan Teja Kalla
@ 2025-08-18 14:54     ` Liam R. Howlett
  0 siblings, 0 replies; 34+ messages in thread
From: Liam R. Howlett @ 2025-08-18 14:54 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: David Hildenbrand, Lorenzo Stoakes, maple-tree, linux-mm,
	linux-kernel, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Andrew Morton, Jann Horn, Pedro Falcato, shikemeng,
	kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

* Charan Teja Kalla <quic_charante@quicinc.com> [250818 10:27]:
> 
> On 8/18/2025 3:14 PM, David Hildenbrand wrote:
> >> Before you read on, please take a moment to acknowledge that David
> >> Hildenbrand asked for this, so I'm blaming mostly him 🙂
> > 
> 
> Just curious if we can go with the suggested MMF_UNSTABLE check just
> for 6.12, as it is LTS.  I can't see this issue on branches earlier
> than it.  Also, this patchset does not apply cleanly on 6.12.

This was developed against mm-new to be applied and potentially
backported, but it's an RFC right now so it won't be applied anywhere.

I think you still should be checking the MMF_UNSTABLE in the swap path,
as David suggested as a hotfix [1].  Considering it's a single mm global
check before iterating, it should be checked regardless of what happens
with this RFC.
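
Something along these lines is what I mean - a sketch only, the exact
placement in unuse_mm() and the return value are assumptions:

	/* in unuse_mm(), before walking another mm's vma tree */
	mmap_read_lock(mm);
	if (test_bit(MMF_UNSTABLE, &mm->flags)) {
		/* dup_mmap() failed part way; don't trust this tree */
		mmap_read_unlock(mm);
		return 0;
	}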

Thanks,
Liam

[1].  https://lore.kernel.org/linux-mm/9178bf98-2ea7-4ad8-ad43-cdcc02ab863d@redhat.com/


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 6/6] mm: Change dup_mmap() recovery
  2025-08-15 19:10 ` [RFC PATCH 6/6] mm: Change dup_mmap() recovery Liam R. Howlett
@ 2025-08-18 15:12   ` Lorenzo Stoakes
  2025-08-18 15:29     ` Lorenzo Stoakes
  2025-08-19 20:33   ` Lorenzo Stoakes
  1 sibling, 1 reply; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-18 15:12 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

It seems your series clashes with my mm flags change, and this patch is a
victim; it won't apply against mm-new at all :(


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 6/6] mm: Change dup_mmap() recovery
  2025-08-18 15:12   ` Lorenzo Stoakes
@ 2025-08-18 15:29     ` Lorenzo Stoakes
  0 siblings, 0 replies; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-18 15:29 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

[-- Attachment #1: Type: text/plain, Size: 413 bytes --]

On Mon, Aug 18, 2025 at 04:12:59PM +0100, Lorenzo Stoakes wrote:
> It seems your series clashes with my mm flags change, and this patch is a
> victim; it won't apply against mm-new at all :(

I rebased it/fixed mm flags stuff locally (manually) so I could test, so I
thought - why not actually share that here to make your life easier :P

It's attached, if I didn't mess up with neomutt (big IF)

Cheers, Lorenzo

[-- Attachment #2: 0006_mm_change_dup_mmap_recovery_REBASED.patch --]
[-- Type: text/plain, Size: 4327 bytes --]

From 16b7f9d7042de8cb7d6f86b815d3075cedcf1132 Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Date: Mon, 18 Aug 2025 16:19:46 +0100
Subject: [PATCH RFC 6/6] mm: change dup_mmap() recovery

When the dup_mmap() fails during the vma duplication or setup, don't
write the XA_ZERO entry in the vma tree.  Instead, destroy the tree and
free the new resources, leaving an empty vma tree.

Using XA_ZERO introduced races where the vma could be found between
dup_mmap() dropping all locks and exit_mmap() taking the locks.  The
race can occur because the mm can be reached through the other trees
via successfully copied vmas and other methods such as the swapoff code.

XA_ZERO was marking the location to stop vma removal and pagetable
freeing.  The newly created arguments to the unmap_vmas() and
free_pgtables() serve this function.

Replacing the XA_ZERO entry use with the new argument list also means
the checks for xa_is_zero() are no longer necessary so these are also
removed.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/memory.c |  6 +-----
 mm/mmap.c   | 39 ++++++++++++++++++++++++++++-----------
 2 files changed, 29 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 9f584c5e4f43..218b496dc9ad 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -387,8 +387,6 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		 * be 0.  This will underflow and is okay.
 		 */
 		next = mas_find(mas, tree_max - 1);
-		if (unlikely(xa_is_zero(next)))
-			next = NULL;

 		/*
 		 * Hide vma from rmap and truncate_pagecache before freeing
@@ -407,8 +405,6 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		while (next && next->vm_start <= vma->vm_end + PMD_SIZE) {
 			vma = next;
 			next = mas_find(mas, tree_max - 1);
-			if (unlikely(xa_is_zero(next)))
-				next = NULL;
 			if (mm_wr_locked)
 				vma_start_write(vma);
 			unlink_anon_vmas(vma);
@@ -2106,7 +2102,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
 				 mm_wr_locked);
 		hugetlb_zap_end(vma, &details);
 		vma = mas_find(mas, tree_end - 1);
-	} while (vma && likely(!xa_is_zero(vma)));
+	} while (vma);
 	mmu_notifier_invalidate_range_end(&range);
 }

diff --git a/mm/mmap.c b/mm/mmap.c
index b49df82817ba..ecfeb1987156 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1288,7 +1288,7 @@ void exit_mmap(struct mm_struct *mm)
 	arch_exit_mmap(mm);

 	vma = vma_next(&vmi);
-	if (!vma || unlikely(xa_is_zero(vma))) {
+	if (!vma) {
 		/* Can happen if dup_mmap() received an OOM */
 		mmap_read_unlock(mm);
 		mmap_write_lock(mm);
@@ -1858,20 +1858,37 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		ksm_fork(mm, oldmm);
 		khugepaged_fork(mm, oldmm);
 	} else {
+		unsigned long max;

 		/*
-		 * The entire maple tree has already been duplicated. If the
-		 * mmap duplication fails, mark the failure point with
-		 * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
-		 * stop releasing VMAs that have not been duplicated after this
-		 * point.
+		 * The entire maple tree has already been duplicated, but
+		 * replacing the vmas failed at mpnt (which could be NULL if
+		 * all were allocated but the last vma was not fully set up). Use
+		 * the start address of the failure point to clean up the half
+		 * initialized tree.
 		 */
-		if (mpnt) {
-			mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
-			mas_store(&vmi.mas, XA_ZERO_ENTRY);
-			/* Avoid OOM iterating a broken tree */
-			mm_flags_set(MMF_OOM_SKIP, mm);
+		if (!mm->map_count) {
+			/* zero vmas were written to the new tree. */
+			max = 0;
+		} else if (mpnt) {
+			/* mid-tree failure */
+			max = mpnt->vm_start;
+		} else {
+			/* All vmas were written to the new tree */
+			max = ULONG_MAX;
 		}
+
+		/* Hide mm from oom killer because the memory is being freed */
+		mm_flags_set(MMF_OOM_SKIP, mm);
+		if (max) {
+			vma_iter_set(&vmi, 0);
+			tmp = vma_next(&vmi);
+			flush_cache_mm(mm);
+			unmap_region(&vmi.mas, tmp, 0, max, max, NULL, NULL);
+			charge = tear_down_vmas(mm, &vmi, tmp, max);
+			vm_unacct_memory(charge);
+		}
+		__mt_destroy(&mm->mm_mt);
 		/*
 		 * The mm_struct is going to exit, but the locks will be dropped
 		 * first.  Set the mm_struct as unstable is advisable as it is
--
2.50.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables()
  2025-08-15 19:10 ` [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables() Liam R. Howlett
@ 2025-08-18 15:36   ` Lorenzo Stoakes
  2025-08-18 15:54     ` Liam R. Howlett
  2025-08-19 19:14   ` Lorenzo Stoakes
  1 sibling, 1 reply; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-18 15:36 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

I know this is an RFC but to be mildly annoying, you need to also update
tools/testing/vma/vma_internal.h to reflect this as this causes those tests
to break.

On Fri, Aug 15, 2025 at 03:10:29PM -0400, Liam R. Howlett wrote:
> The ceiling and the tree search limit need to be separate arguments for
> a future change to the failed fork recovery path.
>
> No functional changes intended.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
>  mm/internal.h | 4 +++-
>  mm/memory.c   | 7 ++++---
>  mm/mmap.c     | 2 +-
>  mm/vma.c      | 3 ++-
>  4 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 45b725c3dc030..f9a278ac76d83 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -444,7 +444,9 @@ void folio_activate(struct folio *folio);
>
>  void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  		   struct vm_area_struct *start_vma, unsigned long floor,
> -		   unsigned long ceiling, bool mm_wr_locked);
> +		   unsigned long ceiling, unsigned long tree_max,
> +		   bool mm_wr_locked);
> +
>  void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
>
>  struct zap_details;
> diff --git a/mm/memory.c b/mm/memory.c
> index 0ba4f6b718471..3346514562bba 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -371,7 +371,8 @@ void free_pgd_range(struct mmu_gather *tlb,
>
>  void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  		   struct vm_area_struct *vma, unsigned long floor,
> -		   unsigned long ceiling, bool mm_wr_locked)
> +		   unsigned long ceiling, unsigned long tree_max,
> +		   bool mm_wr_locked)
>  {
>  	struct unlink_vma_file_batch vb;
>
> @@ -385,7 +386,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  		 * Note: USER_PGTABLES_CEILING may be passed as ceiling and may
>  		 * be 0.  This will underflow and is okay.
>  		 */
> -		next = mas_find(mas, ceiling - 1);
> +		next = mas_find(mas, tree_max - 1);
>  		if (unlikely(xa_is_zero(next)))
>  			next = NULL;
>
> @@ -405,7 +406,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  		 */
>  		while (next && next->vm_start <= vma->vm_end + PMD_SIZE) {
>  			vma = next;
> -			next = mas_find(mas, ceiling - 1);
> +			next = mas_find(mas, tree_max - 1);
>  			if (unlikely(xa_is_zero(next)))
>  				next = NULL;
>  			if (mm_wr_locked)
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 0995a48b46d59..eba2bc81bc749 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1311,7 +1311,7 @@ void exit_mmap(struct mm_struct *mm)
>  	mt_clear_in_rcu(&mm->mm_mt);
>  	vma_iter_set(&vmi, vma->vm_end);
>  	free_pgtables(&tlb, &vmi.mas, vma, FIRST_USER_ADDRESS,
> -		      USER_PGTABLES_CEILING, true);
> +		      USER_PGTABLES_CEILING, USER_PGTABLES_CEILING, true);
>  	tlb_finish_mmu(&tlb);
>
>  	/*
> diff --git a/mm/vma.c b/mm/vma.c
> index fd270345c25d3..aa75ca8618609 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -486,6 +486,7 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
>  		   /* mm_wr_locked = */ true);
>  	mas_set(mas, vma->vm_end);
>  	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> +		      next ? next->vm_start : USER_PGTABLES_CEILING,
>  		      next ? next->vm_start : USER_PGTABLES_CEILING,
>  		      /* mm_wr_locked = */ true);
>  	tlb_finish_mmu(&tlb);
> @@ -1232,7 +1233,7 @@ static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
>  	mas_set(mas_detach, 1);
>  	/* start and end may be different if there is no prev or next vma. */
>  	free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start,
> -		      vms->unmap_end, mm_wr_locked);
> +		      vms->unmap_end, vms->unmap_end, mm_wr_locked);
>  	tlb_finish_mmu(&tlb);
>  	vms->clear_ptes = false;
>  }
> --
> 2.47.2
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 0/6] Remove XA_ZERO from error recovery of
  2025-08-18  9:44 ` David Hildenbrand
  2025-08-18 14:26   ` Charan Teja Kalla
@ 2025-08-18 15:47   ` Lorenzo Stoakes
  1 sibling, 0 replies; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-18 15:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Liam R. Howlett, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

On Mon, Aug 18, 2025 at 11:44:16AM +0200, David Hildenbrand wrote:
> On 15.08.25 21:10, Liam R. Howlett wrote:
> > Before you read on, please take a moment to acknowledge that David
> > Hildenbrand asked for this, so I'm blaming mostly him :)
>
> :)
>
> >
> > It is possible that the dup_mmap() call fails on allocating or setting
> > up a vma after the maple tree of the oldmm is copied.  Today, that
> > failure point is marked by inserting an XA_ZERO entry over the failure
> > point so that the exact location does not need to be communicated
> > through to exit_mmap().
> >
> > However, a race exists in the tear down process because the dup_mmap()
> > drops the mmap lock before exit_mmap() can remove the partially set up
> > vma tree.  This means that other tasks may get to the mm tree and find
> > the invalid vma pointer (since it's an XA_ZERO entry), even though the
> > mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
> >
> > To remove the race fully, the tree must be cleaned up before dropping
> > the lock.  This is accomplished by extracting the vma cleanup in
> > exit_mmap() and changing the required functions to pass through the vma
> > search limit.
> >
> > This does run the risk of increasing the possibility of finding no vmas
> > (which is already possible!) in code that isn't careful.
>
> Right, it would also happen if __mt_dup() fails I guess.
>
> >
> > The passing of so many limits and variables was such a mess when the
> > dup_mmap() was introduced that it was avoided in favour of the XA_ZERO
> > entry marker, but since the swap case was the second time we've hit
> > cases of walking an almost-dead mm, here's the alternative to checking
> > MMF_UNSTABLE before wandering into other mm structs.
>
> Changes look fairly small and reasonable, so I really like this.
>
> I agree with Jann that doing a partial teardown might be even better, but
> code-wise I suspect it might end up with a lot more churn and weird
> allocation-corner-cases to handle.

I've yet to review the series and see exactly what's proposed but on gut
instinct (and based on past experience with the munmap gather/reattach
stuff), some kind of a partial thing like this tends to end up a nightmare
of weird-stuff-you-didn't-think-about.

So I'm instinctively against this.

However I'll take a proper look through this series shortly and hopefully
have more intelligent things to say...

An aside - I was working on a crazy anon idea over the weekend (I know, I
know) and noticed that the mm life cycle is just weird. I observed apparent
duplicate calls of __mmdrop(), for instance (I think the unwinding just
broke); the delayed mmdrop is strange, and the whole area seems rife with
complexity.

So I'm glad David talked you into doing this ;) this particular edge case
was always strange, and we have now hit it twice (this time more seriously,
as it's due to a fatal signal, which is much more likely to arise than an
OOM scenario with too-small-to-fail allocations).

BTW where are we with the hotfix for the swapoff case [0]? I think we
agreed on setting MMF_UNSTABLE there and using that to decide not to
proceed in unuse_mm(), right?

Cheers, Lorenzo

[0]: https://lore.kernel.org/all/20250808092156.1918973-1-quic_charante@quicinc.com/


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 0/6] Remove XA_ZERO from error recovery of
  2025-08-15 19:49 ` [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Jann Horn
@ 2025-08-18 15:48   ` Liam R. Howlett
  0 siblings, 0 replies; 34+ messages in thread
From: Liam R. Howlett @ 2025-08-18 15:48 UTC (permalink / raw)
  To: Jann Horn
  Cc: David Hildenbrand, Lorenzo Stoakes, maple-tree, linux-mm,
	linux-kernel, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Andrew Morton, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

* Jann Horn <jannh@google.com> [250815 15:50]:
> On Fri, Aug 15, 2025 at 9:10 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > Before you read on, please take a moment to acknowledge that David
> > Hildenbrand asked for this, so I'm blaming mostly him :)
> >
> > It is possible that the dup_mmap() call fails on allocating or setting
> > up a vma after the maple tree of the oldmm is copied.  Today, that
> > failure point is marked by inserting an XA_ZERO entry over the failure
> > point so that the exact location does not need to be communicated
> > through to exit_mmap().
> 
> Overall: Yes please, I'm in favor of getting rid of that XA_ZERO special case.
> 
> > However, a race exists in the tear down process because the dup_mmap()
> > drops the mmap lock before exit_mmap() can remove the partially set up
> > vma tree.  This means that other tasks may get to the mm tree and find
> > the invalid vma pointer (since it's an XA_ZERO entry), even though the
> > mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
> >
> > To remove the race fully, the tree must be cleaned up before dropping
> > the lock.  This is accomplished by extracting the vma cleanup in
> > exit_mmap() and changing the required functions to pass through the vma
> > search limit.
> 
> It really seems to me like, instead of tearing down the whole tree on
> this failure path, we should be able to remove those entries in the
> cloned vma tree that haven't been copied yet and then proceed as
> normal. I understand that this is complicated because of maple tree
> weirdness; but can't we somehow fix the wr_rebalance case to not
> allocate more memory when reducing the number of tree nodes?
> Surely there's some way to do that. A really stupid suggestion: As
> long as wr_rebalance is guaranteed to not increase the number of
> nodes, we could make do with a global-mutex-protected system-global
> preallocation of significantly less than 64 maple tree nodes, right?
> We could even use that in RCU mode, as long as we are willing to take
> a synchronize_rcu() penalty on this "we really want to wipe some tree
> elements" slowpath.
> 
> It feels like we're adding more and more weird contortions caused by
> the kinda bizarre "you can't reliably remove tree elements" property
> of maple trees, and I really feel like that complexity should be
> pushed down into the maple tree implementation instead.


First, thanks for looking at this.  I appreciate you taking the time to
think through what is going on here and alternatives.

The maple tree isn't the first time that we need memory to free memory
and I don't think it'll be the last either.  I have a method of reusing
nodes, but that's when rcu is not on.  What you are suggesting is very
complicated and will have a number of corner cases, and probably zero
users.

Having a global reserve of nodes is something that has come up before,
but there is no way to guarantee that we'd have enough to ensure that
things won't fail.  64 is a huge number of nodes, but there's no way to
know if we're going to hit this situation multiple times in a row in
rapid succession.  And since we cannot depend on it working, there is
no real benefit to having the mode - we have to plan for the
rare case of failure regardless.

There is also another project to reduce the possibility of the maple
tree itself failing to allocate [1], but in the dup_mmap() situation the
tree is not the problem.   What you are asking me to do is to add a
special mode in the tree to work around the fact that the mm struct has
a life cycle problem.

If this special mode already existed as you have suggested above, I
would advise not using it here because we are out of memory and we want
to get memory back from the failed dup_mmap() as quickly as possible,
and the tear down is the fastest way to do that.  I don't want to have
code executing to juggle things around to fix an mm struct so that the
unlock/lock dance is safe for random race conditions.

This is why I think the mode would have zero users, dup_mmap() shouldn't
be doing any extra work after a failed allocation.

I see where you are coming from and that having compartmentalised code
feels like the right path, but in this case the system failed to
allocate enough memory after reclaim ran and we're looking at adding
extra run time for zero benefit.  From the maple tree perspective, there
is one call to destroy the tree, the rest of the code is to clean up the
failed mm setup.

On a higher level, I'm disappointed that the mm struct is half
initialized and unlocked at all.  Really, I think we should go further
and not have the mm exposed for any window of time.  The issue with just
calling exit_mmap() directly is the unlocking/locking dance for
performance reasons.

Equally, the !vma issue seems crazy because it's about avoiding races
where the task has unmapped everything and we really shouldn't be able
to iterate over tasks that are in this state.

And we have this MMF_UNSTABLE bit in the mm, but other tasks walk into
the mm and access the struct without checking.  That seems suboptimal,
at best.

[1] https://lore.kernel.org/all/20250723-slub-percpu-caches-v5-0-b792cd830f5d@suse.cz/ 

Thanks,
Liam



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables()
  2025-08-18 15:36   ` Lorenzo Stoakes
@ 2025-08-18 15:54     ` Liam R. Howlett
  0 siblings, 0 replies; 34+ messages in thread
From: Liam R. Howlett @ 2025-08-18 15:54 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250818 11:36]:
> I know this is an RFC but to be mildly annoying, you need to also update
> tools/testing/vma/vma_internal.h to reflect this as this causes those tests
> to break.

Sounds good.  I did check that, but on an earlier version with another
vma testing fix.

I'll check again before sending out another version.

Thanks,
Liam


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/6] mm/mmap: Move exit_mmap() trace point
  2025-08-15 19:10 ` [RFC PATCH 1/6] mm/mmap: Move exit_mmap() trace point Liam R. Howlett
@ 2025-08-19 18:27   ` Lorenzo Stoakes
  2025-08-21 21:12   ` Chris Li
  1 sibling, 0 replies; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-19 18:27 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

On Fri, Aug 15, 2025 at 03:10:26PM -0400, Liam R. Howlett wrote:
> Move the trace point later in the function so that it is not skipped in
> the event of a failed fork.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/mmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 7306253cc3b57..c4c315b480af7 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1310,9 +1310,9 @@ void exit_mmap(struct mm_struct *mm)
>
>  	BUG_ON(count != mm->map_count);
>
> -	trace_exit_mmap(mm);
>  destroy:
>  	__mt_destroy(&mm->mm_mt);
> +	trace_exit_mmap(mm);
>  	mmap_write_unlock(mm);
>  	vm_unacct_memory(nr_accounted);
>  }
> --
> 2.47.2
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/6] mm/mmap: Abstract vma clean up from exit_mmap()
  2025-08-15 19:10 ` [RFC PATCH 2/6] mm/mmap: Abstract vma clean up from exit_mmap() Liam R. Howlett
@ 2025-08-19 18:38   ` Lorenzo Stoakes
  2025-09-03 19:56     ` Liam R. Howlett
  0 siblings, 1 reply; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-19 18:38 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

On Fri, Aug 15, 2025 at 03:10:27PM -0400, Liam R. Howlett wrote:
> Create the new function tear_down_vmas() to remove a range of vmas.
> exit_mmap() will be removing all the vmas.
>
> This is necessary for future patches.
>
> No functional changes intended.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

This function is pure and complete insanity, but this change looks
good. Couple nits below.

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/mmap.c | 37 ++++++++++++++++++++++++-------------
>  1 file changed, 24 insertions(+), 13 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c4c315b480af7..0995a48b46d59 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1250,6 +1250,29 @@ int vm_brk_flags(unsigned long addr, unsigned long request, vm_flags_t vm_flags)
>  }
>  EXPORT_SYMBOL(vm_brk_flags);
>
> +static inline
> +unsigned long tear_down_vmas(struct mm_struct *mm, struct vma_iterator *vmi,
> +		struct vm_area_struct *vma, unsigned long max)
> +{
> +	unsigned long nr_accounted = 0;
> +	int count = 0;
> +
> +	mmap_assert_write_locked(mm);
> +	vma_iter_set(vmi, vma->vm_end);
> +	do {
> +		if (vma->vm_flags & VM_ACCOUNT)
> +			nr_accounted += vma_pages(vma);
> +		vma_mark_detached(vma);
> +		remove_vma(vma);
> +		count++;
> +		cond_resched();
> +		vma = vma_next(vmi);
> +	} while (vma && vma->vm_end <= max);
> +
> +	BUG_ON(count != mm->map_count);

Can we make this a WARN_ON() or WARN_ON_ONCE() while we're at it?

> +	return nr_accounted;
> +}
> +
>  /* Release all mmaps. */
>  void exit_mmap(struct mm_struct *mm)
>  {
> @@ -1257,7 +1280,6 @@ void exit_mmap(struct mm_struct *mm)
>  	struct vm_area_struct *vma;
>  	unsigned long nr_accounted = 0;

No need to initialise this to 0 any more.

>  	VMA_ITERATOR(vmi, mm, 0);
> -	int count = 0;
>
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
> @@ -1297,18 +1319,7 @@ void exit_mmap(struct mm_struct *mm)
>  	 * enabled, without holding any MM locks besides the unreachable
>  	 * mmap_write_lock.
>  	 */
> -	vma_iter_set(&vmi, vma->vm_end);
> -	do {
> -		if (vma->vm_flags & VM_ACCOUNT)
> -			nr_accounted += vma_pages(vma);
> -		vma_mark_detached(vma);
> -		remove_vma(vma);
> -		count++;
> -		cond_resched();
> -		vma = vma_next(&vmi);
> -	} while (vma && likely(!xa_is_zero(vma)));
> -
> -	BUG_ON(count != mm->map_count);
> +	nr_accounted = tear_down_vmas(mm, &vmi, vma, ULONG_MAX);
>
>  destroy:
>  	__mt_destroy(&mm->mm_mt);
> --
> 2.47.2
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/6] mm/vma: Add limits to unmap_region() for vmas
  2025-08-15 19:10 ` [RFC PATCH 3/6] mm/vma: Add limits to unmap_region() for vmas Liam R. Howlett
@ 2025-08-19 18:48   ` Lorenzo Stoakes
  2025-09-03 19:57     ` Liam R. Howlett
  0 siblings, 1 reply; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-19 18:48 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

On Fri, Aug 15, 2025 at 03:10:28PM -0400, Liam R. Howlett wrote:
> Add a limit to the vma search instead of using the start and end of the
> one passed in.
>
> No functional changes intended.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
>  mm/vma.c | 6 ++++--
>  mm/vma.h | 1 +
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index 3b12c7579831b..fd270345c25d3 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -474,6 +474,7 @@ void remove_vma(struct vm_area_struct *vma)
>   * Called with the mm semaphore held.
>   */
>  void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> +		unsigned long vma_min, unsigned long vma_max,
>  		struct vm_area_struct *prev, struct vm_area_struct *next)

Luckily we will be getting rid of this as soon as I complete the mmap_prepare +
friends series (_actively_ working on this right now).

Anyway, this is getting quickly horrid, since this is only in vma.c, we could
make this static and pass through map to reduce some of the horror?

We actually can hand it vmi (and thus &vmi->mas via map, so we could implement
it as:

void unmap_region(struct mmap_state *map, struct vm_area_struct *vma,
		  unsigned long min, unsigned long max);
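
That way the driver-failure path below would collapse to roughly (sketch
only, untested):

	unmap_region(map, vma, vma->vm_start, vma->vm_end);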

>  {
>  	struct mm_struct *mm = vma->vm_mm;
> @@ -481,7 +482,7 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
>
>  	tlb_gather_mmu(&tlb, mm);
>  	update_hiwater_rss(mm);
> -	unmap_vmas(&tlb, mas, vma, vma->vm_start, vma->vm_end, vma->vm_end,
> +	unmap_vmas(&tlb, mas, vma, vma_min, vma_max, vma_max,
>  		   /* mm_wr_locked = */ true);
>  	mas_set(mas, vma->vm_end);
>  	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> @@ -2417,7 +2418,8 @@ static int __mmap_new_file_vma(struct mmap_state *map,
>
>  		vma_iter_set(vmi, vma->vm_end);
>  		/* Undo any partial mapping done by a device driver. */
> -		unmap_region(&vmi->mas, vma, map->prev, map->next);
> +		unmap_region(&vmi->mas, vma, vma->vm_start, vma->vm_end,
> +			     map->prev, map->next);
>
>  		return error;
>  	}
> diff --git a/mm/vma.h b/mm/vma.h
> index b123a9cdedb0d..336dae295853e 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -281,6 +281,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>  void remove_vma(struct vm_area_struct *vma);
>
>  void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> +		unsigned long min, unsigned long max,
>  		struct vm_area_struct *prev, struct vm_area_struct *next);
>
>  /* We are about to modify the VMA's flags. */
> --
> 2.47.2
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables()
  2025-08-15 19:10 ` [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables() Liam R. Howlett
  2025-08-18 15:36   ` Lorenzo Stoakes
@ 2025-08-19 19:14   ` Lorenzo Stoakes
  2025-09-03 20:19     ` Liam R. Howlett
  1 sibling, 1 reply; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-19 19:14 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

On Fri, Aug 15, 2025 at 03:10:29PM -0400, Liam R. Howlett wrote:
> The ceiling and tree search limit need to be different arguments for the
> future change in the failed fork attempt.
>
> No functional changes intended.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

(Obv. in addition to comment about broken VMA tests :P)

I guess intent is that if we discover any page tables beyond tree_max then
we ought to just wipe them all out so, in effect, we don't consider
mappings at or past tree_max to be valid?

I feel like we need a comment to this effect as this is confusing as it is.

Could we add a kerneldoc comment for free_pgtables() spelling this out?
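
Maybe something along these lines for the new parameter (rough sketch,
wording obviously up for grabs):

 * @ceiling:  highest address up to which page tables may be freed, as
 *            before
 * @tree_max: exclusive limit for the VMA tree walk; VMAs at or above
 *            this address are never visited, even if page tables exist
 *            beyond it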

> ---
>  mm/internal.h | 4 +++-
>  mm/memory.c   | 7 ++++---
>  mm/mmap.c     | 2 +-
>  mm/vma.c      | 3 ++-
>  4 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 45b725c3dc030..f9a278ac76d83 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -444,7 +444,9 @@ void folio_activate(struct folio *folio);
>
>  void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  		   struct vm_area_struct *start_vma, unsigned long floor,
> -		   unsigned long ceiling, bool mm_wr_locked);
> +		   unsigned long ceiling, unsigned long tree_max,
> +		   bool mm_wr_locked);
> +
>  void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
>
>  struct zap_details;
> diff --git a/mm/memory.c b/mm/memory.c
> index 0ba4f6b718471..3346514562bba 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -371,7 +371,8 @@ void free_pgd_range(struct mmu_gather *tlb,
>
>  void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  		   struct vm_area_struct *vma, unsigned long floor,
> -		   unsigned long ceiling, bool mm_wr_locked)
> +		   unsigned long ceiling, unsigned long tree_max,
> +		   bool mm_wr_locked)
>  {
>  	struct unlink_vma_file_batch vb;
>
> @@ -385,7 +386,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  		 * Note: USER_PGTABLES_CEILING may be passed as ceiling and may
>  		 * be 0.  This will underflow and is okay.
>  		 */
> -		next = mas_find(mas, ceiling - 1);
> +		next = mas_find(mas, tree_max - 1);

Do we need to put some sort of sanity checks in to make sure tree_max <= ceiling
(though this 0 case is a pain... so I guess tree_max - 1 <= ceiling - 1?)
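
i.e. something like (untested, just to illustrate):

	/* both limits may be 0 and intentionally underflow, hence -1 on each side */
	VM_WARN_ON_ONCE(tree_max - 1 > ceiling - 1);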

>  		if (unlikely(xa_is_zero(next)))
>  			next = NULL;
>
> @@ -405,7 +406,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  		 */
>  		while (next && next->vm_start <= vma->vm_end + PMD_SIZE) {
>  			vma = next;
> -			next = mas_find(mas, ceiling - 1);
> +			next = mas_find(mas, tree_max - 1);
>  			if (unlikely(xa_is_zero(next)))
>  				next = NULL;
>  			if (mm_wr_locked)
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 0995a48b46d59..eba2bc81bc749 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1311,7 +1311,7 @@ void exit_mmap(struct mm_struct *mm)
>  	mt_clear_in_rcu(&mm->mm_mt);
>  	vma_iter_set(&vmi, vma->vm_end);
>  	free_pgtables(&tlb, &vmi.mas, vma, FIRST_USER_ADDRESS,
> -		      USER_PGTABLES_CEILING, true);
> +		      USER_PGTABLES_CEILING, USER_PGTABLES_CEILING, true);
>  	tlb_finish_mmu(&tlb);
>
>  	/*
> diff --git a/mm/vma.c b/mm/vma.c
> index fd270345c25d3..aa75ca8618609 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -486,6 +486,7 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
>  		   /* mm_wr_locked = */ true);
>  	mas_set(mas, vma->vm_end);
>  	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> +		      next ? next->vm_start : USER_PGTABLES_CEILING,
>  		      next ? next->vm_start : USER_PGTABLES_CEILING,
>  		      /* mm_wr_locked = */ true);
>  	tlb_finish_mmu(&tlb);
> @@ -1232,7 +1233,7 @@ static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
>  	mas_set(mas_detach, 1);
>  	/* start and end may be different if there is no prev or next vma. */
>  	free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start,
> -		      vms->unmap_end, mm_wr_locked);
> +		      vms->unmap_end, vms->unmap_end, mm_wr_locked);
>  	tlb_finish_mmu(&tlb);
>  	vms->clear_ptes = false;
>  }
> --
> 2.47.2
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 5/6] mm/vma: Add page table limit to unmap_region()
  2025-08-15 19:10 ` [RFC PATCH 5/6] mm/vma: Add page table limit to unmap_region() Liam R. Howlett
@ 2025-08-19 19:27   ` Lorenzo Stoakes
  0 siblings, 0 replies; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-19 19:27 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

On Fri, Aug 15, 2025 at 03:10:30PM -0400, Liam R. Howlett wrote:
> The unmap_region() calls need to pass through the page table limit for a
> future patch.
>
> No functional changes intended.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

This functionally LGTM, but see comment below... about a comment :P

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/vma.c | 5 +++--
>  mm/vma.h | 2 +-
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index aa75ca8618609..39f3b55a020b2 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -474,7 +474,7 @@ void remove_vma(struct vm_area_struct *vma)
>   * Called with the mm semaphore held.
>   */
>  void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> -		unsigned long vma_min, unsigned long vma_max,
> +		unsigned long vma_min, unsigned long vma_max, unsigned long pg_max,
>  		struct vm_area_struct *prev, struct vm_area_struct *next)

See previous comments about refactoring unmap_region(), also let's have a
comment explaining what's going on with these parameters too.
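
e.g. something like (rough wording only):

 * @vma_min: lowest address of the range whose mappings are removed
 * @vma_max: end (exclusive) of that range
 * @pg_max:  vma tree walk limit used when freeing page tables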

>  {
>  	struct mm_struct *mm = vma->vm_mm;
> @@ -487,7 +487,7 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
>  	mas_set(mas, vma->vm_end);
>  	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
>  		      next ? next->vm_start : USER_PGTABLES_CEILING,
> -		      next ? next->vm_start : USER_PGTABLES_CEILING,
> +		      pg_max,
>  		      /* mm_wr_locked = */ true);
>  	tlb_finish_mmu(&tlb);
>  }
> @@ -2420,6 +2420,7 @@ static int __mmap_new_file_vma(struct mmap_state *map,
>  		vma_iter_set(vmi, vma->vm_end);
>  		/* Undo any partial mapping done by a device driver. */
>  		unmap_region(&vmi->mas, vma, vma->vm_start, vma->vm_end,
> +			     map->next ? map->next->vm_start : USER_PGTABLES_CEILING,
>  			     map->prev, map->next);
>
>  		return error;
> diff --git a/mm/vma.h b/mm/vma.h
> index 336dae295853e..ba203c0c1d89d 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -281,7 +281,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>  void remove_vma(struct vm_area_struct *vma);
>
>  void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> -		unsigned long min, unsigned long max,
> +		unsigned long min, unsigned long max, unsigned long pg_max,
>  		struct vm_area_struct *prev, struct vm_area_struct *next);
>
>  /* We are about to modify the VMA's flags. */
> --
> 2.47.2
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 6/6] mm: Change dup_mmap() recovery
  2025-08-15 19:10 ` [RFC PATCH 6/6] mm: Change dup_mmap() recovery Liam R. Howlett
  2025-08-18 15:12   ` Lorenzo Stoakes
@ 2025-08-19 20:33   ` Lorenzo Stoakes
  2025-09-04  0:13     ` Liam R. Howlett
  1 sibling, 1 reply; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-19 20:33 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

On Fri, Aug 15, 2025 at 03:10:31PM -0400, Liam R. Howlett wrote:
> When the dup_mmap() fails during the vma duplication or setup, don't
> write the XA_ZERO entry in the vma tree.  Instead, destroy the tree and
> free the new resources, leaving an empty vma tree.

Yesss like it!

>
> Using XA_ZERO introduced races where the vma could be found between
> dup_mmap() dropping all locks and exit_mmap() taking the locks.  The
> race can occur because the mm can be reached through the other trees
> via successfully copied vmas and other methods such as the swapoff code.

Yeah god.

>
> XA_ZERO was marking the location to stop vma removal and pagetable
> freeing.  The newly created arguments to the unmap_vmas() and
> free_pgtables() serve this function.

Nice.

>
> Replacing the XA_ZERO entry use with the new argument list also means
> the checks for xa_is_zero() are no longer necessary so these are also
> removed.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

Ah finally the 'action' patch :)

Obviously my review is on the basis that you fixup the rebase etc.

This broadly looks right but some various nits etc. below and will have
another look through on new revision - this whole area is pretty crazy!

> ---
>  mm/memory.c |  6 +-----
>  mm/mmap.c   | 39 ++++++++++++++++++++++++++++-----------
>  2 files changed, 29 insertions(+), 16 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3346514562bba..8cd58f171bae4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -387,8 +387,6 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  		 * be 0.  This will underflow and is okay.
>  		 */
>  		next = mas_find(mas, tree_max - 1);
> -		if (unlikely(xa_is_zero(next)))
> -			next = NULL;
>
>  		/*
>  		 * Hide vma from rmap and truncate_pagecache before freeing
> @@ -407,8 +405,6 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  		while (next && next->vm_start <= vma->vm_end + PMD_SIZE) {
>  			vma = next;
>  			next = mas_find(mas, tree_max - 1);
> -			if (unlikely(xa_is_zero(next)))
> -				next = NULL;
>  			if (mm_wr_locked)
>  				vma_start_write(vma);
>  			unlink_anon_vmas(vma);
> @@ -1978,7 +1974,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
>  				 mm_wr_locked);
>  		hugetlb_zap_end(vma, &details);
>  		vma = mas_find(mas, tree_end - 1);
> -	} while (vma && likely(!xa_is_zero(vma)));
> +	} while (vma);
>  	mmu_notifier_invalidate_range_end(&range);
>  }
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index eba2bc81bc749..5acc0b5f8c14a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1288,7 +1288,7 @@ void exit_mmap(struct mm_struct *mm)
>  	arch_exit_mmap(mm);
>
>  	vma = vma_next(&vmi);
> -	if (!vma || unlikely(xa_is_zero(vma))) {
> +	if (!vma) {
>  		/* Can happen if dup_mmap() received an OOM */
>  		mmap_read_unlock(mm);
>  		mmap_write_lock(mm);
> @@ -1858,20 +1858,37 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  		ksm_fork(mm, oldmm);
>  		khugepaged_fork(mm, oldmm);
>  	} else {
> +		unsigned long max;
>
>  		/*
> -		 * The entire maple tree has already been duplicated. If the
> -		 * mmap duplication fails, mark the failure point with
> -		 * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
> -		 * stop releasing VMAs that have not been duplicated after this
> -		 * point.
> +		 * The entire maple tree has already been duplicated, but
> +		 * replacing the vmas failed at mpnt (which could be NULL if
> +		 * all were allocated but the last vma was not fully set up. Use

Missing ')'.

> +		 * the start address of the failure point to clean up the half
> +		 * initialized tree.

NIT: Is 'half' correct? Maybe 'partially'?

>  		 */
> -		if (mpnt) {
> -			mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
> -			mas_store(&vmi.mas, XA_ZERO_ENTRY);
> -			/* Avoid OOM iterating a broken tree */
> -			set_bit(MMF_OOM_SKIP, &mm->flags);
> +		if (!mm->map_count) {
> +			/* zero vmas were written to the new tree. */
> +			max = 0;

OK I guess this then will result in the intentional underflow thing, maybe
worth mentioning?

> +		} else if (mpnt) {
> +			/* mid-tree failure */

Partial?

> +			max = mpnt->vm_start;
> +		} else {
> +			/* All vmas were written to the new tree */
> +			max = ULONG_MAX;
>  		}
> +
> +		/* Hide mm from oom killer because the memory is being freed */
> +		set_bit(MMF_OOM_SKIP, &mm->flags);

Obv. update to mm_flags_set(MMF_OOM_SKIP, mm);

> +		if (max) {
> +			vma_iter_set(&vmi, 0);
> +			tmp = vma_next(&vmi);
> +			flush_cache_mm(mm);
> +			unmap_region(&vmi.mas, tmp, 0, max, max, NULL, NULL);

(An aside for the unmap_region() impl, maybe let's name the pg_max as
tree_max to make it consistent across both?)

Hm we could still use the mmap struct here if we put in vma.h. Just have to
set vmi, obv prev, next NULL.

So:

	struct mmap_state map {
		.vmi = &vmi,
	}

	unmap_region(&map, tmp, 0, max, max);

Possibly overkill and hopefully stack ok but makes other callers less
horrid.

Maybe also good to add a comment spelling out what each of these params do.


> +			charge = tear_down_vmas(mm, &vmi, tmp, max);
> +			vm_unacct_memory(charge);
> +		}
> +		__mt_destroy(&mm->mm_mt);
>  		/*
>  		 * The mm_struct is going to exit, but the locks will be dropped
>  		 * first.  Set the mm_struct as unstable is advisable as it is
> --
> 2.47.2
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/6] mm/mmap: Move exit_mmap() trace point
  2025-08-15 19:10 ` [RFC PATCH 1/6] mm/mmap: Move exit_mmap() trace point Liam R. Howlett
  2025-08-19 18:27   ` Lorenzo Stoakes
@ 2025-08-21 21:12   ` Chris Li
  1 sibling, 0 replies; 34+ messages in thread
From: Chris Li @ 2025-08-21 21:12 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: David Hildenbrand, Lorenzo Stoakes, maple-tree, linux-mm,
	linux-kernel, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Andrew Morton, Jann Horn, Pedro Falcato,
	Charan Teja Kalla, shikemeng, kasong, nphamcs, bhe, baohua,
	Matthew Wilcox

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Fri, Aug 15, 2025 at 12:11 PM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> Move the trace point later in the function so that it is not skipped in
> the event of a failed fork.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
>  mm/mmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 7306253cc3b57..c4c315b480af7 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1310,9 +1310,9 @@ void exit_mmap(struct mm_struct *mm)
>
>         BUG_ON(count != mm->map_count);
>
> -       trace_exit_mmap(mm);
>  destroy:
>         __mt_destroy(&mm->mm_mt);
> +       trace_exit_mmap(mm);
>         mmap_write_unlock(mm);
>         vm_unacct_memory(nr_accounted);
>  }
> --
> 2.47.2
>
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/6] mm/mmap: Abstract vma clean up from exit_mmap()
  2025-08-19 18:38   ` Lorenzo Stoakes
@ 2025-09-03 19:56     ` Liam R. Howlett
  2025-09-04 15:21       ` Lorenzo Stoakes
  0 siblings, 1 reply; 34+ messages in thread
From: Liam R. Howlett @ 2025-09-03 19:56 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250819 14:38]:
> On Fri, Aug 15, 2025 at 03:10:27PM -0400, Liam R. Howlett wrote:
> > Create the new function tear_down_vmas() to remove a range of vmas.
> > exit_mmap() will be removing all the vmas.
> >
> > This is necessary for future patches.
> >
> > No functional changes intended.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> 
> This function is pure and complete insanity, but this change looks
> good. Couple nits below.
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> 
> > ---
> >  mm/mmap.c | 37 ++++++++++++++++++++++++-------------
> >  1 file changed, 24 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index c4c315b480af7..0995a48b46d59 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1250,6 +1250,29 @@ int vm_brk_flags(unsigned long addr, unsigned long request, vm_flags_t vm_flags)
> >  }
> >  EXPORT_SYMBOL(vm_brk_flags);
> >
> > +static inline
> > +unsigned long tear_down_vmas(struct mm_struct *mm, struct vma_iterator *vmi,
> > +		struct vm_area_struct *vma, unsigned long max)
> > +{
> > +	unsigned long nr_accounted = 0;
> > +	int count = 0;
> > +
> > +	mmap_assert_write_locked(mm);
> > +	vma_iter_set(vmi, vma->vm_end);
> > +	do {
> > +		if (vma->vm_flags & VM_ACCOUNT)
> > +			nr_accounted += vma_pages(vma);
> > +		vma_mark_detached(vma);
> > +		remove_vma(vma);
> > +		count++;
> > +		cond_resched();
> > +		vma = vma_next(vmi);
> > +	} while (vma && vma->vm_end <= max);
> > +
> > +	BUG_ON(count != mm->map_count);
> 
> Can we make this a WARN_ON() or WARN_ON_ONCE() while we're at it?

Sure!

> 
> > +	return nr_accounted;
> > +}
> > +
> >  /* Release all mmaps. */
> >  void exit_mmap(struct mm_struct *mm)
> >  {
> > @@ -1257,7 +1280,6 @@ void exit_mmap(struct mm_struct *mm)
> >  	struct vm_area_struct *vma;
> >  	unsigned long nr_accounted = 0;
> 
> No need to initialise this to 0 any more.

There is a goto label below that skips calling the tear down, so this is
still needed.
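
i.e. the path in question is roughly this (paraphrasing exit_mmap() with
this series applied):

	vma = vma_next(&vmi);
	if (!vma) {
		/* nr_accounted never gets assigned on this path */
		mmap_read_unlock(mm);
		mmap_write_lock(mm);
		goto destroy;
	}
	...
	nr_accounted = tear_down_vmas(mm, &vmi, vma, ULONG_MAX);
destroy:
	__mt_destroy(&mm->mm_mt);
	...
	vm_unacct_memory(nr_accounted);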

Thanks,
Liam


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/6] mm/vma: Add limits to unmap_region() for vmas
  2025-08-19 18:48   ` Lorenzo Stoakes
@ 2025-09-03 19:57     ` Liam R. Howlett
  2025-09-04 15:23       ` Lorenzo Stoakes
  0 siblings, 1 reply; 34+ messages in thread
From: Liam R. Howlett @ 2025-09-03 19:57 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250819 14:48]:
> On Fri, Aug 15, 2025 at 03:10:28PM -0400, Liam R. Howlett wrote:
> > Add a limit to the vma search instead of using the start and end of the
> > one passed in.
> >
> > No functional changes intended.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > ---
> >  mm/vma.c | 6 ++++--
> >  mm/vma.h | 1 +
> >  2 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 3b12c7579831b..fd270345c25d3 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -474,6 +474,7 @@ void remove_vma(struct vm_area_struct *vma)
> >   * Called with the mm semaphore held.
> >   */
> >  void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> > +		unsigned long vma_min, unsigned long vma_max,
> >  		struct vm_area_struct *prev, struct vm_area_struct *next)
> 
> Luckily we will be getting rid of this as soon as I complete the mmap_prepare +
> friends series (_actively_ working on this right now).
> 
> Anyway, this is getting quickly horrid, since this is only in vma.c, we could
> make this static and pass through map to reduce some of the horror?
> 
> We actually can hand it vmi (and thus &vmi->mas via map, so we could implement
> it as:
> 
> void unmap_region(struct mmap_state *map, struct vm_area_struct *vma,
> 		  unsigned long min, unsigned long max);
> 

Ah... I could do this on top of the last patch.. that might be more
clear?

> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> > @@ -481,7 +482,7 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> >
> >  	tlb_gather_mmu(&tlb, mm);
> >  	update_hiwater_rss(mm);
> > -	unmap_vmas(&tlb, mas, vma, vma->vm_start, vma->vm_end, vma->vm_end,
> > +	unmap_vmas(&tlb, mas, vma, vma_min, vma_max, vma_max,
> >  		   /* mm_wr_locked = */ true);
> >  	mas_set(mas, vma->vm_end);
> >  	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> > @@ -2417,7 +2418,8 @@ static int __mmap_new_file_vma(struct mmap_state *map,
> >
> >  		vma_iter_set(vmi, vma->vm_end);
> >  		/* Undo any partial mapping done by a device driver. */
> > -		unmap_region(&vmi->mas, vma, map->prev, map->next);
> > +		unmap_region(&vmi->mas, vma, vma->vm_start, vma->vm_end,
> > +			     map->prev, map->next);
> >
> >  		return error;
> >  	}
> > diff --git a/mm/vma.h b/mm/vma.h
> > index b123a9cdedb0d..336dae295853e 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -281,6 +281,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
> >  void remove_vma(struct vm_area_struct *vma);
> >
> >  void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> > +		unsigned long min, unsigned long max,
> >  		struct vm_area_struct *prev, struct vm_area_struct *next);
> >
> >  /* We are about to modify the VMA's flags. */
> > --
> > 2.47.2
> >


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables()
  2025-08-19 19:14   ` Lorenzo Stoakes
@ 2025-09-03 20:19     ` Liam R. Howlett
  2025-09-04 10:20       ` David Hildenbrand
  2025-09-04 15:33       ` Lorenzo Stoakes
  0 siblings, 2 replies; 34+ messages in thread
From: Liam R. Howlett @ 2025-09-03 20:19 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250819 15:14]:
> On Fri, Aug 15, 2025 at 03:10:29PM -0400, Liam R. Howlett wrote:
> > The ceiling and tree search limit need to be different arguments for the
> > future change in the failed fork attempt.
> >
> > No functional changes intended.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> 
> (Obv. in addition to comment about broken VMA tests :P)
> 
> I guess intent is that if we discover any page tables beyond tree_max then
> we ought to just wipe them all out so, in effect, we don't consider
> mappings at or past tree_max to be valid?

Actually... there are some archs that map outside the vma and they are
valid.. I think mips? and I think lower, but yeah.. it's needed.  This
is why prev->vm_end and next->vm_start are used as page table limits,
afaik.  This is a serious annoyance because it frequently adds walks
that are infrequently necessary to the vma tree.

> 
> I feel like we need a comment to this effect as this is confusing as it is.
> 
> Could we add a kerneldoc comment for free_pgtables() spelling this out?

I'll add a note here, but confusion will probably increase.  I'll add a
note about the tree max as well.

> 
> > ---
> >  mm/internal.h | 4 +++-
> >  mm/memory.c   | 7 ++++---
> >  mm/mmap.c     | 2 +-
> >  mm/vma.c      | 3 ++-
> >  4 files changed, 10 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 45b725c3dc030..f9a278ac76d83 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -444,7 +444,9 @@ void folio_activate(struct folio *folio);
> >
> >  void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> >  		   struct vm_area_struct *start_vma, unsigned long floor,
> > -		   unsigned long ceiling, bool mm_wr_locked);
> > +		   unsigned long ceiling, unsigned long tree_max,
> > +		   bool mm_wr_locked);
> > +
> >  void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
> >
> >  struct zap_details;
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 0ba4f6b718471..3346514562bba 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -371,7 +371,8 @@ void free_pgd_range(struct mmu_gather *tlb,
> >
> >  void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> >  		   struct vm_area_struct *vma, unsigned long floor,
> > -		   unsigned long ceiling, bool mm_wr_locked)
> > +		   unsigned long ceiling, unsigned long tree_max,
> > +		   bool mm_wr_locked)
> >  {
> >  	struct unlink_vma_file_batch vb;
> >
> > @@ -385,7 +386,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> >  		 * Note: USER_PGTABLES_CEILING may be passed as ceiling and may
> >  		 * be 0.  This will underflow and is okay.
> >  		 */
> > -		next = mas_find(mas, ceiling - 1);
> > +		next = mas_find(mas, tree_max - 1);
> 
> Do we need to put some sort of sanity checks in to make sure tree_max <= ceiling
> (though this 0 case is a pain... so I guess tree_max - 1 <= ceiling - 1?)

Sure!

> 
> >  		if (unlikely(xa_is_zero(next)))
> >  			next = NULL;
> >
> > @@ -405,7 +406,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> >  		 */
> >  		while (next && next->vm_start <= vma->vm_end + PMD_SIZE) {
> >  			vma = next;
> > -			next = mas_find(mas, ceiling - 1);
> > +			next = mas_find(mas, tree_max - 1);
> >  			if (unlikely(xa_is_zero(next)))
> >  				next = NULL;
> >  			if (mm_wr_locked)
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 0995a48b46d59..eba2bc81bc749 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1311,7 +1311,7 @@ void exit_mmap(struct mm_struct *mm)
> >  	mt_clear_in_rcu(&mm->mm_mt);
> >  	vma_iter_set(&vmi, vma->vm_end);
> >  	free_pgtables(&tlb, &vmi.mas, vma, FIRST_USER_ADDRESS,
> > -		      USER_PGTABLES_CEILING, true);
> > +		      USER_PGTABLES_CEILING, USER_PGTABLES_CEILING, true);
> >  	tlb_finish_mmu(&tlb);
> >
> >  	/*
> > diff --git a/mm/vma.c b/mm/vma.c
> > index fd270345c25d3..aa75ca8618609 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -486,6 +486,7 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> >  		   /* mm_wr_locked = */ true);
> >  	mas_set(mas, vma->vm_end);
> >  	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> > +		      next ? next->vm_start : USER_PGTABLES_CEILING,
> >  		      next ? next->vm_start : USER_PGTABLES_CEILING,
> >  		      /* mm_wr_locked = */ true);
> >  	tlb_finish_mmu(&tlb);
> > @@ -1232,7 +1233,7 @@ static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
> >  	mas_set(mas_detach, 1);
> >  	/* start and end may be different if there is no prev or next vma. */
> >  	free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start,
> > -		      vms->unmap_end, mm_wr_locked);
> > +		      vms->unmap_end, vms->unmap_end, mm_wr_locked);
> >  	tlb_finish_mmu(&tlb);
> >  	vms->clear_ptes = false;
> >  }
> > --
> > 2.47.2
> >


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 6/6] mm: Change dup_mmap() recovery
  2025-08-19 20:33   ` Lorenzo Stoakes
@ 2025-09-04  0:13     ` Liam R. Howlett
  2025-09-04 15:40       ` Lorenzo Stoakes
  0 siblings, 1 reply; 34+ messages in thread
From: Liam R. Howlett @ 2025-09-04  0:13 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250819 16:34]:
> On Fri, Aug 15, 2025 at 03:10:31PM -0400, Liam R. Howlett wrote:
> > When the dup_mmap() fails during the vma duplication or setup, don't
> > write the XA_ZERO entry in the vma tree.  Instead, destroy the tree and
> > free the new resources, leaving an empty vma tree.
> 
> Yesss like it!
> 
> >
> > Using XA_ZERO introduced races where the vma could be found between
> > dup_mmap() dropping all locks and exit_mmap() taking the locks.  The
> > race can occur because the mm can be reached through the other trees
> > via successfully copied vmas and other methods such as the swapoff code.
> 
> Yeah god.
> 
> >
> > XA_ZERO was marking the location to stop vma removal and pagetable
> > freeing.  The newly created arguments to the unmap_vmas() and
> > free_pgtables() serve this function.
> 
> Nice.
> 
> >
> > Replacing the XA_ZERO entry use with the new argument list also means
> > the checks for xa_is_zero() are no longer necessary so these are also
> > removed.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> 
> Ah finally the 'action' patch :)
> 
> Obviously my review is on the basis that you fixup the rebase etc.
> 
> This broadly looks right but some various nits etc. below and will have
> another look through on new revision - this whole area is pretty crazy!
> 
> > ---
> >  mm/memory.c |  6 +-----
> >  mm/mmap.c   | 39 ++++++++++++++++++++++++++++-----------
> >  2 files changed, 29 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3346514562bba..8cd58f171bae4 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -387,8 +387,6 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> >  		 * be 0.  This will underflow and is okay.
> >  		 */
> >  		next = mas_find(mas, tree_max - 1);
> > -		if (unlikely(xa_is_zero(next)))
> > -			next = NULL;
> >
> >  		/*
> >  		 * Hide vma from rmap and truncate_pagecache before freeing
> > @@ -407,8 +405,6 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> >  		while (next && next->vm_start <= vma->vm_end + PMD_SIZE) {
> >  			vma = next;
> >  			next = mas_find(mas, tree_max - 1);
> > -			if (unlikely(xa_is_zero(next)))
> > -				next = NULL;
> >  			if (mm_wr_locked)
> >  				vma_start_write(vma);
> >  			unlink_anon_vmas(vma);
> > @@ -1978,7 +1974,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> >  				 mm_wr_locked);
> >  		hugetlb_zap_end(vma, &details);
> >  		vma = mas_find(mas, tree_end - 1);
> > -	} while (vma && likely(!xa_is_zero(vma)));
> > +	} while (vma);
> >  	mmu_notifier_invalidate_range_end(&range);
> >  }
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index eba2bc81bc749..5acc0b5f8c14a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1288,7 +1288,7 @@ void exit_mmap(struct mm_struct *mm)
> >  	arch_exit_mmap(mm);
> >
> >  	vma = vma_next(&vmi);
> > -	if (!vma || unlikely(xa_is_zero(vma))) {
> > +	if (!vma) {
> >  		/* Can happen if dup_mmap() received an OOM */
> >  		mmap_read_unlock(mm);
> >  		mmap_write_lock(mm);
> > @@ -1858,20 +1858,37 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> >  		ksm_fork(mm, oldmm);
> >  		khugepaged_fork(mm, oldmm);
> >  	} else {
> > +		unsigned long max;
> >
> >  		/*
> > -		 * The entire maple tree has already been duplicated. If the
> > -		 * mmap duplication fails, mark the failure point with
> > -		 * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
> > -		 * stop releasing VMAs that have not been duplicated after this
> > -		 * point.
> > +		 * The entire maple tree has already been duplicated, but
> > +		 * replacing the vmas failed at mpnt (which could be NULL if
> > +		 * all were allocated but the last vma was not fully set up. Use
> 
> Missing ')'.

Thanks.

> 
> > +		 * the start address of the failure point to clean up the half
> > +		 * initialized tree.
> 
> NIT: Is 'half' correct? Maybe 'partially'?

Thanks.

> 
> >  		 */
> > -		if (mpnt) {
> > -			mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
> > -			mas_store(&vmi.mas, XA_ZERO_ENTRY);
> > -			/* Avoid OOM iterating a broken tree */
> > -			set_bit(MMF_OOM_SKIP, &mm->flags);
> > +		if (!mm->map_count) {
> > +			/* zero vmas were written to the new tree. */
> > +			max = 0;
> 
> OK I guess this then will result in the intentional underflow thing, maybe
> worth mentioning?

No, nothing will happen as the if (max) will evaluate as false.
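
i.e., restating the three cases set up in this hunk:

	/* max == 0:              nothing was copied, skip straight to __mt_destroy() */
	/* max == mpnt->vm_start: partial copy, unmap and tear down [0, max)          */
	/* max == ULONG_MAX:      everything was copied, unmap and tear down it all   */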

> 
> > +		} else if (mpnt) {
> > +			/* mid-tree failure */
> 
> Partial?

Right, thanks.

> 
> > +			max = mpnt->vm_start;
> > +		} else {
> > +			/* All vmas were written to the new tree */
> > +			max = ULONG_MAX;
> >  		}
> > +
> > +		/* Hide mm from oom killer because the memory is being freed */
> > +		set_bit(MMF_OOM_SKIP, &mm->flags);
> 
> Obv. update to mm_flags_set(MMF_OOM_SKIP, mm);

Right, happened on rebase.

> 
> > +		if (max) {
> > +			vma_iter_set(&vmi, 0);
> > +			tmp = vma_next(&vmi);
> > +			flush_cache_mm(mm);
> > +			unmap_region(&vmi.mas, tmp, 0, max, max, NULL, NULL);
> 
> (An aside for the unmap_region() impl, maybe let's name the pg_max as
> tree_max to make it consistent across both?)
> 
> Hm we could still use the mmap struct here if we put in vma.h. Just have to
> set vmi, obv prev, next NULL.
> 
> So:
> 
> 	struct mmap_state map {
> 		.vmi = &vmi,
> 	}
> 
> 	unmap_region(&map, tmp, 0, max, max);
> 
> Possibly overkill and hopefully stack ok but makes other callers less
> horrid.
> 
> Maybe also good to add a comment spelling out what each of these params do.

So I've tried to do this cleanly but it hasn't worked out.  I think I'll
add another struct to clean this up in another patch on this series.
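
Something like the following might work (name and layout entirely made up
at this point):

	struct unmap_desc {
		struct ma_state *mas;
		struct vm_area_struct *first;	/* first vma to remove */
		unsigned long vma_min;		/* lowest address to unmap */
		unsigned long vma_max;		/* end of the unmap range */
		unsigned long tree_max;		/* vma tree walk limit */
		struct vm_area_struct *prev, *next;
	};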

Thanks,
Liam


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables()
  2025-09-03 20:19     ` Liam R. Howlett
@ 2025-09-04 10:20       ` David Hildenbrand
  2025-09-04 15:36         ` Lorenzo Stoakes
  2025-09-04 15:33       ` Lorenzo Stoakes
  1 sibling, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-09-04 10:20 UTC (permalink / raw)
  To: Liam R. Howlett, Lorenzo Stoakes, maple-tree, linux-mm,
	linux-kernel, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Andrew Morton, Jann Horn, Pedro Falcato,
	Charan Teja Kalla, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, Matthew Wilcox

On 03.09.25 22:19, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250819 15:14]:
>> On Fri, Aug 15, 2025 at 03:10:29PM -0400, Liam R. Howlett wrote:
>>> The ceiling and tree search limit need to be different arguments for the
>>> future change in the failed fork attempt.
>>>
>>> No functional changes intended.
>>>
>>> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
>>
>> (Obv. in addition to comment about broken VMA tests :P)
>>
>> I guess intent is that if we discover any page tables beyond tree_max then
>> we ought to just wipe them all out so, in effect, we don't consider
>> mappings at or past tree_max to be valid?
> 
> Actually... there are some archs that map outside the vma and they are
> valid.. I think mips? and I think lower, but yeah.. it's needed.  This
> is why prev->vm_end and next->vm_start are used as page table limits,
> afaik.  This is a serious annoyance because it frequently adds walks
> that are infrequently necessary to the vma tree.

Hm, does that still exist?

I recall something odd ... was it that gate area thingy (in_gate_area) 
we also have to handle in GUP code? That is x86/arm though, not mips.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/6] mm/mmap: Abstract vma clean up from exit_mmap()
  2025-09-03 19:56     ` Liam R. Howlett
@ 2025-09-04 15:21       ` Lorenzo Stoakes
  0 siblings, 0 replies; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-09-04 15:21 UTC (permalink / raw)
  To: Liam R. Howlett, David Hildenbrand, maple-tree, linux-mm,
	linux-kernel, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Andrew Morton, Jann Horn, Pedro Falcato,
	Charan Teja Kalla, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, Matthew Wilcox

On Wed, Sep 03, 2025 at 03:56:03PM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250819 14:38]:
> > On Fri, Aug 15, 2025 at 03:10:27PM -0400, Liam R. Howlett wrote:
> > > Create the new function tear_down_vmas() to remove a range of vmas.
> > > exit_mmap() will be removing all the vmas.
> > >
> > > This is necessary for future patches.
> > >
> > > No functional changes intended.
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> >
> > This function is pure and complete insanity, but this change looks
> > good. Couple nits below.
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > > ---
> > >  mm/mmap.c | 37 ++++++++++++++++++++++++-------------
> > >  1 file changed, 24 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index c4c315b480af7..0995a48b46d59 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1250,6 +1250,29 @@ int vm_brk_flags(unsigned long addr, unsigned long request, vm_flags_t vm_flags)
> > >  }
> > >  EXPORT_SYMBOL(vm_brk_flags);
> > >
> > > +static inline
> > > +unsigned long tear_down_vmas(struct mm_struct *mm, struct vma_iterator *vmi,
> > > +		struct vm_area_struct *vma, unsigned long max)
> > > +{
> > > +	unsigned long nr_accounted = 0;
> > > +	int count = 0;
> > > +
> > > +	mmap_assert_write_locked(mm);
> > > +	vma_iter_set(vmi, vma->vm_end);
> > > +	do {
> > > +		if (vma->vm_flags & VM_ACCOUNT)
> > > +			nr_accounted += vma_pages(vma);
> > > +		vma_mark_detached(vma);
> > > +		remove_vma(vma);
> > > +		count++;
> > > +		cond_resched();
> > > +		vma = vma_next(vmi);
> > > +	} while (vma && vma->vm_end <= max);
> > > +
> > > +	BUG_ON(count != mm->map_count);
> >
> > Can we make this a WARN_ON() or WARN_ON_ONCE() while we're at it?
>
> Sure!

Thanks :)

>
> >
> > > +	return nr_accounted;
> > > +}
> > > +
> > >  /* Release all mmaps. */
> > >  void exit_mmap(struct mm_struct *mm)
> > >  {
> > > @@ -1257,7 +1280,6 @@ void exit_mmap(struct mm_struct *mm)
> > >  	struct vm_area_struct *vma;
> > >  	unsigned long nr_accounted = 0;
> >
> > No need to initialise this to 0 any more.
>
> There is a goto label below that skips calling the tear down, so this is
> still needed.

Ah yeah, sorry missed the goto destroy there. No worries then :)

>
> Thanks,
> Liam


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/6] mm/vma: Add limits to unmap_region() for vmas
  2025-09-03 19:57     ` Liam R. Howlett
@ 2025-09-04 15:23       ` Lorenzo Stoakes
  0 siblings, 0 replies; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-09-04 15:23 UTC (permalink / raw)
  To: Liam R. Howlett, David Hildenbrand, maple-tree, linux-mm,
	linux-kernel, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Andrew Morton, Jann Horn, Pedro Falcato,
	Charan Teja Kalla, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, Matthew Wilcox

On Wed, Sep 03, 2025 at 03:57:37PM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250819 14:48]:
> > On Fri, Aug 15, 2025 at 03:10:28PM -0400, Liam R. Howlett wrote:
> > > Add a limit to the vma search instead of using the start and end of the
> > > one passed in.
> > >
> > > No functional changes intended.
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > ---
> > >  mm/vma.c | 6 ++++--
> > >  mm/vma.h | 1 +
> > >  2 files changed, 5 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index 3b12c7579831b..fd270345c25d3 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -474,6 +474,7 @@ void remove_vma(struct vm_area_struct *vma)
> > >   * Called with the mm semaphore held.
> > >   */
> > >  void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> > > +		unsigned long vma_min, unsigned long vma_max,
> > >  		struct vm_area_struct *prev, struct vm_area_struct *next)
> >
> > Luckily we will be getting rid of this as soon as I complete the mmap_prepare +
> > friends series (_actively_ working on this right now).
> >
> > Anyway, this is getting quickly horrid, since this is only in vma.c, we could
> > make this static and pass through map to reduce some of the horror?
> >
> > We actually can hand it vmi (and thus &vmi->mas via map, so we could implement
> > it as:
> >
> > void unmap_region(struct mmap_state *map, struct vm_area_struct *vma,
> > 		  unsigned long min, unsigned long max);
> >
>
> Ah... I could do this on top of the last patch.. that might be more
> clear?

Ack, yeah I think so! Thanks! :)

>
> > >  {
> > >  	struct mm_struct *mm = vma->vm_mm;
> > > @@ -481,7 +482,7 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> > >
> > >  	tlb_gather_mmu(&tlb, mm);
> > >  	update_hiwater_rss(mm);
> > > -	unmap_vmas(&tlb, mas, vma, vma->vm_start, vma->vm_end, vma->vm_end,
> > > +	unmap_vmas(&tlb, mas, vma, vma_min, vma_max, vma_max,
> > >  		   /* mm_wr_locked = */ true);
> > >  	mas_set(mas, vma->vm_end);
> > >  	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> > > @@ -2417,7 +2418,8 @@ static int __mmap_new_file_vma(struct mmap_state *map,
> > >
> > >  		vma_iter_set(vmi, vma->vm_end);
> > >  		/* Undo any partial mapping done by a device driver. */
> > > -		unmap_region(&vmi->mas, vma, map->prev, map->next);
> > > +		unmap_region(&vmi->mas, vma, vma->vm_start, vma->vm_end,
> > > +			     map->prev, map->next);
> > >
> > >  		return error;
> > >  	}
> > > diff --git a/mm/vma.h b/mm/vma.h
> > > index b123a9cdedb0d..336dae295853e 100644
> > > --- a/mm/vma.h
> > > +++ b/mm/vma.h
> > > @@ -281,6 +281,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
> > >  void remove_vma(struct vm_area_struct *vma);
> > >
> > >  void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> > > +		unsigned long min, unsigned long max,
> > >  		struct vm_area_struct *prev, struct vm_area_struct *next);
> > >
> > >  /* We are about to modify the VMA's flags. */
> > > --
> > > 2.47.2
> > >


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables()
  2025-09-03 20:19     ` Liam R. Howlett
  2025-09-04 10:20       ` David Hildenbrand
@ 2025-09-04 15:33       ` Lorenzo Stoakes
  1 sibling, 0 replies; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-09-04 15:33 UTC (permalink / raw)
  To: Liam R. Howlett, David Hildenbrand, maple-tree, linux-mm,
	linux-kernel, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Andrew Morton, Jann Horn, Pedro Falcato,
	Charan Teja Kalla, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, Matthew Wilcox

On Wed, Sep 03, 2025 at 04:19:04PM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250819 15:14]:
> > On Fri, Aug 15, 2025 at 03:10:29PM -0400, Liam R. Howlett wrote:
> > > The ceiling and tree search limit need to be different arguments for the
> > > future change in the failed fork attempt.
> > >
> > > No functional changes intended.
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> >
> > (Obv. in addition to comment about broken VMA tests :P)
> >
> > I guess intent is that if we discover any page tables beyond tree_max then
> > we ought to just wipe them all out so, in effect, we don't consider
> > mappings at or past tree_max to be valid?
>
> Actually... there are some archs that map outside the vma and they are
> valid.. I think mips? and I think lower, but yeah.. it's needed.  This
> is why prev->vm_end and next->vm_start are used as page table limits,
> afaik.  This is a serious annoyance because it frequently adds walks
> that are infrequently necessary to the vma tree.

ugh god. I was vaguely aware of this but that's gross. Very gross.

I need to document all the VMA weirdnesses somewhere.

What do they do this for? Guard pages or something?

>
> >
> > I feel like we need a comment to this effect as this is confusing as it is.
> >
> > Could we add a kerneldoc comment for free_pgtables() spelling this out?
>
> I'll add a note here, but confusion will probably increase.  I'll add a
> note about the tree max as well.

Thanks!

>
> >
> > > ---
> > >  mm/internal.h | 4 +++-
> > >  mm/memory.c   | 7 ++++---
> > >  mm/mmap.c     | 2 +-
> > >  mm/vma.c      | 3 ++-
> > >  4 files changed, 10 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 45b725c3dc030..f9a278ac76d83 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -444,7 +444,9 @@ void folio_activate(struct folio *folio);
> > >
> > >  void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > >  		   struct vm_area_struct *start_vma, unsigned long floor,
> > > -		   unsigned long ceiling, bool mm_wr_locked);
> > > +		   unsigned long ceiling, unsigned long tree_max,
> > > +		   bool mm_wr_locked);
> > > +
> > >  void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
> > >
> > >  struct zap_details;
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 0ba4f6b718471..3346514562bba 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -371,7 +371,8 @@ void free_pgd_range(struct mmu_gather *tlb,
> > >
> > >  void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > >  		   struct vm_area_struct *vma, unsigned long floor,
> > > -		   unsigned long ceiling, bool mm_wr_locked)
> > > +		   unsigned long ceiling, unsigned long tree_max,
> > > +		   bool mm_wr_locked)
> > >  {
> > >  	struct unlink_vma_file_batch vb;
> > >
> > > @@ -385,7 +386,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > >  		 * Note: USER_PGTABLES_CEILING may be passed as ceiling and may
> > >  		 * be 0.  This will underflow and is okay.
> > >  		 */
> > > -		next = mas_find(mas, ceiling - 1);
> > > +		next = mas_find(mas, tree_max - 1);
> >
> > Do we need to put some sort of sanity checks in to make sure tree_max <= ceiling
> > (though this 0 case is a pain... so I guess tree_max - 1 <= ceiling - 1?)
>
> Sure!

Thanks!


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables()
  2025-09-04 10:20       ` David Hildenbrand
@ 2025-09-04 15:36         ` Lorenzo Stoakes
  0 siblings, 0 replies; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-09-04 15:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Liam R. Howlett, maple-tree, linux-mm, linux-kernel,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Andrew Morton, Jann Horn, Pedro Falcato, Charan Teja Kalla,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, Matthew Wilcox

On Thu, Sep 04, 2025 at 12:20:55PM +0200, David Hildenbrand wrote:
> On 03.09.25 22:19, Liam R. Howlett wrote:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250819 15:14]:
> > > On Fri, Aug 15, 2025 at 03:10:29PM -0400, Liam R. Howlett wrote:
> > > > The ceiling and tree search limit need to be different arguments for the
> > > > future change in the failed fork attempt.
> > > >
> > > > No functional changes intended.
> > > >
> > > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > >
> > > (Obv. in addition to comment about broken VMA tests :P)
> > >
> > > I guess intent is that if we discover any page tables beyond tree_max then
> > > we ought to just wipe them all out so, in effect, we don't consider
> > > mappings at or past tree_max to be valid?
> >
> > Actually... there are some archs that map outside the vma and they are
> > valid.. I think mips? and I think lower, but yeah.. it's needed.  This
> > is why prev->vm_end and next->vm_start are used as page table limits,
> > afaik.  This is a serious annoyance because it frequently adds walks
> > that are infrequently necessary to the vma tree.
>
> Hm, does that still exist?
>
> I recall something odd ... was it that gate area thingy (in_gate_area) we
> also have to handle in GUP code? That is x86/arm though, not mips.

Isn't gate area the VSYSCALL area so that's basically a kernel mapping address
that we allow userland to access?

>
> --
> Cheers
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 6/6] mm: Change dup_mmap() recovery
  2025-09-04  0:13     ` Liam R. Howlett
@ 2025-09-04 15:40       ` Lorenzo Stoakes
  0 siblings, 0 replies; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-09-04 15:40 UTC (permalink / raw)
  To: Liam R. Howlett, David Hildenbrand, maple-tree, linux-mm,
	linux-kernel, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Andrew Morton, Jann Horn, Pedro Falcato,
	Charan Teja Kalla, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, Matthew Wilcox

On Wed, Sep 03, 2025 at 08:13:45PM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250819 16:34]:
> > On Fri, Aug 15, 2025 at 03:10:31PM -0400, Liam R. Howlett wrote:
> > > When the dup_mmap() fails during the vma duplication or setup, don't
> > > write the XA_ZERO entry in the vma tree.  Instead, destroy the tree and
> > > free the new resources, leaving an empty vma tree.
> >
> > Yesss like it!
> >
> > >
> > > Using XA_ZERO introduced races where the vma could be found between
> > > dup_mmap() dropping all locks and exit_mmap() taking the locks.  The
> > > race can occur because the mm can be reached through the other trees
> > > via successfully copied vmas and other methods such as the swapoff code.
> >
> > Yeah god.
> >
> > >
> > > XA_ZERO was marking the location to stop vma removal and pagetable
> > > freeing.  The newly created arguments to the unmap_vmas() and
> > > free_pgtables() serve this function.
> >
> > Nice.
> >
> > >
> > > Replacing the XA_ZERO entry use with the new argument list also means
> > > the checks for xa_is_zero() are no longer necessary so these are also
> > > removed.
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> >
> > Ah finally the 'action' patch :)
> >
> > Obviously my review is on the basis that you fixup the rebase etc.
> >
> > This broadly looks right but some various nits etc. below and will have
> > another look through on new revision - this whole area is pretty crazy!
> >
> > > ---
> > >  mm/memory.c |  6 +-----
> > >  mm/mmap.c   | 39 ++++++++++++++++++++++++++++-----------
> > >  2 files changed, 29 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 3346514562bba..8cd58f171bae4 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -387,8 +387,6 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > >  		 * be 0.  This will underflow and is okay.
> > >  		 */
> > >  		next = mas_find(mas, tree_max - 1);
> > > -		if (unlikely(xa_is_zero(next)))
> > > -			next = NULL;
> > >
> > >  		/*
> > >  		 * Hide vma from rmap and truncate_pagecache before freeing
> > > @@ -407,8 +405,6 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > >  		while (next && next->vm_start <= vma->vm_end + PMD_SIZE) {
> > >  			vma = next;
> > >  			next = mas_find(mas, tree_max - 1);
> > > -			if (unlikely(xa_is_zero(next)))
> > > -				next = NULL;
> > >  			if (mm_wr_locked)
> > >  				vma_start_write(vma);
> > >  			unlink_anon_vmas(vma);
> > > @@ -1978,7 +1974,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> > >  				 mm_wr_locked);
> > >  		hugetlb_zap_end(vma, &details);
> > >  		vma = mas_find(mas, tree_end - 1);
> > > -	} while (vma && likely(!xa_is_zero(vma)));
> > > +	} while (vma);
> > >  	mmu_notifier_invalidate_range_end(&range);
> > >  }
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index eba2bc81bc749..5acc0b5f8c14a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1288,7 +1288,7 @@ void exit_mmap(struct mm_struct *mm)
> > >  	arch_exit_mmap(mm);
> > >
> > >  	vma = vma_next(&vmi);
> > > -	if (!vma || unlikely(xa_is_zero(vma))) {
> > > +	if (!vma) {
> > >  		/* Can happen if dup_mmap() received an OOM */
> > >  		mmap_read_unlock(mm);
> > >  		mmap_write_lock(mm);
> > > @@ -1858,20 +1858,37 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> > >  		ksm_fork(mm, oldmm);
> > >  		khugepaged_fork(mm, oldmm);
> > >  	} else {
> > > +		unsigned long max;
> > >
> > >  		/*
> > > -		 * The entire maple tree has already been duplicated. If the
> > > -		 * mmap duplication fails, mark the failure point with
> > > -		 * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
> > > -		 * stop releasing VMAs that have not been duplicated after this
> > > -		 * point.
> > > +		 * The entire maple tree has already been duplicated, but
> > > +		 * replacing the vmas failed at mpnt (which could be NULL if
> > > +		 * all were allocated but the last vma was not fully set up. Use
> >
> > Missing ')'.
>
> Thanks.
>
> >
> > > +		 * the start address of the failure point to clean up the half
> > > +		 * initialized tree.
> >
> > NIT: Is 'half' correct? Maybe 'partially'?
>
> Thanks.
>
> >
> > >  		 */
> > > -		if (mpnt) {
> > > -			mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
> > > -			mas_store(&vmi.mas, XA_ZERO_ENTRY);
> > > -			/* Avoid OOM iterating a broken tree */
> > > -			set_bit(MMF_OOM_SKIP, &mm->flags);
> > > +		if (!mm->map_count) {
> > > +			/* zero vmas were written to the new tree. */
> > > +			max = 0;
> >
> > OK I guess this then will result in the intentional underflow thing, maybe
> > worth mentioning?
>
> No, nothing will happen, as the if (max) check will evaluate to false.

Ahhh ok. This makes more sense :)

>
> >
> > > +		} else if (mpnt) {
> > > +			/* mid-tree failure */
> >
> > Partial?
>
> Right, thanks.
>
> >
> > > +			max = mpnt->vm_start;
> > > +		} else {
> > > +			/* All vmas were written to the new tree */
> > > +			max = ULONG_MAX;
> > >  		}
> > > +
> > > +		/* Hide mm from oom killer because the memory is being freed */
> > > +		set_bit(MMF_OOM_SKIP, &mm->flags);
> >
> > Obv. update to mm_flags_set(MMF_OOM_SKIP, mm);
>
> Right, happened on rebase.
>
> >
> > > +		if (max) {
> > > +			vma_iter_set(&vmi, 0);
> > > +			tmp = vma_next(&vmi);
> > > +			flush_cache_mm(mm);
> > > +			unmap_region(&vmi.mas, tmp, 0, max, max, NULL, NULL);
> >
> > (An aside for the unmap_region() impl, maybe let's name the pg_max as
> > tree_max to make it consistent across both?)
> >
> > Hm, we could still use the mmap_state struct here if we put it in vma.h.
> > Just have to set vmi, and obviously set prev and next to NULL.
> >
> > So:
> >
> > 	struct mmap_state map = {
> > 		.vmi = &vmi,
> > 	};
> >
> > 	unmap_region(&map, tmp, 0, max, max);
> >
> > Possibly overkill and hopefully stack ok but makes other callers less
> > horrid.
> >
> > Maybe also good to add a comment spelling out what each of these params does.
>
> So I've tried to do this cleanly but it hasn't worked out.  I think I'll
> add another struct to clean this up in another patch in this series.

Moar helper structs! Yes yes yes! ;)
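
Something along these lines, maybe.  Purely a rough sketch, with hypothetical
struct and field names (whatever the new patch ends up calling things), just
to show the limits travelling together instead of a wall of positional
arguments:

	/* Hypothetical: bundle the unmap limits handed down from dup_mmap()/exit_mmap(). */
	struct unmap_desc {
		struct ma_state *mas;		/* iterator positioned at the first vma */
		struct vm_area_struct *first;	/* first vma to tear down */
		unsigned long min;		/* lowest address to unmap */
		unsigned long vma_max;		/* vma iteration stops below this address */
		unsigned long tree_max;		/* page table freeing stops below this */
		struct vm_area_struct *prev;	/* neighbour below the range, or NULL */
		struct vm_area_struct *next;	/* neighbour above the range, or NULL */
	};

	/* The dup_mmap() recovery path would then read roughly: */
	struct unmap_desc desc = {
		.mas      = &vmi.mas,
		.first    = tmp,
		.min      = 0,
		.vma_max  = max,
		.tree_max = max,
		/* .prev and .next default to NULL */
	};
	unmap_region(&desc);

Whether that lives next to mmap_state in vma.h or is its own thing doesn't
matter much; the point is just that the caller only names what differs.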

>
> Thanks,
> Liam

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2025-09-04 15:41 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-15 19:10 [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Liam R. Howlett
2025-08-15 19:10 ` [RFC PATCH 1/6] mm/mmap: Move exit_mmap() trace point Liam R. Howlett
2025-08-19 18:27   ` Lorenzo Stoakes
2025-08-21 21:12   ` Chris Li
2025-08-15 19:10 ` [RFC PATCH 2/6] mm/mmap: Abstract vma clean up from exit_mmap() Liam R. Howlett
2025-08-19 18:38   ` Lorenzo Stoakes
2025-09-03 19:56     ` Liam R. Howlett
2025-09-04 15:21       ` Lorenzo Stoakes
2025-08-15 19:10 ` [RFC PATCH 3/6] mm/vma: Add limits to unmap_region() for vmas Liam R. Howlett
2025-08-19 18:48   ` Lorenzo Stoakes
2025-09-03 19:57     ` Liam R. Howlett
2025-09-04 15:23       ` Lorenzo Stoakes
2025-08-15 19:10 ` [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables() Liam R. Howlett
2025-08-18 15:36   ` Lorenzo Stoakes
2025-08-18 15:54     ` Liam R. Howlett
2025-08-19 19:14   ` Lorenzo Stoakes
2025-09-03 20:19     ` Liam R. Howlett
2025-09-04 10:20       ` David Hildenbrand
2025-09-04 15:36         ` Lorenzo Stoakes
2025-09-04 15:33       ` Lorenzo Stoakes
2025-08-15 19:10 ` [RFC PATCH 5/6] mm/vma: Add page table limit to unmap_region() Liam R. Howlett
2025-08-19 19:27   ` Lorenzo Stoakes
2025-08-15 19:10 ` [RFC PATCH 6/6] mm: Change dup_mmap() recovery Liam R. Howlett
2025-08-18 15:12   ` Lorenzo Stoakes
2025-08-18 15:29     ` Lorenzo Stoakes
2025-08-19 20:33   ` Lorenzo Stoakes
2025-09-04  0:13     ` Liam R. Howlett
2025-09-04 15:40       ` Lorenzo Stoakes
2025-08-15 19:49 ` [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Jann Horn
2025-08-18 15:48   ` Liam R. Howlett
2025-08-18  9:44 ` David Hildenbrand
2025-08-18 14:26   ` Charan Teja Kalla
2025-08-18 14:54     ` Liam R. Howlett
2025-08-18 15:47   ` Lorenzo Stoakes
