* [PATCH v8 00/21] Avoid MAP_FIXED gap exposure
@ 2024-08-30  4:00 Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 01/21] mm/vma: Correctly position vma_iterator in __split_vma() Liam R. Howlett
                   ` (21 more replies)
  0 siblings, 22 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

It is now possible to walk the vma tree using the rcu read lock, and it is
beneficial to do so to reduce lock contention.  Doing so while a
MAP_FIXED mapping is executing means that a reader may see a gap in the
vma tree that should never logically exist - and does not when using the
mmap lock in read mode.  The temporal gap exists because mmap_region()
calls munmap() prior to installing the new mapping.

This patch set stops rcu readers from seeing the temporal gap by
splitting up the munmap() function into two parts.  The first part
prepares the vma tree for modifications by doing the necessary splits
and tracks the vmas marked for removal in a side tree.  The second part
completes the munmapping of the vmas after the vma tree has been
overwritten (either by a MAP_FIXED replacement vma or by a NULL in the
munmap() case).
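
As a rough illustration of that split, here is a condensed sketch built from
the helpers introduced in this series (error handling, locking, and the vma
iterator setup are omitted; the real call sites are in the mmap_region()
patches below):

	struct vma_munmap_struct vms;
	struct maple_tree mt_detach;
	struct ma_state mas_detach;
	struct vm_area_struct *vma;
	int error;

	/* Part one: split the boundary vmas and track everything that will
	 * be removed in a side (detached) maple tree. */
	vma = vma_find(&vmi, end);
	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
	if (vma) {
		mt_init_flags(&mt_detach,
			      vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
		mt_on_stack(mt_detach);
		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
		error = vms_gather_munmap_vmas(&vms, &mas_detach);
		if (error)
			return error;
	}

	/*
	 * Overwrite the range in the vma tree: the MAP_FIXED replacement vma
	 * is stored here, or the range is cleared for a plain munmap().
	 */

	/* Part two: only now unmap, account, and free the detached vmas, so
	 * rcu readers never see a gap that should not exist. */
	if (vma)
		vms_complete_munmap_vmas(&vms, &mas_detach);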

Please note that rcu walkers will still be able to see a temporary state
of split vmas that may be in the process of being removed, but the
temporal gap will not be exposed.  vma_start_write() is called on both
parts of the split vma, so this state is detectable.
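
For illustration only (these helpers are pre-existing and not part of this
series): a lockless reader built on the per-VMA locks simply falls back to
the mmap lock when it encounters one of these write-locked, mid-split vmas:

	struct vm_area_struct *vma;

	/* Fast path: per-VMA lock under rcu.  This fails while the vma is
	 * write-locked, e.g. while vms_gather_munmap_vmas() is splitting it. */
	vma = lock_vma_under_rcu(mm, addr);
	if (vma) {
		/* ... read what is needed from the vma ... */
		vma_end_read(vma);
		return;
	}

	/* Slow path: fall back to the mmap lock in read mode. */
	mmap_read_lock(mm);
	vma = vma_lookup(mm, addr);
	/* ... read what is needed from the vma ... */
	mmap_read_unlock(mm);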

If existing vmas have a vm_ops->close(), then the close() handlers will be
called prior to mapping the new vmas (and the ptes are cleared out).  Without calling
->close(), hugetlbfs tests fail (hugemmap06 specifically) due to
resources still being marked as 'busy'.  Unfortunately, calling the
corresponding ->open() may not restore the state of the vmas, so it is
safer to keep the existing failure scenario where a gap is inserted and
never replaced.  The failure scenario is in its own patch (0015) for
traceability.

RFC: https://lore.kernel.org/linux-mm/20240531163217.1584450-1-Liam.Howlett@oracle.com/
v1: https://lore.kernel.org/linux-mm/20240611180200.711239-1-Liam.Howlett@oracle.com/
v2: https://lore.kernel.org/all/20240625191145.3382793-1-Liam.Howlett@oracle.com/
v3: https://lore.kernel.org/linux-mm/20240704182718.2653918-1-Liam.Howlett@oracle.com/
v4: https://lore.kernel.org/linux-mm/20240710192250.4114783-1-Liam.Howlett@oracle.com/
v5: https://lore.kernel.org/linux-mm/20240717200709.1552558-1-Liam.Howlett@oracle.com/
v6: https://lore.kernel.org/all/20240820235730.2852400-1-Liam.Howlett@oracle.com/
v7: https://lore.kernel.org/all/20240822192543.3359552-1-Liam.Howlett@oracle.com/

Changes since v7:

This version includes all the patches I've sent as v7 fixups, plus the fix
for the mseal() return code.  The incorrect return code was introduced in an
earlier patch and then modified (still incorrectly) in a later one, so this
version should hopefully bisect cleanly.

- Fixed return value of vms_gather_munmap_vmas() to be -ENOMEM or -EPERM
- Passed through error returned from vms_gather_munmap_vmas() in
  mmap_region() - Thanks Jeff
- Added review tag on last patch - Thanks Lorenzo
- Added #ifdef CONFIG_MMU to vma.h where necessary - Thanks Lorenzo,
  Bert, and Geert
- Fixed null pointer dereference in vms_abort_munmap_vmas() - Thanks Dan

Liam R. Howlett (21):
  mm/vma: Correctly position vma_iterator in __split_vma()
  mm/vma: Introduce abort_munmap_vmas()
  mm/vma: Introduce vmi_complete_munmap_vmas()
  mm/vma: Extract the gathering of vmas from do_vmi_align_munmap()
  mm/vma: Introduce vma_munmap_struct for use in munmap operations
  mm/vma: Change munmap to use vma_munmap_struct() for accounting and
    surrounding vmas
  mm/vma: Extract validate_mm() from vma_complete()
  mm/vma: Inline munmap operation in mmap_region()
  mm/vma: Expand mmap_region() munmap call
  mm/vma: Support vma == NULL in init_vma_munmap()
  mm/mmap: Reposition vma iterator in mmap_region()
  mm/vma: Track start and end for munmap in vma_munmap_struct
  mm: Clean up unmap_region() argument list
  mm/mmap: Avoid zeroing vma tree in mmap_region()
  mm: Change failure of MAP_FIXED to restoring the gap on failure
  mm/mmap: Use PHYS_PFN in mmap_region()
  mm/mmap: Use vms accounted pages in mmap_region()
  ipc/shm, mm: Drop do_vma_munmap()
  mm: Move may_expand_vm() check in mmap_region()
  mm/vma: Drop incorrect comment from vms_gather_munmap_vmas()
  mm/vma.h: Optimise vma_munmap_struct

 include/linux/mm.h |   6 +-
 ipc/shm.c          |   8 +-
 mm/mmap.c          | 140 +++++++++---------
 mm/vma.c           | 362 +++++++++++++++++++++++++++------------------
 mm/vma.h           | 168 ++++++++++++++++++---
 5 files changed, 434 insertions(+), 250 deletions(-)

-- 
2.43.0




* [PATCH v8 01/21] mm/vma: Correctly position vma_iterator in __split_vma()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 02/21] mm/vma: Introduce abort_munmap_vmas() Liam R. Howlett
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett,
	Lorenzo Stoakes

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The vma iterator may be left pointing to the newly created vma.  This
happens when inserting the new vma at the end of the old vma
(!new_below).

The incorrect position in the vma iterator is not exposed currently
since the vma iterator is repositioned in the munmap path and is not
reused in any of the other paths.

This has limited impact in the current code, but is required for future
changes.
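
For illustration, a hypothetical caller (not part of this patch) that relies
on the corrected positioning:

	/* Split off [addr, vma->vm_end); 'vma' keeps the lower part. */
	if (__split_vma(vmi, vma, addr, /* new_below = */ 0))
		return -ENOMEM;

	/*
	 * The iterator is now left on the original (clipped) vma for either
	 * value of new_below, so the very next entry is the newly created
	 * vma that starts at addr.
	 */
	next = vma_next(vmi);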

Fixes: b2b3b886738f ("mm: don't use __vma_adjust() in __split_vma()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 mm/vma.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/vma.c b/mm/vma.c
index 5850f7c0949b..066de79b7b73 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -177,7 +177,7 @@ void unmap_region(struct mm_struct *mm, struct ma_state *mas,
 /*
  * __split_vma() bypasses sysctl_max_map_count checking.  We use this where it
  * has already been checked or doesn't make sense to fail.
- * VMA Iterator will point to the end VMA.
+ * VMA Iterator will point to the original VMA.
  */
 static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		       unsigned long addr, int new_below)
@@ -246,6 +246,9 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	/* Success. */
 	if (new_below)
 		vma_next(vmi);
+	else
+		vma_prev(vmi);
+
 	return 0;
 
 out_free_mpol:
-- 
2.43.0




* [PATCH v8 02/21] mm/vma: Introduce abort_munmap_vmas()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 01/21] mm/vma: Correctly position vma_iterator in __split_vma() Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 03/21] mm/vma: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett,
	Liam R . Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Extract the clean-up of failed munmap() operations from
do_vmi_align_munmap().  This simplifies later patches in the series.

It is worth noting that the mas_for_each() loop now has a different
upper limit.  This should not change the number of vmas visited for
reattaching to the main vma tree (mm_mt), as all vmas are reattached in
both scenarios.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/vma.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index 066de79b7b73..58ecd447670d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -668,6 +668,22 @@ void vma_complete(struct vma_prepare *vp,
 	validate_mm(mm);
 }
 
+/*
+ * abort_munmap_vmas - Undo any munmap work and free resources
+ *
+ * Reattach any detached vmas and free up the maple tree used to track the vmas.
+ */
+static inline void abort_munmap_vmas(struct ma_state *mas_detach)
+{
+	struct vm_area_struct *vma;
+
+	mas_set(mas_detach, 0);
+	mas_for_each(mas_detach, vma, ULONG_MAX)
+		vma_mark_detached(vma, false);
+
+	__mt_destroy(mas_detach->tree);
+}
+
 /*
  * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
  * @vmi: The vma iterator
@@ -834,11 +850,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 userfaultfd_error:
 munmap_gather_failed:
 end_split_failed:
-	mas_set(&mas_detach, 0);
-	mas_for_each(&mas_detach, next, end)
-		vma_mark_detached(next, false);
-
-	__mt_destroy(&mt_detach);
+	abort_munmap_vmas(&mas_detach);
 start_split_failed:
 map_count_exceeded:
 	validate_mm(mm);
-- 
2.43.0




* [PATCH v8 03/21] mm/vma: Introduce vmi_complete_munmap_vmas()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 01/21] mm/vma: Correctly position vma_iterator in __split_vma() Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 02/21] mm/vma: Introduce abort_munmap_vmas() Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 04/21] mm/vma: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett,
	Liam R . Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Extract the operations that must be completed after the vma maple tree is
updated by a munmap() operation.  Extracting this makes the later patch in
the series easier to understand.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/vma.c | 80 ++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 55 insertions(+), 25 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index 58ecd447670d..3a2098464b8f 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -684,6 +684,58 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
 	__mt_destroy(mas_detach->tree);
 }
 
+/*
+ * vmi_complete_munmap_vmas() - Finish the munmap() operation
+ * @vmi: The vma iterator
+ * @vma: The first vma to be munmapped
+ * @mm: The mm struct
+ * @start: The start address
+ * @end: The end address
+ * @unlock: Unlock the mm or not
+ * @mas_detach: them maple state of the detached vma maple tree
+ * @locked_vm: The locked_vm count in the detached vmas
+ *
+ * This function updates the mm_struct, unmaps the region, frees the resources
+ * used for the munmap() and may downgrade the lock - if requested.  Everything
+ * needed to be done once the vma maple tree is updated.
+ */
+static void
+vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
+		struct mm_struct *mm, unsigned long start, unsigned long end,
+		bool unlock, struct ma_state *mas_detach,
+		unsigned long locked_vm)
+{
+	struct vm_area_struct *prev, *next;
+	int count;
+
+	count = mas_detach->index + 1;
+	mm->map_count -= count;
+	mm->locked_vm -= locked_vm;
+	if (unlock)
+		mmap_write_downgrade(mm);
+
+	prev = vma_iter_prev_range(vmi);
+	next = vma_next(vmi);
+	if (next)
+		vma_iter_prev_range(vmi);
+
+	/*
+	 * We can free page tables without write-locking mmap_lock because VMAs
+	 * were isolated before we downgraded mmap_lock.
+	 */
+	mas_set(mas_detach, 1);
+	unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
+		     !unlock);
+	/* Statistics and freeing VMAs */
+	mas_set(mas_detach, 0);
+	remove_mt(mm, mas_detach);
+	validate_mm(mm);
+	if (unlock)
+		mmap_read_unlock(mm);
+
+	__mt_destroy(mas_detach->tree);
+}
+
 /*
  * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
  * @vmi: The vma iterator
@@ -703,7 +755,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		    struct mm_struct *mm, unsigned long start,
 		    unsigned long end, struct list_head *uf, bool unlock)
 {
-	struct vm_area_struct *prev, *next = NULL;
+	struct vm_area_struct *next = NULL;
 	struct maple_tree mt_detach;
 	int count = 0;
 	int error = -ENOMEM;
@@ -818,31 +870,9 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		goto clear_tree_failed;
 
 	/* Point of no return */
-	mm->locked_vm -= locked_vm;
-	mm->map_count -= count;
-	if (unlock)
-		mmap_write_downgrade(mm);
-
-	prev = vma_iter_prev_range(vmi);
-	next = vma_next(vmi);
-	if (next)
-		vma_iter_prev_range(vmi);
-
-	/*
-	 * We can free page tables without write-locking mmap_lock because VMAs
-	 * were isolated before we downgraded mmap_lock.
-	 */
-	mas_set(&mas_detach, 1);
-	unmap_region(mm, &mas_detach, vma, prev, next, start, end, count,
-		     !unlock);
-	/* Statistics and freeing VMAs */
-	mas_set(&mas_detach, 0);
-	remove_mt(mm, &mas_detach);
-	validate_mm(mm);
-	if (unlock)
-		mmap_read_unlock(mm);
+	vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
+				 locked_vm);
 
-	__mt_destroy(&mt_detach);
 	return 0;
 
 modify_vma_failed:
-- 
2.43.0




* [PATCH v8 04/21] mm/vma: Extract the gathering of vmas from do_vmi_align_munmap()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (2 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 03/21] mm/vma: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 05/21] mm/vma: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett,
	Liam R . Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a
detached maple tree for removal later.  Part of the gathering is the
splitting of vmas that span the boundary.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/vma.c | 95 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 62 insertions(+), 33 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index 3a2098464b8f..f691c1db5b12 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -737,32 +737,30 @@ vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 }
 
 /*
- * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
+ * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
+ * for removal at a later date.  Handles splitting first and last if necessary
+ * and marking the vmas as isolated.
+ *
  * @vmi: The vma iterator
  * @vma: The starting vm_area_struct
  * @mm: The mm_struct
  * @start: The aligned start address to munmap.
  * @end: The aligned end address to munmap.
  * @uf: The userfaultfd list_head
- * @unlock: Set to true to drop the mmap_lock.  unlocking only happens on
- * success.
+ * @mas_detach: The maple state tracking the detached tree
+ * @locked_vm: a pointer to store the VM_LOCKED pages count.
  *
- * Return: 0 on success and drops the lock if so directed, error and leaves the
- * lock held otherwise.
+ * Return: 0 on success, -EPERM on mseal vmas, -ENOMEM otherwise
  */
-int
-do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+static int
+vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		    struct mm_struct *mm, unsigned long start,
-		    unsigned long end, struct list_head *uf, bool unlock)
+		    unsigned long end, struct list_head *uf,
+		    struct ma_state *mas_detach, unsigned long *locked_vm)
 {
 	struct vm_area_struct *next = NULL;
-	struct maple_tree mt_detach;
 	int count = 0;
 	int error = -ENOMEM;
-	unsigned long locked_vm = 0;
-	MA_STATE(mas_detach, &mt_detach, 0, 0);
-	mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
-	mt_on_stack(mt_detach);
 
 	/*
 	 * If we need to split any vma, do it now to save pain later.
@@ -789,8 +787,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 			goto start_split_failed;
 		}
 
-		error = __split_vma(vmi, vma, start, 1);
-		if (error)
+		if (__split_vma(vmi, vma, start, 1))
 			goto start_split_failed;
 	}
 
@@ -807,20 +804,18 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 
 		/* Does it split the end? */
 		if (next->vm_end > end) {
-			error = __split_vma(vmi, next, end, 0);
-			if (error)
+			if (__split_vma(vmi, next, end, 0))
 				goto end_split_failed;
 		}
 		vma_start_write(next);
-		mas_set(&mas_detach, count);
-		error = mas_store_gfp(&mas_detach, next, GFP_KERNEL);
-		if (error)
+		mas_set(mas_detach, count++);
+		if (mas_store_gfp(mas_detach, next, GFP_KERNEL))
 			goto munmap_gather_failed;
+
 		vma_mark_detached(next, true);
 		if (next->vm_flags & VM_LOCKED)
-			locked_vm += vma_pages(next);
+			*locked_vm += vma_pages(next);
 
-		count++;
 		if (unlikely(uf)) {
 			/*
 			 * If userfaultfd_unmap_prep returns an error the vmas
@@ -831,9 +826,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 			 * split, despite we could. This is unlikely enough
 			 * failure that it's not worth optimizing it for.
 			 */
-			error = userfaultfd_unmap_prep(next, start, end, uf);
-
-			if (error)
+			if (userfaultfd_unmap_prep(next, start, end, uf))
 				goto userfaultfd_error;
 		}
 #ifdef CONFIG_DEBUG_VM_MAPLE_TREE
@@ -845,7 +838,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
 	{
-		MA_STATE(test, &mt_detach, 0, 0);
+		MA_STATE(test, mas_detach->tree, 0, 0);
 		struct vm_area_struct *vma_mas, *vma_test;
 		int test_count = 0;
 
@@ -865,6 +858,48 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	while (vma_iter_addr(vmi) > start)
 		vma_iter_prev_range(vmi);
 
+	return 0;
+
+userfaultfd_error:
+munmap_gather_failed:
+end_split_failed:
+modify_vma_failed:
+	abort_munmap_vmas(mas_detach);
+start_split_failed:
+map_count_exceeded:
+	return error;
+}
+
+/*
+ * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
+ * @vmi: The vma iterator
+ * @vma: The starting vm_area_struct
+ * @mm: The mm_struct
+ * @start: The aligned start address to munmap.
+ * @end: The aligned end address to munmap.
+ * @uf: The userfaultfd list_head
+ * @unlock: Set to true to drop the mmap_lock.  unlocking only happens on
+ * success.
+ *
+ * Return: 0 on success and drops the lock if so directed, error and leaves the
+ * lock held otherwise.
+ */
+int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+		struct mm_struct *mm, unsigned long start, unsigned long end,
+		struct list_head *uf, bool unlock)
+{
+	struct maple_tree mt_detach;
+	MA_STATE(mas_detach, &mt_detach, 0, 0);
+	mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+	mt_on_stack(mt_detach);
+	int error;
+	unsigned long locked_vm = 0;
+
+	error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
+				       &mas_detach, &locked_vm);
+	if (error)
+		goto gather_failed;
+
 	error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
 	if (error)
 		goto clear_tree_failed;
@@ -872,17 +907,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	/* Point of no return */
 	vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
 				 locked_vm);
-
 	return 0;
 
-modify_vma_failed:
 clear_tree_failed:
-userfaultfd_error:
-munmap_gather_failed:
-end_split_failed:
 	abort_munmap_vmas(&mas_detach);
-start_split_failed:
-map_count_exceeded:
+gather_failed:
 	validate_mm(mm);
 	return error;
 }
-- 
2.43.0




* [PATCH v8 05/21] mm/vma: Introduce vma_munmap_struct for use in munmap operations
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (3 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 04/21] mm/vma: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 06/21] mm/vma: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett,
	Liam R . Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Use a structure to pass along all the necessary information and counters
involved in removing vmas from the mm_struct.

Update the vmi_ function name prefixes to vms_ to indicate the change in the
type of the first argument.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/vma.c | 140 +++++++++++++++++++++++++++++--------------------------
 mm/vma.h |  16 +++++++
 2 files changed, 90 insertions(+), 66 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index f691c1db5b12..f24b52a87458 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -80,6 +80,32 @@ static void init_multi_vma_prep(struct vma_prepare *vp,
 
 }
 
+/*
+ * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
+ * @vms: The vma munmap struct
+ * @vmi: The vma iterator
+ * @vma: The first vm_area_struct to munmap
+ * @start: The aligned start address to munmap
+ * @end: The aligned end address to munmap
+ * @uf: The userfaultfd list_head
+ * @unlock: Unlock after the operation.  Only unlocked on success
+ */
+static inline void init_vma_munmap(struct vma_munmap_struct *vms,
+		struct vma_iterator *vmi, struct vm_area_struct *vma,
+		unsigned long start, unsigned long end, struct list_head *uf,
+		bool unlock)
+{
+	vms->vmi = vmi;
+	vms->vma = vma;
+	vms->mm = vma->vm_mm;
+	vms->start = start;
+	vms->end = end;
+	vms->unlock = unlock;
+	vms->uf = uf;
+	vms->vma_count = 0;
+	vms->nr_pages = vms->locked_vm = 0;
+}
+
 /*
  * Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff)
  * in front of (at a lower virtual address and file offset than) the vma.
@@ -685,81 +711,62 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
 }
 
 /*
- * vmi_complete_munmap_vmas() - Finish the munmap() operation
- * @vmi: The vma iterator
- * @vma: The first vma to be munmapped
- * @mm: The mm struct
- * @start: The start address
- * @end: The end address
- * @unlock: Unlock the mm or not
- * @mas_detach: them maple state of the detached vma maple tree
- * @locked_vm: The locked_vm count in the detached vmas
+ * vms_complete_munmap_vmas() - Finish the munmap() operation
+ * @vms: The vma munmap struct
+ * @mas_detach: The maple state of the detached vmas
  *
- * This function updates the mm_struct, unmaps the region, frees the resources
+ * This updates the mm_struct, unmaps the region, frees the resources
  * used for the munmap() and may downgrade the lock - if requested.  Everything
  * needed to be done once the vma maple tree is updated.
  */
-static void
-vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
-		struct mm_struct *mm, unsigned long start, unsigned long end,
-		bool unlock, struct ma_state *mas_detach,
-		unsigned long locked_vm)
+static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach)
 {
 	struct vm_area_struct *prev, *next;
-	int count;
+	struct mm_struct *mm;
 
-	count = mas_detach->index + 1;
-	mm->map_count -= count;
-	mm->locked_vm -= locked_vm;
-	if (unlock)
+	mm = vms->mm;
+	mm->map_count -= vms->vma_count;
+	mm->locked_vm -= vms->locked_vm;
+	if (vms->unlock)
 		mmap_write_downgrade(mm);
 
-	prev = vma_iter_prev_range(vmi);
-	next = vma_next(vmi);
+	prev = vma_iter_prev_range(vms->vmi);
+	next = vma_next(vms->vmi);
 	if (next)
-		vma_iter_prev_range(vmi);
+		vma_iter_prev_range(vms->vmi);
 
 	/*
 	 * We can free page tables without write-locking mmap_lock because VMAs
 	 * were isolated before we downgraded mmap_lock.
 	 */
 	mas_set(mas_detach, 1);
-	unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
-		     !unlock);
+	unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
+		     vms->vma_count, !vms->unlock);
 	/* Statistics and freeing VMAs */
 	mas_set(mas_detach, 0);
 	remove_mt(mm, mas_detach);
 	validate_mm(mm);
-	if (unlock)
+	if (vms->unlock)
 		mmap_read_unlock(mm);
 
 	__mt_destroy(mas_detach->tree);
 }
 
 /*
- * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
+ * vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
  * for removal at a later date.  Handles splitting first and last if necessary
  * and marking the vmas as isolated.
  *
- * @vmi: The vma iterator
- * @vma: The starting vm_area_struct
- * @mm: The mm_struct
- * @start: The aligned start address to munmap.
- * @end: The aligned end address to munmap.
- * @uf: The userfaultfd list_head
+ * @vms: The vma munmap struct
  * @mas_detach: The maple state tracking the detached tree
- * @locked_vm: a pointer to store the VM_LOCKED pages count.
  *
  * Return: 0 on success, -EPERM on mseal vmas, -ENOMEM otherwise
  */
-static int
-vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
-		    struct mm_struct *mm, unsigned long start,
-		    unsigned long end, struct list_head *uf,
-		    struct ma_state *mas_detach, unsigned long *locked_vm)
+static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach)
 {
 	struct vm_area_struct *next = NULL;
-	int count = 0;
 	int error = -ENOMEM;
 
 	/*
@@ -771,23 +778,24 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	 */
 
 	/* Does it split the first one? */
-	if (start > vma->vm_start) {
+	if (vms->start > vms->vma->vm_start) {
 
 		/*
 		 * Make sure that map_count on return from munmap() will
 		 * not exceed its limit; but let map_count go just above
 		 * its limit temporarily, to help free resources as expected.
 		 */
-		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
+		if (vms->end < vms->vma->vm_end &&
+		    vms->mm->map_count >= sysctl_max_map_count)
 			goto map_count_exceeded;
 
 		/* Don't bother splitting the VMA if we can't unmap it anyway */
-		if (!can_modify_vma(vma)) {
+		if (!can_modify_vma(vms->vma)) {
 			error = -EPERM;
 			goto start_split_failed;
 		}
 
-		if (__split_vma(vmi, vma, start, 1))
+		if (__split_vma(vms->vmi, vms->vma, vms->start, 1))
 			goto start_split_failed;
 	}
 
@@ -795,7 +803,7 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	 * Detach a range of VMAs from the mm. Using next as a temp variable as
 	 * it is always overwritten.
 	 */
-	next = vma;
+	next = vms->vma;
 	do {
 		if (!can_modify_vma(next)) {
 			error = -EPERM;
@@ -803,20 +811,20 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		}
 
 		/* Does it split the end? */
-		if (next->vm_end > end) {
-			if (__split_vma(vmi, next, end, 0))
+		if (next->vm_end > vms->end) {
+			if (__split_vma(vms->vmi, next, vms->end, 0))
 				goto end_split_failed;
 		}
 		vma_start_write(next);
-		mas_set(mas_detach, count++);
+		mas_set(mas_detach, vms->vma_count++);
 		if (mas_store_gfp(mas_detach, next, GFP_KERNEL))
 			goto munmap_gather_failed;
 
 		vma_mark_detached(next, true);
 		if (next->vm_flags & VM_LOCKED)
-			*locked_vm += vma_pages(next);
+			vms->locked_vm += vma_pages(next);
 
-		if (unlikely(uf)) {
+		if (unlikely(vms->uf)) {
 			/*
 			 * If userfaultfd_unmap_prep returns an error the vmas
 			 * will remain split, but userland will get a
@@ -826,14 +834,15 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 			 * split, despite we could. This is unlikely enough
 			 * failure that it's not worth optimizing it for.
 			 */
-			if (userfaultfd_unmap_prep(next, start, end, uf))
+			if (userfaultfd_unmap_prep(next, vms->start, vms->end,
+						   vms->uf))
 				goto userfaultfd_error;
 		}
 #ifdef CONFIG_DEBUG_VM_MAPLE_TREE
-		BUG_ON(next->vm_start < start);
-		BUG_ON(next->vm_start > end);
+		BUG_ON(next->vm_start < vms->start);
+		BUG_ON(next->vm_start > vms->end);
 #endif
-	} for_each_vma_range(*vmi, next, end);
+	} for_each_vma_range(*(vms->vmi), next, vms->end);
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
@@ -842,21 +851,21 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		struct vm_area_struct *vma_mas, *vma_test;
 		int test_count = 0;
 
-		vma_iter_set(vmi, start);
+		vma_iter_set(vms->vmi, vms->start);
 		rcu_read_lock();
-		vma_test = mas_find(&test, count - 1);
-		for_each_vma_range(*vmi, vma_mas, end) {
+		vma_test = mas_find(&test, vms->vma_count - 1);
+		for_each_vma_range(*(vms->vmi), vma_mas, vms->end) {
 			BUG_ON(vma_mas != vma_test);
 			test_count++;
-			vma_test = mas_next(&test, count - 1);
+			vma_test = mas_next(&test, vms->vma_count - 1);
 		}
 		rcu_read_unlock();
-		BUG_ON(count != test_count);
+		BUG_ON(vms->vma_count != test_count);
 	}
 #endif
 
-	while (vma_iter_addr(vmi) > start)
-		vma_iter_prev_range(vmi);
+	while (vma_iter_addr(vms->vmi) > vms->start)
+		vma_iter_prev_range(vms->vmi);
 
 	return 0;
 
@@ -892,11 +901,11 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	MA_STATE(mas_detach, &mt_detach, 0, 0);
 	mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
 	mt_on_stack(mt_detach);
+	struct vma_munmap_struct vms;
 	int error;
-	unsigned long locked_vm = 0;
 
-	error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
-				       &mas_detach, &locked_vm);
+	init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
+	error = vms_gather_munmap_vmas(&vms, &mas_detach);
 	if (error)
 		goto gather_failed;
 
@@ -905,8 +914,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		goto clear_tree_failed;
 
 	/* Point of no return */
-	vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
-				 locked_vm);
+	vms_complete_munmap_vmas(&vms, &mas_detach);
 	return 0;
 
 clear_tree_failed:
diff --git a/mm/vma.h b/mm/vma.h
index da31d0f62157..cb67acf59012 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -26,6 +26,22 @@ struct unlink_vma_file_batch {
 	struct vm_area_struct *vmas[8];
 };
 
+/*
+ * vma munmap operation
+ */
+struct vma_munmap_struct {
+	struct vma_iterator *vmi;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;     /* The first vma to munmap */
+	struct list_head *uf;           /* Userfaultfd list_head */
+	unsigned long start;            /* Aligned start addr (inclusive) */
+	unsigned long end;              /* Aligned end addr (exclusive) */
+	int vma_count;                  /* Number of vmas that will be removed */
+	unsigned long nr_pages;         /* Number of pages being removed */
+	unsigned long locked_vm;        /* Number of locked pages */
+	bool unlock;                    /* Unlock after the munmap */
+};
+
 #ifdef CONFIG_DEBUG_VM_MAPLE_TREE
 void validate_mm(struct mm_struct *mm);
 #else
-- 
2.43.0




* [PATCH v8 06/21] mm/vma: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (4 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 05/21] mm/vma: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 07/21] mm/vma: Extract validate_mm() from vma_complete() Liam R. Howlett
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett,
	Liam R . Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Clean up the code by changing the munmap operation to use a structure
for the accounting and munmap variables.

Since remove_mt() is only called in one location and its contents will be
reduced to almost nothing, the remains of the function can be added to
vms_complete_munmap_vmas().

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/vma.c | 83 +++++++++++++++++++++++++++++---------------------------
 mm/vma.h |  6 ++++
 2 files changed, 49 insertions(+), 40 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index f24b52a87458..6d042cd46cdb 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -103,7 +103,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 	vms->unlock = unlock;
 	vms->uf = uf;
 	vms->vma_count = 0;
-	vms->nr_pages = vms->locked_vm = 0;
+	vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
+	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
 }
 
 /*
@@ -299,30 +300,6 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	return __split_vma(vmi, vma, addr, new_below);
 }
 
-/*
- * Ok - we have the memory areas we should free on a maple tree so release them,
- * and do the vma updates.
- *
- * Called with the mm semaphore held.
- */
-static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
-{
-	unsigned long nr_accounted = 0;
-	struct vm_area_struct *vma;
-
-	/* Update high watermark before we lower total_vm */
-	update_hiwater_vm(mm);
-	mas_for_each(mas, vma, ULONG_MAX) {
-		long nrpages = vma_pages(vma);
-
-		if (vma->vm_flags & VM_ACCOUNT)
-			nr_accounted += nrpages;
-		vm_stat_account(mm, vma->vm_flags, -nrpages);
-		remove_vma(vma, false);
-	}
-	vm_unacct_memory(nr_accounted);
-}
-
 /*
  * init_vma_prep() - Initializer wrapper for vma_prepare struct
  * @vp: The vma_prepare struct
@@ -722,7 +699,7 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
 static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach)
 {
-	struct vm_area_struct *prev, *next;
+	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 
 	mm = vms->mm;
@@ -731,21 +708,31 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	if (vms->unlock)
 		mmap_write_downgrade(mm);
 
-	prev = vma_iter_prev_range(vms->vmi);
-	next = vma_next(vms->vmi);
-	if (next)
-		vma_iter_prev_range(vms->vmi);
-
 	/*
 	 * We can free page tables without write-locking mmap_lock because VMAs
 	 * were isolated before we downgraded mmap_lock.
 	 */
 	mas_set(mas_detach, 1);
-	unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
-		     vms->vma_count, !vms->unlock);
-	/* Statistics and freeing VMAs */
+	unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
+		     vms->start, vms->end, vms->vma_count, !vms->unlock);
+	/* Update high watermark before we lower total_vm */
+	update_hiwater_vm(mm);
+	/* Stat accounting */
+	WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) - vms->nr_pages);
+	/* Paranoid bookkeeping */
+	VM_WARN_ON(vms->exec_vm > mm->exec_vm);
+	VM_WARN_ON(vms->stack_vm > mm->stack_vm);
+	VM_WARN_ON(vms->data_vm > mm->data_vm);
+	mm->exec_vm -= vms->exec_vm;
+	mm->stack_vm -= vms->stack_vm;
+	mm->data_vm -= vms->data_vm;
+
+	/* Remove and clean up vmas */
 	mas_set(mas_detach, 0);
-	remove_mt(mm, mas_detach);
+	mas_for_each(mas_detach, vma, ULONG_MAX)
+		remove_vma(vma, false);
+
+	vm_unacct_memory(vms->nr_accounted);
 	validate_mm(mm);
 	if (vms->unlock)
 		mmap_read_unlock(mm);
@@ -798,18 +785,19 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		if (__split_vma(vms->vmi, vms->vma, vms->start, 1))
 			goto start_split_failed;
 	}
+	vms->prev = vma_prev(vms->vmi);
 
 	/*
 	 * Detach a range of VMAs from the mm. Using next as a temp variable as
 	 * it is always overwritten.
 	 */
-	next = vms->vma;
-	do {
+	for_each_vma_range(*(vms->vmi), next, vms->end) {
+		long nrpages;
+
 		if (!can_modify_vma(next)) {
 			error = -EPERM;
 			goto modify_vma_failed;
 		}
-
 		/* Does it split the end? */
 		if (next->vm_end > vms->end) {
 			if (__split_vma(vms->vmi, next, vms->end, 0))
@@ -821,8 +809,21 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 			goto munmap_gather_failed;
 
 		vma_mark_detached(next, true);
+		nrpages = vma_pages(next);
+
+		vms->nr_pages += nrpages;
 		if (next->vm_flags & VM_LOCKED)
-			vms->locked_vm += vma_pages(next);
+			vms->locked_vm += nrpages;
+
+		if (next->vm_flags & VM_ACCOUNT)
+			vms->nr_accounted += nrpages;
+
+		if (is_exec_mapping(next->vm_flags))
+			vms->exec_vm += nrpages;
+		else if (is_stack_mapping(next->vm_flags))
+			vms->stack_vm += nrpages;
+		else if (is_data_mapping(next->vm_flags))
+			vms->data_vm += nrpages;
 
 		if (unlikely(vms->uf)) {
 			/*
@@ -842,7 +843,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		BUG_ON(next->vm_start < vms->start);
 		BUG_ON(next->vm_start > vms->end);
 #endif
-	} for_each_vma_range(*(vms->vmi), next, vms->end);
+	}
+
+	vms->next = vma_next(vms->vmi);
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
diff --git a/mm/vma.h b/mm/vma.h
index cb67acf59012..cbf55e0e0c4f 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -33,12 +33,18 @@ struct vma_munmap_struct {
 	struct vma_iterator *vmi;
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;     /* The first vma to munmap */
+	struct vm_area_struct *prev;    /* vma before the munmap area */
+	struct vm_area_struct *next;    /* vma after the munmap area */
 	struct list_head *uf;           /* Userfaultfd list_head */
 	unsigned long start;            /* Aligned start addr (inclusive) */
 	unsigned long end;              /* Aligned end addr (exclusive) */
 	int vma_count;                  /* Number of vmas that will be removed */
 	unsigned long nr_pages;         /* Number of pages being removed */
 	unsigned long locked_vm;        /* Number of locked pages */
+	unsigned long nr_accounted;     /* Number of VM_ACCOUNT pages */
+	unsigned long exec_vm;
+	unsigned long stack_vm;
+	unsigned long data_vm;
 	bool unlock;                    /* Unlock after the munmap */
 };
 
-- 
2.43.0




* [PATCH v8 07/21] mm/vma: Extract validate_mm() from vma_complete()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (5 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 06/21] mm/vma: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 08/21] mm/vma: Inline munmap operation in mmap_region() Liam R. Howlett
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett,
	Liam R . Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

vma_complete() will need to be called at a time when it is unsafe to call
validate_mm().  Extract the call to all call sites now so that only one
location needs to be modified in the next change.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 1 +
 mm/vma.c  | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 548bc45a27bf..dce1cc74ecdb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1796,6 +1796,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		vma_iter_store(vmi, vma);
 
 		vma_complete(&vp, vmi, mm);
+		validate_mm(mm);
 		khugepaged_enter_vma(vma, flags);
 		goto out;
 	}
diff --git a/mm/vma.c b/mm/vma.c
index 6d042cd46cdb..4e08c1654bdd 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -269,6 +269,7 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 
 	/* vma_complete stores the new vma */
 	vma_complete(&vp, vmi, vma->vm_mm);
+	validate_mm(vma->vm_mm);
 
 	/* Success. */
 	if (new_below)
@@ -548,6 +549,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vma_iter_store(vmi, vma);
 
 	vma_complete(&vp, vmi, vma->vm_mm);
+	validate_mm(vma->vm_mm);
 	return 0;
 
 nomem:
@@ -589,6 +591,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vma_iter_clear(vmi);
 	vma_set_range(vma, start, end, pgoff);
 	vma_complete(&vp, vmi, vma->vm_mm);
+	validate_mm(vma->vm_mm);
 	return 0;
 }
 
@@ -668,7 +671,6 @@ void vma_complete(struct vma_prepare *vp,
 	}
 	if (vp->insert && vp->file)
 		uprobe_mmap(vp->insert);
-	validate_mm(mm);
 }
 
 /*
@@ -1197,6 +1199,7 @@ static struct vm_area_struct
 	}
 
 	vma_complete(&vp, vmi, mm);
+	validate_mm(mm);
 	khugepaged_enter_vma(res, vm_flags);
 	return res;
 
-- 
2.43.0




* [PATCH v8 08/21] mm/vma: Inline munmap operation in mmap_region()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (6 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 07/21] mm/vma: Extract validate_mm() from vma_complete() Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 09/21] mm/vma: Expand mmap_region() munmap call Liam R. Howlett
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

mmap_region() is already passed a sanitized addr and len, so change the
call from do_vmi_munmap() to do_vmi_align_munmap() and inline the other
checks.

The inlining of the function and checks is an intermediate step in the
series so future patches are easier to follow.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index dce1cc74ecdb..ec72f05b05f2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1388,12 +1388,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			return -ENOMEM;
 	}
 
-	/* Unmap any existing mapping in the area */
-	error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
-	if (error == -EPERM)
-		return error;
-	else if (error)
-		return -ENOMEM;
+	/* Find the first overlapping VMA */
+	vma = vma_find(&vmi, end);
+	if (vma) {
+		/* Unmap any existing mapping in the area */
+		error = do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false);
+		if (error)
+			return error;
+		vma = NULL;
+	}
 
 	/*
 	 * Private writable mapping: check memory availability
-- 
2.43.0




* [PATCH v8 09/21] mm/vma: Expand mmap_region() munmap call
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (7 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 08/21] mm/vma: Inline munmap operation in mmap_region() Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 10/21] mm/vma: Support vma == NULL in init_vma_munmap() Liam R. Howlett
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Open code the do_vmi_align_munmap() call so that it can be broken up
later in the series.

This requires exposing a few more vma operations.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 26 ++++++++++++++++++++++----
 mm/vma.c  | 31 ++-----------------------------
 mm/vma.h  | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+), 33 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index ec72f05b05f2..84cb4b1df4a2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1366,6 +1366,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct vm_area_struct *next, *prev, *merge;
 	pgoff_t pglen = len >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	struct vma_munmap_struct vms;
+	struct ma_state mas_detach;
+	struct maple_tree mt_detach;
 	unsigned long end = addr + len;
 	unsigned long merge_start = addr, merge_end = end;
 	bool writable_file_mapping = false;
@@ -1391,11 +1394,28 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	/* Find the first overlapping VMA */
 	vma = vma_find(&vmi, end);
 	if (vma) {
-		/* Unmap any existing mapping in the area */
-		error = do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false);
+		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+		mt_on_stack(mt_detach);
+		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
+		init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
+		/* Prepare to unmap any existing mapping in the area */
+		error = vms_gather_munmap_vmas(&vms, &mas_detach);
 		if (error)
 			return error;
+
+		/* Remove any existing mappings from the vma tree */
+		if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
+			return -ENOMEM;
+
+		/* Unmap any existing mapping in the area */
+		vms_complete_munmap_vmas(&vms, &mas_detach);
+		next = vms.next;
+		prev = vms.prev;
+		vma_prev(&vmi);
 		vma = NULL;
+	} else {
+		next = vma_next(&vmi);
+		prev = vma_prev(&vmi);
 	}
 
 	/*
@@ -1408,8 +1428,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vm_flags |= VM_ACCOUNT;
 	}
 
-	next = vma_next(&vmi);
-	prev = vma_prev(&vmi);
 	if (vm_flags & VM_SPECIAL) {
 		if (prev)
 			vma_iter_next_range(&vmi);
diff --git a/mm/vma.c b/mm/vma.c
index 4e08c1654bdd..fc425eb34bf7 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -80,33 +80,6 @@ static void init_multi_vma_prep(struct vma_prepare *vp,
 
 }
 
-/*
- * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
- * @vms: The vma munmap struct
- * @vmi: The vma iterator
- * @vma: The first vm_area_struct to munmap
- * @start: The aligned start address to munmap
- * @end: The aligned end address to munmap
- * @uf: The userfaultfd list_head
- * @unlock: Unlock after the operation.  Only unlocked on success
- */
-static inline void init_vma_munmap(struct vma_munmap_struct *vms,
-		struct vma_iterator *vmi, struct vm_area_struct *vma,
-		unsigned long start, unsigned long end, struct list_head *uf,
-		bool unlock)
-{
-	vms->vmi = vmi;
-	vms->vma = vma;
-	vms->mm = vma->vm_mm;
-	vms->start = start;
-	vms->end = end;
-	vms->unlock = unlock;
-	vms->uf = uf;
-	vms->vma_count = 0;
-	vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
-	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
-}
-
 /*
  * Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff)
  * in front of (at a lower virtual address and file offset than) the vma.
@@ -698,7 +671,7 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
  * used for the munmap() and may downgrade the lock - if requested.  Everything
  * needed to be done once the vma maple tree is updated.
  */
-static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
+void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach)
 {
 	struct vm_area_struct *vma;
@@ -752,7 +725,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
  *
  * Return: 0 on success, -EPERM on mseal vmas, -ENOMEM otherwise
  */
-static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
+int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach)
 {
 	struct vm_area_struct *next = NULL;
diff --git a/mm/vma.h b/mm/vma.h
index cbf55e0e0c4f..e78b24d1cf83 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -78,6 +78,39 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	       unsigned long start, unsigned long end, pgoff_t pgoff);
 
+/*
+ * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
+ * @vms: The vma munmap struct
+ * @vmi: The vma iterator
+ * @vma: The first vm_area_struct to munmap
+ * @start: The aligned start address to munmap
+ * @end: The aligned end address to munmap
+ * @uf: The userfaultfd list_head
+ * @unlock: Unlock after the operation.  Only unlocked on success
+ */
+static inline void init_vma_munmap(struct vma_munmap_struct *vms,
+		struct vma_iterator *vmi, struct vm_area_struct *vma,
+		unsigned long start, unsigned long end, struct list_head *uf,
+		bool unlock)
+{
+	vms->vmi = vmi;
+	vms->vma = vma;
+	vms->mm = vma->vm_mm;
+	vms->start = start;
+	vms->end = end;
+	vms->unlock = unlock;
+	vms->uf = uf;
+	vms->vma_count = 0;
+	vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
+	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
+}
+
+int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach);
+
+void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach);
+
 int
 do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		    struct mm_struct *mm, unsigned long start,
-- 
2.43.0




* [PATCH v8 10/21] mm/vma: Support vma == NULL in init_vma_munmap()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (8 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 09/21] mm/vma: Expand mmap_region() munmap call Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 11/21] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Adding support for a NULL vma means init_vma_munmap() can always be used to
initialize the vma_munmap_struct, which makes the later call to
vms_complete_munmap_vmas() less error-prone.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c |  2 +-
 mm/vma.h  | 11 ++++++++---
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 84cb4b1df4a2..ac348ae933ba 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1393,11 +1393,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	/* Find the first overlapping VMA */
 	vma = vma_find(&vmi, end);
+	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
 	if (vma) {
 		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
 		mt_on_stack(mt_detach);
 		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
-		init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
 		/* Prepare to unmap any existing mapping in the area */
 		error = vms_gather_munmap_vmas(&vms, &mas_detach);
 		if (error)
diff --git a/mm/vma.h b/mm/vma.h
index e78b24d1cf83..0e214bbf443e 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -95,9 +95,14 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 {
 	vms->vmi = vmi;
 	vms->vma = vma;
-	vms->mm = vma->vm_mm;
-	vms->start = start;
-	vms->end = end;
+	if (vma) {
+		vms->mm = vma->vm_mm;
+		vms->start = start;
+		vms->end = end;
+	} else {
+		vms->mm = NULL;
+		vms->start = vms->end = 0;
+	}
 	vms->unlock = unlock;
 	vms->uf = uf;
 	vms->vma_count = 0;
-- 
2.43.0




* [PATCH v8 11/21] mm/mmap: Reposition vma iterator in mmap_region()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (9 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 10/21] mm/vma: Support vma == NULL in init_vma_munmap() Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 12/21] mm/vma: Track start and end for munmap in vma_munmap_struct Liam R. Howlett
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Instead of moving (or leaving) the vma iterator pointing at the previous
vma, leave it pointing at the insert location.  Pointing the vma
iterator at the insert location allows for a cleaner walk of the vma
tree for MAP_FIXED and the no expansion cases.

The vma_prev() call in the case of merging the previous vma is
equivalent to vma_iter_prev_range(), since the vma iterator will be
pointing to the location just before the previous vma.

This change needs to export abort_munmap_vmas() from mm/vma.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 40 +++++++++++++++++++++++-----------------
 mm/vma.c  | 16 ----------------
 mm/vma.h  | 16 ++++++++++++++++
 3 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index ac348ae933ba..08cf9199f314 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1401,21 +1401,23 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		/* Prepare to unmap any existing mapping in the area */
 		error = vms_gather_munmap_vmas(&vms, &mas_detach);
 		if (error)
-			return error;
+			goto gather_failed;
 
 		/* Remove any existing mappings from the vma tree */
-		if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
-			return -ENOMEM;
+		error = vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL);
+		if (error)
+			goto clear_tree_failed;
 
 		/* Unmap any existing mapping in the area */
 		vms_complete_munmap_vmas(&vms, &mas_detach);
 		next = vms.next;
 		prev = vms.prev;
-		vma_prev(&vmi);
 		vma = NULL;
 	} else {
 		next = vma_next(&vmi);
 		prev = vma_prev(&vmi);
+		if (prev)
+			vma_iter_next_range(&vmi);
 	}
 
 	/*
@@ -1428,11 +1430,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vm_flags |= VM_ACCOUNT;
 	}
 
-	if (vm_flags & VM_SPECIAL) {
-		if (prev)
-			vma_iter_next_range(&vmi);
+	if (vm_flags & VM_SPECIAL)
 		goto cannot_expand;
-	}
 
 	/* Attempt to expand an old mapping */
 	/* Check next */
@@ -1453,19 +1452,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		merge_start = prev->vm_start;
 		vma = prev;
 		vm_pgoff = prev->vm_pgoff;
-	} else if (prev) {
-		vma_iter_next_range(&vmi);
+		vma_prev(&vmi); /* Equivalent to going to the previous range */
 	}
 
-	/* Actually expand, if possible */
-	if (vma &&
-	    !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
-		khugepaged_enter_vma(vma, vm_flags);
-		goto expanded;
+	if (vma) {
+		/* Actually expand, if possible */
+		if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
+			khugepaged_enter_vma(vma, vm_flags);
+			goto expanded;
+		}
+
+		/* If the expand fails, then reposition the vma iterator */
+		if (unlikely(vma == prev))
+			vma_iter_set(&vmi, addr);
 	}
 
-	if (vma == prev)
-		vma_iter_set(&vmi, addr);
 cannot_expand:
 
 	/*
@@ -1624,6 +1625,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 unacct_error:
 	if (charged)
 		vm_unacct_memory(charged);
+
+clear_tree_failed:
+	if (vms.vma_count)
+		abort_munmap_vmas(&mas_detach);
+gather_failed:
 	validate_mm(mm);
 	return error;
 }
diff --git a/mm/vma.c b/mm/vma.c
index fc425eb34bf7..5e0ed5d63877 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -646,22 +646,6 @@ void vma_complete(struct vma_prepare *vp,
 		uprobe_mmap(vp->insert);
 }
 
-/*
- * abort_munmap_vmas - Undo any munmap work and free resources
- *
- * Reattach any detached vmas and free up the maple tree used to track the vmas.
- */
-static inline void abort_munmap_vmas(struct ma_state *mas_detach)
-{
-	struct vm_area_struct *vma;
-
-	mas_set(mas_detach, 0);
-	mas_for_each(mas_detach, vma, ULONG_MAX)
-		vma_mark_detached(vma, false);
-
-	__mt_destroy(mas_detach->tree);
-}
-
 /*
  * vms_complete_munmap_vmas() - Finish the munmap() operation
  * @vms: The vma munmap struct
diff --git a/mm/vma.h b/mm/vma.h
index 0e214bbf443e..c85fc7c888a8 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -116,6 +116,22 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach);
 
+/*
+ * abort_munmap_vmas - Undo any munmap work and free resources
+ *
+ * Reattach any detached vmas and free up the maple tree used to track the vmas.
+ */
+static inline void abort_munmap_vmas(struct ma_state *mas_detach)
+{
+	struct vm_area_struct *vma;
+
+	mas_set(mas_detach, 0);
+	mas_for_each(mas_detach, vma, ULONG_MAX)
+		vma_mark_detached(vma, false);
+
+	__mt_destroy(mas_detach->tree);
+}
+
 int
 do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		    struct mm_struct *mm, unsigned long start,
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v8 12/21] mm/vma: Track start and end for munmap in vma_munmap_struct
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (10 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 11/21] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 13/21] mm: Clean up unmap_region() argument list Liam R. Howlett
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Set the start and end addresses for the munmap when the prev and next
vmas are gathered.  This is needed to avoid incorrect addresses being
used in vms_complete_munmap_vmas() if the prev/next vma is expanded.

Add a new helper, vms_complete_pte_clear(), which is needed later and
avoids growing the argument list of unmap_region() beyond the nine
arguments it already has.
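
Roughly (condensed from the hunks below, for illustration only), the
PTE unmap range now tracks the neighbouring vmas rather than the munmap
range itself:

	/* defaults, set in init_vma_munmap() */
	vms->unmap_start = FIRST_USER_ADDRESS;
	vms->unmap_end = USER_PGTABLES_CEILING;

	/* narrowed in vms_gather_munmap_vmas() when a neighbour exists */
	if (vms->prev)
		vms->unmap_start = vms->prev->vm_end;
	if (vms->next)
		vms->unmap_end = vms->next->vm_start;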

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/vma.c | 32 +++++++++++++++++++++++++-------
 mm/vma.h |  6 ++++++
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index 5e0ed5d63877..c71dda026191 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -646,6 +646,26 @@ void vma_complete(struct vma_prepare *vp,
 		uprobe_mmap(vp->insert);
 }
 
+static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach, bool mm_wr_locked)
+{
+	struct mmu_gather tlb;
+
+	/*
+	 * We can free page tables without write-locking mmap_lock because VMAs
+	 * were isolated before we downgraded mmap_lock.
+	 */
+	mas_set(mas_detach, 1);
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, vms->mm);
+	update_hiwater_rss(vms->mm);
+	unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked);
+	mas_set(mas_detach, 1);
+	/* start and end may be different if there is no prev or next vma. */
+	free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
+	tlb_finish_mmu(&tlb);
+}
+
 /*
  * vms_complete_munmap_vmas() - Finish the munmap() operation
  * @vms: The vma munmap struct
@@ -667,13 +687,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	if (vms->unlock)
 		mmap_write_downgrade(mm);
 
-	/*
-	 * We can free page tables without write-locking mmap_lock because VMAs
-	 * were isolated before we downgraded mmap_lock.
-	 */
-	mas_set(mas_detach, 1);
-	unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
-		     vms->start, vms->end, vms->vma_count, !vms->unlock);
+	vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
 	/* Update high watermark before we lower total_vm */
 	update_hiwater_vm(mm);
 	/* Stat accounting */
@@ -745,6 +759,8 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 			goto start_split_failed;
 	}
 	vms->prev = vma_prev(vms->vmi);
+	if (vms->prev)
+		vms->unmap_start = vms->prev->vm_end;
 
 	/*
 	 * Detach a range of VMAs from the mm. Using next as a temp variable as
@@ -805,6 +821,8 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 	}
 
 	vms->next = vma_next(vms->vmi);
+	if (vms->next)
+		vms->unmap_end = vms->next->vm_start;
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
diff --git a/mm/vma.h b/mm/vma.h
index c85fc7c888a8..8afaa661224d 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -38,6 +38,8 @@ struct vma_munmap_struct {
 	struct list_head *uf;           /* Userfaultfd list_head */
 	unsigned long start;            /* Aligned start addr (inclusive) */
 	unsigned long end;              /* Aligned end addr (exclusive) */
+	unsigned long unmap_start;      /* Unmap PTE start */
+	unsigned long unmap_end;        /* Unmap PTE end */
 	int vma_count;                  /* Number of vmas that will be removed */
 	unsigned long nr_pages;         /* Number of pages being removed */
 	unsigned long locked_vm;        /* Number of locked pages */
@@ -78,6 +80,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	       unsigned long start, unsigned long end, pgoff_t pgoff);
 
+#ifdef CONFIG_MMU
 /*
  * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
  * @vms: The vma munmap struct
@@ -108,7 +111,10 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 	vms->vma_count = 0;
 	vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
 	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
+	vms->unmap_start = FIRST_USER_ADDRESS;
+	vms->unmap_end = USER_PGTABLES_CEILING;
 }
+#endif
 
 int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach);
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v8 13/21] mm: Clean up unmap_region() argument list
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (11 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 12/21] mm/vma: Track start and end for munmap in vma_munmap_struct Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

With the only caller to unmap_region() being the error path of
mmap_region(), the argument list can be significantly reduced.
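
For reference, the call site changes from

	unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start,
		     vma->vm_end, vma->vm_end, true);

to

	unmap_region(&vmi.mas, vma, prev, next);

with the mm, the range, and mm_wr_locked now derived inside the helper.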

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c |  3 +--
 mm/vma.c  | 17 ++++++++---------
 mm/vma.h  |  6 ++----
 3 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 08cf9199f314..304dc085533a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1615,8 +1615,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 		vma_iter_set(&vmi, vma->vm_end);
 		/* Undo any partial mapping done by a device driver. */
-		unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start,
-			     vma->vm_end, vma->vm_end, true);
+		unmap_region(&vmi.mas, vma, prev, next);
 	}
 	if (writable_file_mapping)
 		mapping_unmap_writable(file->f_mapping);
diff --git a/mm/vma.c b/mm/vma.c
index c71dda026191..83c5c46c67b9 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -155,22 +155,21 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
  *
  * Called with the mm semaphore held.
  */
-void unmap_region(struct mm_struct *mm, struct ma_state *mas,
-		struct vm_area_struct *vma, struct vm_area_struct *prev,
-		struct vm_area_struct *next, unsigned long start,
-		unsigned long end, unsigned long tree_end, bool mm_wr_locked)
+void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
+		struct vm_area_struct *prev, struct vm_area_struct *next)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long mt_start = mas->index;
 
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm);
 	update_hiwater_rss(mm);
-	unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked);
-	mas_set(mas, mt_start);
+	unmap_vmas(&tlb, mas, vma, vma->vm_start, vma->vm_end, vma->vm_end,
+		   /* mm_wr_locked = */ true);
+	mas_set(mas, vma->vm_end);
 	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
-				 next ? next->vm_start : USER_PGTABLES_CEILING,
-				 mm_wr_locked);
+		      next ? next->vm_start : USER_PGTABLES_CEILING,
+		      /* mm_wr_locked = */ true);
 	tlb_finish_mmu(&tlb);
 }
 
diff --git a/mm/vma.h b/mm/vma.h
index 8afaa661224d..82ba48174341 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -149,10 +149,8 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 
 void remove_vma(struct vm_area_struct *vma, bool unreachable);
 
-void unmap_region(struct mm_struct *mm, struct ma_state *mas,
-		struct vm_area_struct *vma, struct vm_area_struct *prev,
-		struct vm_area_struct *next, unsigned long start,
-		unsigned long end, unsigned long tree_end, bool mm_wr_locked);
+void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
+		struct vm_area_struct *prev, struct vm_area_struct *next);
 
 /* Required by mmap_region(). */
 bool
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (12 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 13/21] mm: Clean up unmap_region() argument list Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-10-07 19:05   ` [BUG] page table UAF, " Jann Horn
  2024-08-30  4:00 ` [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure Liam R. Howlett
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Instead of zeroing the vma tree and then overwriting the area, let the
area be overwritten and then clean up the gathered vmas using
vms_complete_munmap_vmas().

To ensure locking is downgraded correctly, the mm is set regardless of
MAP_FIXED or not (NULL vma).

If a driver is mapping over an existing vma, then clear the ptes before
the call_mmap() invocation.  This is done using the vms_clean_up_area()
helper.  If there is a vm_ops->close(), it must also be called to ensure
any cleanup is done before mapping over the area.  This also means that
a call to vm_ops->open() has been added to the abort path of an unmap
operation, for now.

Since vm_ops->open() and vm_ops->close() do not always undo each other
(state cleaned up in ->close() may be lost forever), the code cannot be
left in this state, but that change has been isolated to another commit
to make this point very obvious for traceability.

Temporarily keep track of the number of pages that will be removed and
reduce the charged amount.

This also drops the validate_mm() call in the vma_expand() function.
It is necessary to drop the validation because it would fail: the mm
map_count would be incorrect during a vma expansion, prior to the
cleanup done by vms_complete_munmap_vmas().

Clean up the error handling of vms_gather_munmap_vmas() by calling the
verification within the function.
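
A simplified sketch of the resulting MAP_FIXED flow in mmap_region()
(condensed from the hunks below, error handling omitted):

	error = vms_gather_munmap_vmas(&vms, &mas_detach);
					/* split and track, tree untouched */
	...
	if (file) {
		/* clear ptes and call ->close() before call_mmap() */
		vms_clean_up_area(&vms, &mas_detach, true);
		error = call_mmap(file, vma);
	}
	...
	/* the new vma has now been written over the old range */
	vms_complete_munmap_vmas(&vms, &mas_detach);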

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 57 ++++++++++++++++++++++++++-----------------------------
 mm/vma.c  | 54 ++++++++++++++++++++++++++++++++++++++++------------
 mm/vma.h  | 22 +++++++++++++++------
 3 files changed, 85 insertions(+), 48 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 304dc085533a..405e0432c78e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1373,23 +1373,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long merge_start = addr, merge_end = end;
 	bool writable_file_mapping = false;
 	pgoff_t vm_pgoff;
-	int error;
+	int error = -ENOMEM;
 	VMA_ITERATOR(vmi, mm, addr);
+	unsigned long nr_pages, nr_accounted;
 
-	/* Check against address space limit. */
-	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
-		unsigned long nr_pages;
+	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
 
-		/*
-		 * MAP_FIXED may remove pages of mappings that intersects with
-		 * requested mapping. Account for the pages it would unmap.
-		 */
-		nr_pages = count_vma_pages_range(mm, addr, end);
-
-		if (!may_expand_vm(mm, vm_flags,
-					(len >> PAGE_SHIFT) - nr_pages))
-			return -ENOMEM;
-	}
+	/*
+	 * Check against address space limit.
+	 * MAP_FIXED may remove pages of mappings that intersects with requested
+	 * mapping. Account for the pages it would unmap.
+	 */
+	if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
+		return -ENOMEM;
 
 	/* Find the first overlapping VMA */
 	vma = vma_find(&vmi, end);
@@ -1403,13 +1399,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		if (error)
 			goto gather_failed;
 
-		/* Remove any existing mappings from the vma tree */
-		error = vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL);
-		if (error)
-			goto clear_tree_failed;
-
-		/* Unmap any existing mapping in the area */
-		vms_complete_munmap_vmas(&vms, &mas_detach);
 		next = vms.next;
 		prev = vms.prev;
 		vma = NULL;
@@ -1425,8 +1414,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 */
 	if (accountable_mapping(file, vm_flags)) {
 		charged = len >> PAGE_SHIFT;
+		charged -= nr_accounted;
 		if (security_vm_enough_memory_mm(mm, charged))
-			return -ENOMEM;
+			goto abort_munmap;
+		vms.nr_accounted = 0;
 		vm_flags |= VM_ACCOUNT;
 	}
 
@@ -1475,10 +1466,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 * not unmapped, but the maps are removed from the list.
 	 */
 	vma = vm_area_alloc(mm);
-	if (!vma) {
-		error = -ENOMEM;
+	if (!vma)
 		goto unacct_error;
-	}
 
 	vma_iter_config(&vmi, addr, end);
 	vma_set_range(vma, addr, end, pgoff);
@@ -1487,6 +1476,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	if (file) {
 		vma->vm_file = get_file(file);
+		/*
+		 * call_mmap() may map PTE, so ensure there are no existing PTEs
+		 * call the vm_ops close function if one exists.
+		 */
+		vms_clean_up_area(&vms, &mas_detach, true);
 		error = call_mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
@@ -1577,6 +1571,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 expanded:
 	perf_event_mmap(vma);
 
+	/* Unmap any existing mapping in the area */
+	vms_complete_munmap_vmas(&vms, &mas_detach);
+
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
@@ -1605,7 +1602,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	return addr;
 
 close_and_free_vma:
-	if (file && vma->vm_ops && vma->vm_ops->close)
+	if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
 
 	if (file || vma->vm_file) {
@@ -1625,9 +1622,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	if (charged)
 		vm_unacct_memory(charged);
 
-clear_tree_failed:
-	if (vms.vma_count)
-		abort_munmap_vmas(&mas_detach);
+abort_munmap:
+	if (vms.nr_pages)
+		abort_munmap_vmas(&mas_detach, vms.closed_vm_ops);
 gather_failed:
 	validate_mm(mm);
 	return error;
@@ -1960,7 +1957,7 @@ void exit_mmap(struct mm_struct *mm)
 	do {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
-		remove_vma(vma, true);
+		remove_vma(vma, /* unreachable = */ true, /* closed = */ false);
 		count++;
 		cond_resched();
 		vma = vma_next(&vmi);
diff --git a/mm/vma.c b/mm/vma.c
index 83c5c46c67b9..648c58da8ad4 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -136,10 +136,10 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
 /*
  * Close a vm structure and free it.
  */
-void remove_vma(struct vm_area_struct *vma, bool unreachable)
+void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed)
 {
 	might_sleep();
-	if (vma->vm_ops && vma->vm_ops->close)
+	if (!closed && vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
 	if (vma->vm_file)
 		fput(vma->vm_file);
@@ -521,7 +521,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vma_iter_store(vmi, vma);
 
 	vma_complete(&vp, vmi, vma->vm_mm);
-	validate_mm(vma->vm_mm);
 	return 0;
 
 nomem:
@@ -645,11 +644,14 @@ void vma_complete(struct vma_prepare *vp,
 		uprobe_mmap(vp->insert);
 }
 
-static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
-		struct ma_state *mas_detach, bool mm_wr_locked)
+static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
+		    struct ma_state *mas_detach, bool mm_wr_locked)
 {
 	struct mmu_gather tlb;
 
+	if (!vms->clear_ptes) /* Nothing to do */
+		return;
+
 	/*
 	 * We can free page tables without write-locking mmap_lock because VMAs
 	 * were isolated before we downgraded mmap_lock.
@@ -658,11 +660,31 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, vms->mm);
 	update_hiwater_rss(vms->mm);
-	unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked);
+	unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end,
+		   vms->vma_count, mm_wr_locked);
+
 	mas_set(mas_detach, 1);
 	/* start and end may be different if there is no prev or next vma. */
-	free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
+	free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start,
+		      vms->unmap_end, mm_wr_locked);
 	tlb_finish_mmu(&tlb);
+	vms->clear_ptes = false;
+}
+
+void vms_clean_up_area(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach, bool mm_wr_locked)
+{
+	struct vm_area_struct *vma;
+
+	if (!vms->nr_pages)
+		return;
+
+	vms_clear_ptes(vms, mas_detach, mm_wr_locked);
+	mas_set(mas_detach, 0);
+	mas_for_each(mas_detach, vma, ULONG_MAX)
+		if (vma->vm_ops && vma->vm_ops->close)
+			vma->vm_ops->close(vma);
+	vms->closed_vm_ops = true;
 }
 
 /*
@@ -686,7 +708,10 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	if (vms->unlock)
 		mmap_write_downgrade(mm);
 
-	vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
+	if (!vms->nr_pages)
+		return;
+
+	vms_clear_ptes(vms, mas_detach, !vms->unlock);
 	/* Update high watermark before we lower total_vm */
 	update_hiwater_vm(mm);
 	/* Stat accounting */
@@ -702,7 +727,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	/* Remove and clean up vmas */
 	mas_set(mas_detach, 0);
 	mas_for_each(mas_detach, vma, ULONG_MAX)
-		remove_vma(vma, false);
+		remove_vma(vma, /* = */ false, vms->closed_vm_ops);
 
 	vm_unacct_memory(vms->nr_accounted);
 	validate_mm(mm);
@@ -846,13 +871,14 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 	while (vma_iter_addr(vms->vmi) > vms->start)
 		vma_iter_prev_range(vms->vmi);
 
+	vms->clear_ptes = true;
 	return 0;
 
 userfaultfd_error:
 munmap_gather_failed:
 end_split_failed:
 modify_vma_failed:
-	abort_munmap_vmas(mas_detach);
+	abort_munmap_vmas(mas_detach, /* closed = */ false);
 start_split_failed:
 map_count_exceeded:
 	return error;
@@ -897,7 +923,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	return 0;
 
 clear_tree_failed:
-	abort_munmap_vmas(&mas_detach);
+	abort_munmap_vmas(&mas_detach, /* closed = */ false);
 gather_failed:
 	validate_mm(mm);
 	return error;
@@ -1615,17 +1641,21 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
 }
 
 unsigned long count_vma_pages_range(struct mm_struct *mm,
-				    unsigned long addr, unsigned long end)
+		unsigned long addr, unsigned long end,
+		unsigned long *nr_accounted)
 {
 	VMA_ITERATOR(vmi, mm, addr);
 	struct vm_area_struct *vma;
 	unsigned long nr_pages = 0;
 
+	*nr_accounted = 0;
 	for_each_vma_range(vmi, vma, end) {
 		unsigned long vm_start = max(addr, vma->vm_start);
 		unsigned long vm_end = min(end, vma->vm_end);
 
 		nr_pages += PHYS_PFN(vm_end - vm_start);
+		if (vma->vm_flags & VM_ACCOUNT)
+			*nr_accounted += PHYS_PFN(vm_end - vm_start);
 	}
 
 	return nr_pages;
diff --git a/mm/vma.h b/mm/vma.h
index 82ba48174341..64b44f5a0a11 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -48,6 +48,8 @@ struct vma_munmap_struct {
 	unsigned long stack_vm;
 	unsigned long data_vm;
 	bool unlock;                    /* Unlock after the munmap */
+	bool clear_ptes;                /* If there are outstanding PTE to be cleared */
+	bool closed_vm_ops;		/* call_mmap() was encountered, so vmas may be closed */
 };
 
 #ifdef CONFIG_DEBUG_VM_MAPLE_TREE
@@ -96,14 +98,13 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 		unsigned long start, unsigned long end, struct list_head *uf,
 		bool unlock)
 {
+	vms->mm = current->mm;
 	vms->vmi = vmi;
 	vms->vma = vma;
 	if (vma) {
-		vms->mm = vma->vm_mm;
 		vms->start = start;
 		vms->end = end;
 	} else {
-		vms->mm = NULL;
 		vms->start = vms->end = 0;
 	}
 	vms->unlock = unlock;
@@ -113,6 +114,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
 	vms->unmap_start = FIRST_USER_ADDRESS;
 	vms->unmap_end = USER_PGTABLES_CEILING;
+	vms->clear_ptes = false;
+	vms->closed_vm_ops = false;
 }
 #endif
 
@@ -122,18 +125,24 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach);
 
+void vms_clean_up_area(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach, bool mm_wr_locked);
+
 /*
  * abort_munmap_vmas - Undo any munmap work and free resources
  *
  * Reattach any detached vmas and free up the maple tree used to track the vmas.
  */
-static inline void abort_munmap_vmas(struct ma_state *mas_detach)
+static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
 {
 	struct vm_area_struct *vma;
 
 	mas_set(mas_detach, 0);
-	mas_for_each(mas_detach, vma, ULONG_MAX)
+	mas_for_each(mas_detach, vma, ULONG_MAX) {
 		vma_mark_detached(vma, false);
+		if (closed && vma->vm_ops && vma->vm_ops->open)
+			vma->vm_ops->open(vma);
+	}
 
 	__mt_destroy(mas_detach->tree);
 }
@@ -147,7 +156,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 		  unsigned long start, size_t len, struct list_head *uf,
 		  bool unlock);
 
-void remove_vma(struct vm_area_struct *vma, bool unreachable);
+void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed);
 
 void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
 		struct vm_area_struct *prev, struct vm_area_struct *next);
@@ -261,7 +270,8 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
 int mm_take_all_locks(struct mm_struct *mm);
 void mm_drop_all_locks(struct mm_struct *mm);
 unsigned long count_vma_pages_range(struct mm_struct *mm,
-				    unsigned long addr, unsigned long end);
+				    unsigned long addr, unsigned long end,
+				    unsigned long *nr_accounted);
 
 static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
 {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (13 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-09-03  3:07   ` Pengfei Xu
  2024-08-30  4:00 ` [PATCH v8 16/21] mm/mmap: Use PHYS_PFN in mmap_region() Liam R. Howlett
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Prior to call_mmap(), the vmas that will be replaced need to clear the
way for what may happen during call_mmap().  This clean-up work includes
clearing the ptes and calling the vm_ops->close() functions.  Some users
do more setup than can be restored by calling vm_ops->open(), so it is
safer to store the gap in the vma tree in these cases.

That is to say that the failure scenario that existed before the
MAP_FIXED gap exposure is restored as it is safer than trying to undo a
partial mapping.

Since abort_munmap_vmas() is only reattaching vmas with this change, the
function is renamed to reattach_vmas().

There is also a secondary failure that may occur if there is not enough
memory to store the gap.  In this case, the vmas are reattached and the
resources are freed.  If the system cannot complete the call_mmap() and
also fails the GFP_KERNEL allocation, then a warning about the failure
is printed once.
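
The resulting abort logic is roughly (condensed from the
vms_abort_munmap_vmas() hunk below):

	if (vms->clear_ptes)	/* ptes untouched, nothing destructive ran */
		return reattach_vmas(mas_detach);

	/* ptes are gone and ->close() may have run; keep the gap instead */
	mas_set_range(mas, vms->start, vms->end);
	if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) {
		pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
			     current->comm, current->pid);
		reattach_vmas(mas_detach);	/* last resort */
	} else {
		vms_complete_munmap_vmas(vms, mas_detach);
	}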

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c |  3 +--
 mm/vma.c  |  4 +--
 mm/vma.h  | 80 ++++++++++++++++++++++++++++++++++++++++---------------
 3 files changed, 61 insertions(+), 26 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 405e0432c78e..e1e5c78b6c3c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1623,8 +1623,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vm_unacct_memory(charged);
 
 abort_munmap:
-	if (vms.nr_pages)
-		abort_munmap_vmas(&mas_detach, vms.closed_vm_ops);
+	vms_abort_munmap_vmas(&vms, &mas_detach);
 gather_failed:
 	validate_mm(mm);
 	return error;
diff --git a/mm/vma.c b/mm/vma.c
index 648c58da8ad4..d2d71d659d1e 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -878,7 +878,7 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 munmap_gather_failed:
 end_split_failed:
 modify_vma_failed:
-	abort_munmap_vmas(mas_detach, /* closed = */ false);
+	reattach_vmas(mas_detach);
 start_split_failed:
 map_count_exceeded:
 	return error;
@@ -923,7 +923,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	return 0;
 
 clear_tree_failed:
-	abort_munmap_vmas(&mas_detach, /* closed = */ false);
+	reattach_vmas(&mas_detach);
 gather_failed:
 	validate_mm(mm);
 	return error;
diff --git a/mm/vma.h b/mm/vma.h
index 64b44f5a0a11..b2306d13d456 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -82,6 +82,22 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	       unsigned long start, unsigned long end, pgoff_t pgoff);
 
+static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
+			struct vm_area_struct *vma, gfp_t gfp)
+
+{
+	if (vmi->mas.status != ma_start &&
+	    ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
+		vma_iter_invalidate(vmi);
+
+	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
+	mas_store_gfp(&vmi->mas, vma, gfp);
+	if (unlikely(mas_is_err(&vmi->mas)))
+		return -ENOMEM;
+
+	return 0;
+}
+
 #ifdef CONFIG_MMU
 /*
  * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
@@ -129,24 +145,60 @@ void vms_clean_up_area(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach, bool mm_wr_locked);
 
 /*
- * abort_munmap_vmas - Undo any munmap work and free resources
+ * reattach_vmas() - Undo any munmap work and free resources
+ * @mas_detach: The maple state with the detached maple tree
  *
  * Reattach any detached vmas and free up the maple tree used to track the vmas.
  */
-static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
+static inline void reattach_vmas(struct ma_state *mas_detach)
 {
 	struct vm_area_struct *vma;
 
 	mas_set(mas_detach, 0);
-	mas_for_each(mas_detach, vma, ULONG_MAX) {
+	mas_for_each(mas_detach, vma, ULONG_MAX)
 		vma_mark_detached(vma, false);
-		if (closed && vma->vm_ops && vma->vm_ops->open)
-			vma->vm_ops->open(vma);
-	}
 
 	__mt_destroy(mas_detach->tree);
 }
 
+/*
+ * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap()
+ * operation.
+ * @vms: The vma unmap structure
+ * @mas_detach: The maple state with the detached maple tree
+ *
+ * Reattach any detached vmas, free up the maple tree used to track the vmas.
+ * If that's not possible because the ptes are cleared (and vm_ops->closed() may
+ * have been called), then a NULL is written over the vmas and the vmas are
+ * removed (munmap() completed).
+ */
+static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach)
+{
+	struct ma_state *mas = &vms->vmi->mas;
+	if (!vms->nr_pages)
+		return;
+
+	if (vms->clear_ptes)
+		return reattach_vmas(mas_detach);
+
+	/*
+	 * Aborting cannot just call the vm_ops open() because they are often
+	 * not symmetrical and state data has been lost.  Resort to the old
+	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
+	 */
+	mas_set_range(mas, vms->start, vms->end);
+	if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) {
+		pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
+			     current->comm, current->pid);
+		/* Leaving vmas detached and in-tree may hamper recovery */
+		reattach_vmas(mas_detach);
+	} else {
+		/* Clean up the insertion of the unfortunate gap */
+		vms_complete_munmap_vmas(vms, mas_detach);
+	}
+}
+
 int
 do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		    struct mm_struct *mm, unsigned long start,
@@ -299,22 +351,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
 	return mas_prev(&vmi->mas, min);
 }
 
-static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
-			struct vm_area_struct *vma, gfp_t gfp)
-{
-	if (vmi->mas.status != ma_start &&
-	    ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
-		vma_iter_invalidate(vmi);
-
-	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
-	mas_store_gfp(&vmi->mas, vma, gfp);
-	if (unlikely(mas_is_err(&vmi->mas)))
-		return -ENOMEM;
-
-	return 0;
-}
-
-
 /*
  * These three helpers classifies VMAs for virtual memory accounting.
  */
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v8 16/21] mm/mmap: Use PHYS_PFN in mmap_region()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (14 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 17/21] mm/mmap: Use vms accounted pages " Liam R. Howlett
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Instead of shifting the length by PAGE_SHIFT, use PHYS_PFN.  Also use
the existing pglen local variable consistently instead of only in some
places.
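
For clarity (an illustration, not part of the patch), PHYS_PFN() is the
same page-shift conversion, so the two forms are equivalent:

	pgoff_t pglen = PHYS_PFN(len);	/* same value as len >> PAGE_SHIFT */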

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index e1e5c78b6c3c..cd09dd164e85 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1364,7 +1364,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
 	struct vm_area_struct *next, *prev, *merge;
-	pgoff_t pglen = len >> PAGE_SHIFT;
+	pgoff_t pglen = PHYS_PFN(len);
 	unsigned long charged = 0;
 	struct vma_munmap_struct vms;
 	struct ma_state mas_detach;
@@ -1384,7 +1384,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 * MAP_FIXED may remove pages of mappings that intersects with requested
 	 * mapping. Account for the pages it would unmap.
 	 */
-	if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
+	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
 		return -ENOMEM;
 
 	/* Find the first overlapping VMA */
@@ -1413,7 +1413,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 * Private writable mapping: check memory availability
 	 */
 	if (accountable_mapping(file, vm_flags)) {
-		charged = len >> PAGE_SHIFT;
+		charged = pglen;
 		charged -= nr_accounted;
 		if (security_vm_enough_memory_mm(mm, charged))
 			goto abort_munmap;
@@ -1574,14 +1574,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	/* Unmap any existing mapping in the area */
 	vms_complete_munmap_vmas(&vms, &mas_detach);
 
-	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
+	vm_stat_account(mm, vm_flags, pglen);
 	if (vm_flags & VM_LOCKED) {
 		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
 					is_vm_hugetlb_page(vma) ||
 					vma == get_gate_vma(current->mm))
 			vm_flags_clear(vma, VM_LOCKED_MASK);
 		else
-			mm->locked_vm += (len >> PAGE_SHIFT);
+			mm->locked_vm += pglen;
 	}
 
 	if (file)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v8 17/21] mm/mmap: Use vms accounted pages in mmap_region()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (15 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 16/21] mm/mmap: Use PHYS_PFN in mmap_region() Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 18/21] ipc/shm, mm: Drop do_vma_munmap() Liam R. Howlett
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett,
	linux-security-module, Paul Moore

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Change from the nr_accounted variable to vms.nr_accounted for the
charged pages calculation.  This is necessary for a future patch.

This also avoids checking security_vm_enough_memory_mm() if the amount
of memory won't change.
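
The resulting accounting check is roughly (from the hunk below):

	charged = pglen;
	charged -= vms.nr_accounted;	/* already charged by the vmas being replaced */
	if (charged && security_vm_enough_memory_mm(mm, charged))
		goto abort_munmap;

	vms.nr_accounted = 0;
	vm_flags |= VM_ACCOUNT;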

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Kees Cook <kees@kernel.org>
Cc: linux-security-module@vger.kernel.org
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Paul Moore <paul@paul-moore.com> (LSM)
---
 mm/mmap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index cd09dd164e85..4faadc54e89d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1414,9 +1414,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 */
 	if (accountable_mapping(file, vm_flags)) {
 		charged = pglen;
-		charged -= nr_accounted;
-		if (security_vm_enough_memory_mm(mm, charged))
+		charged -= vms.nr_accounted;
+		if (charged && security_vm_enough_memory_mm(mm, charged))
 			goto abort_munmap;
+
 		vms.nr_accounted = 0;
 		vm_flags |= VM_ACCOUNT;
 	}
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v8 18/21] ipc/shm, mm: Drop do_vma_munmap()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (16 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 17/21] mm/mmap: Use vms accounted pages " Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:00 ` [PATCH v8 19/21] mm: Move may_expand_vm() check in mmap_region() Liam R. Howlett
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The do_vma_munmap() wrapper existed for callers that didn't have a vma
iterator and needed to check the vma mseal status prior to calling the
underlying munmap().  All callers now use a vma iterator, and since the
mseal check has been moved to do_vmi_align_munmap() and the vmas are
aligned, do_vmi_align_munmap() can just be called directly instead.

do_vmi_align_munmap() can no longer be static as ipc/shm is using it and
it is exported via the mm.h header.
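
The conversion at the call sites is mechanical; for example, in ipc/shm
(taken from the hunks below, where mm == vma->vm_mm):

	/* before */
	do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end, NULL, false);

	/* after */
	do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start, vma->vm_end,
			    NULL, false);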

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mm.h |  6 +++---
 ipc/shm.c          |  8 ++++----
 mm/mmap.c          | 33 ++++++---------------------------
 mm/vma.c           | 12 ++++++------
 mm/vma.h           |  4 +---
 5 files changed, 20 insertions(+), 43 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f6053d7ea6f..b0ff06d18c71 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3287,14 +3287,14 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
 extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 			 unsigned long start, size_t len, struct list_head *uf,
 			 bool unlock);
+int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+		    struct mm_struct *mm, unsigned long start,
+		    unsigned long end, struct list_head *uf, bool unlock);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
 extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
 
 #ifdef CONFIG_MMU
-extern int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
-			 unsigned long start, unsigned long end,
-			 struct list_head *uf, bool unlock);
 extern int __mm_populate(unsigned long addr, unsigned long len,
 			 int ignore_errors);
 static inline void mm_populate(unsigned long addr, unsigned long len)
diff --git a/ipc/shm.c b/ipc/shm.c
index 3e3071252dac..99564c870084 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1778,8 +1778,8 @@ long ksys_shmdt(char __user *shmaddr)
 			 */
 			file = vma->vm_file;
 			size = i_size_read(file_inode(vma->vm_file));
-			do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end,
-				      NULL, false);
+			do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start,
+					    vma->vm_end, NULL, false);
 			/*
 			 * We discovered the size of the shm segment, so
 			 * break out of here and fall through to the next
@@ -1803,8 +1803,8 @@ long ksys_shmdt(char __user *shmaddr)
 		if ((vma->vm_ops == &shm_vm_ops) &&
 		    ((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
 		    (vma->vm_file == file)) {
-			do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end,
-				      NULL, false);
+			do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start,
+					    vma->vm_end, NULL, false);
 		}
 
 		vma = vma_next(&vmi);
diff --git a/mm/mmap.c b/mm/mmap.c
index 4faadc54e89d..6485f2300692 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -169,11 +169,12 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 			goto out; /* mapping intersects with an existing non-brk vma. */
 		/*
 		 * mm->brk must be protected by write mmap_lock.
-		 * do_vma_munmap() will drop the lock on success,  so update it
-		 * before calling do_vma_munmap().
+		 * do_vmi_align_munmap() will drop the lock on success,  so
+		 * update it before calling do_vma_munmap().
 		 */
 		mm->brk = brk;
-		if (do_vma_munmap(&vmi, brkvma, newbrk, oldbrk, &uf, true))
+		if (do_vmi_align_munmap(&vmi, brkvma, mm, newbrk, oldbrk, &uf,
+					/* unlock = */ true))
 			goto out;
 
 		goto success_unlocked;
@@ -1479,9 +1480,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vma->vm_file = get_file(file);
 		/*
 		 * call_mmap() may map PTE, so ensure there are no existing PTEs
-		 * call the vm_ops close function if one exists.
+		 * and call the vm_ops close function if one exists.
 		 */
-		vms_clean_up_area(&vms, &mas_detach, true);
+		vms_clean_up_area(&vms, &mas_detach);
 		error = call_mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
@@ -1744,28 +1745,6 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 	return ret;
 }
 
-/*
- * do_vma_munmap() - Unmap a full or partial vma.
- * @vmi: The vma iterator pointing at the vma
- * @vma: The first vma to be munmapped
- * @start: the start of the address to unmap
- * @end: The end of the address to unmap
- * @uf: The userfaultfd list_head
- * @unlock: Drop the lock on success
- *
- * unmaps a VMA mapping when the vma iterator is already in position.
- * Does not handle alignment.
- *
- * Return: 0 on success drops the lock of so directed, error on failure and will
- * still hold the lock.
- */
-int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
-		unsigned long start, unsigned long end, struct list_head *uf,
-		bool unlock)
-{
-	return do_vmi_align_munmap(vmi, vma, vma->vm_mm, start, end, uf, unlock);
-}
-
 /*
  * do_brk_flags() - Increase the brk vma if the flags match.
  * @vmi: The vma iterator
diff --git a/mm/vma.c b/mm/vma.c
index d2d71d659d1e..713b2196d351 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -658,8 +658,8 @@ static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
 	 */
 	mas_set(mas_detach, 1);
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, vms->mm);
-	update_hiwater_rss(vms->mm);
+	tlb_gather_mmu(&tlb, vms->vma->vm_mm);
+	update_hiwater_rss(vms->vma->vm_mm);
 	unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end,
 		   vms->vma_count, mm_wr_locked);
 
@@ -672,14 +672,14 @@ static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
 }
 
 void vms_clean_up_area(struct vma_munmap_struct *vms,
-		struct ma_state *mas_detach, bool mm_wr_locked)
+		struct ma_state *mas_detach)
 {
 	struct vm_area_struct *vma;
 
 	if (!vms->nr_pages)
 		return;
 
-	vms_clear_ptes(vms, mas_detach, mm_wr_locked);
+	vms_clear_ptes(vms, mas_detach, true);
 	mas_set(mas_detach, 0);
 	mas_for_each(mas_detach, vma, ULONG_MAX)
 		if (vma->vm_ops && vma->vm_ops->close)
@@ -702,7 +702,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 
-	mm = vms->mm;
+	mm = current->mm;
 	mm->map_count -= vms->vma_count;
 	mm->locked_vm -= vms->locked_vm;
 	if (vms->unlock)
@@ -770,7 +770,7 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		 * its limit temporarily, to help free resources as expected.
 		 */
 		if (vms->end < vms->vma->vm_end &&
-		    vms->mm->map_count >= sysctl_max_map_count)
+		    vms->vma->vm_mm->map_count >= sysctl_max_map_count)
 			goto map_count_exceeded;
 
 		/* Don't bother splitting the VMA if we can't unmap it anyway */
diff --git a/mm/vma.h b/mm/vma.h
index b2306d13d456..eea1b5e9c15a 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -31,7 +31,6 @@ struct unlink_vma_file_batch {
  */
 struct vma_munmap_struct {
 	struct vma_iterator *vmi;
-	struct mm_struct *mm;
 	struct vm_area_struct *vma;     /* The first vma to munmap */
 	struct vm_area_struct *prev;    /* vma before the munmap area */
 	struct vm_area_struct *next;    /* vma after the munmap area */
@@ -114,7 +113,6 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 		unsigned long start, unsigned long end, struct list_head *uf,
 		bool unlock)
 {
-	vms->mm = current->mm;
 	vms->vmi = vmi;
 	vms->vma = vma;
 	if (vma) {
@@ -142,7 +140,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach);
 
 void vms_clean_up_area(struct vma_munmap_struct *vms,
-		struct ma_state *mas_detach, bool mm_wr_locked);
+		struct ma_state *mas_detach);
 
 /*
  * reattach_vmas() - Undo any munmap work and free resources
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v8 19/21] mm: Move may_expand_vm() check in mmap_region()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (17 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 18/21] ipc/shm, mm: Drop do_vma_munmap() Liam R. Howlett
@ 2024-08-30  4:00 ` Liam R. Howlett
  2024-08-30  4:01 ` [PATCH v8 20/21] mm/vma: Drop incorrect comment from vms_gather_munmap_vmas() Liam R. Howlett
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The may_expand_vm() check requires the count of the pages within the
munmap range.  Since this count is needed for accounting and is
obtained later anyway, reordering the may_expand_vm() check to later in
the call stack, after the vma munmap struct (vms) is initialised and
the gather stage has potentially run, allows for a single loop over the
vmas.  The gather stage does not commit any work, so everything can be
undone in the case of a failure.

The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
call, so use it instead of looping over the vmas twice.
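
The check now runs after the gather stage, roughly (from the
mmap_region() hunk below):

	/* vms.nr_pages counts the pages of any mappings being replaced */
	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
		goto abort_munmap;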

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 15 ++++-----------
 mm/vma.c  | 21 ---------------------
 mm/vma.h  |  3 ---
 3 files changed, 4 insertions(+), 35 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 6485f2300692..fc726c4e98be 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1376,17 +1376,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	pgoff_t vm_pgoff;
 	int error = -ENOMEM;
 	VMA_ITERATOR(vmi, mm, addr);
-	unsigned long nr_pages, nr_accounted;
-
-	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
-
-	/*
-	 * Check against address space limit.
-	 * MAP_FIXED may remove pages of mappings that intersects with requested
-	 * mapping. Account for the pages it would unmap.
-	 */
-	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
-		return -ENOMEM;
 
 	/* Find the first overlapping VMA */
 	vma = vma_find(&vmi, end);
@@ -1410,6 +1399,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			vma_iter_next_range(&vmi);
 	}
 
+	/* Check against address space limit. */
+	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
+		goto abort_munmap;
+
 	/*
 	 * Private writable mapping: check memory availability
 	 */
diff --git a/mm/vma.c b/mm/vma.c
index 713b2196d351..3350fe2088bc 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1640,27 +1640,6 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
 	return vma_fs_can_writeback(vma);
 }
 
-unsigned long count_vma_pages_range(struct mm_struct *mm,
-		unsigned long addr, unsigned long end,
-		unsigned long *nr_accounted)
-{
-	VMA_ITERATOR(vmi, mm, addr);
-	struct vm_area_struct *vma;
-	unsigned long nr_pages = 0;
-
-	*nr_accounted = 0;
-	for_each_vma_range(vmi, vma, end) {
-		unsigned long vm_start = max(addr, vma->vm_start);
-		unsigned long vm_end = min(end, vma->vm_end);
-
-		nr_pages += PHYS_PFN(vm_end - vm_start);
-		if (vma->vm_flags & VM_ACCOUNT)
-			*nr_accounted += PHYS_PFN(vm_end - vm_start);
-	}
-
-	return nr_pages;
-}
-
 static DEFINE_MUTEX(mm_all_locks_mutex);
 
 static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
diff --git a/mm/vma.h b/mm/vma.h
index eea1b5e9c15a..6900e95bcaac 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -319,9 +319,6 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
 
 int mm_take_all_locks(struct mm_struct *mm);
 void mm_drop_all_locks(struct mm_struct *mm);
-unsigned long count_vma_pages_range(struct mm_struct *mm,
-				    unsigned long addr, unsigned long end,
-				    unsigned long *nr_accounted);
 
 static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
 {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v8 20/21] mm/vma: Drop incorrect comment from vms_gather_munmap_vmas()
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (18 preceding siblings ...)
  2024-08-30  4:00 ` [PATCH v8 19/21] mm: Move may_expand_vm() check in mmap_region() Liam R. Howlett
@ 2024-08-30  4:01 ` Liam R. Howlett
  2024-08-30  4:01 ` [PATCH v8 21/21] mm/vma.h: Optimise vma_munmap_struct Liam R. Howlett
  2024-08-30 16:05 ` [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Jeff Xu
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The comment has been out of date since commit 6b73cff239e52 ("mm: change
munmap splitting order and move_vma()").  move_vma() has since been
altered to fix the fragile state of the accounting.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/vma.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index 3350fe2088bc..1736bb237b2c 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -755,13 +755,8 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 
 	/*
 	 * If we need to split any vma, do it now to save pain later.
-	 *
-	 * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
-	 * unmapped vm_area_struct will remain in use: so lower split_vma
-	 * places tmp vma above, and higher split_vma places tmp vma below.
+	 * Does it split the first one?
 	 */
-
-	/* Does it split the first one? */
 	if (vms->start > vms->vma->vm_start) {
 
 		/*
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v8 21/21] mm/vma.h: Optimise vma_munmap_struct
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (19 preceding siblings ...)
  2024-08-30  4:01 ` [PATCH v8 20/21] mm/vma: Drop incorrect comment from vms_gather_munmap_vmas() Liam R. Howlett
@ 2024-08-30  4:01 ` Liam R. Howlett
  2024-08-30 16:05 ` [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Jeff Xu
  21 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30  4:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The vma_munmap_struct has a 4 byte hole, which pushes the struct to
three cachelines.  Relocating the three booleans upwards allows the
struct to fit in two cachelines (as reported by pahole on amd64).

Before:
struct vma_munmap_struct {
        struct vma_iterator *      vmi;                  /*     0     8 */
        struct vm_area_struct *    vma;                  /*     8     8 */
        struct vm_area_struct *    prev;                 /*    16     8 */
        struct vm_area_struct *    next;                 /*    24     8 */
        struct list_head *         uf;                   /*    32     8 */
        long unsigned int          start;                /*    40     8 */
        long unsigned int          end;                  /*    48     8 */
        long unsigned int          unmap_start;          /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        long unsigned int          unmap_end;            /*    64     8 */
        int                        vma_count;            /*    72     4 */

        /* XXX 4 bytes hole, try to pack */

        long unsigned int          nr_pages;             /*    80     8 */
        long unsigned int          locked_vm;            /*    88     8 */
        long unsigned int          nr_accounted;         /*    96     8 */
        long unsigned int          exec_vm;              /*   104     8 */
        long unsigned int          stack_vm;             /*   112     8 */
        long unsigned int          data_vm;              /*   120     8 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        bool                       unlock;               /*   128     1 */
        bool                       clear_ptes;           /*   129     1 */
        bool                       closed_vm_ops;        /*   130     1 */

        /* size: 136, cachelines: 3, members: 19 */
        /* sum members: 127, holes: 1, sum holes: 4 */
        /* padding: 5 */
        /* last cacheline: 8 bytes */
};

After:
struct vma_munmap_struct {
        struct vma_iterator *      vmi;                  /*     0     8 */
        struct vm_area_struct *    vma;                  /*     8     8 */
        struct vm_area_struct *    prev;                 /*    16     8 */
        struct vm_area_struct *    next;                 /*    24     8 */
        struct list_head *         uf;                   /*    32     8 */
        long unsigned int          start;                /*    40     8 */
        long unsigned int          end;                  /*    48     8 */
        long unsigned int          unmap_start;          /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        long unsigned int          unmap_end;            /*    64     8 */
        int                        vma_count;            /*    72     4 */
        bool                       unlock;               /*    76     1 */
        bool                       clear_ptes;           /*    77     1 */
        bool                       closed_vm_ops;        /*    78     1 */

        /* XXX 1 byte hole, try to pack */

        long unsigned int          nr_pages;             /*    80     8 */
        long unsigned int          locked_vm;            /*    88     8 */
        long unsigned int          nr_accounted;         /*    96     8 */
        long unsigned int          exec_vm;              /*   104     8 */
        long unsigned int          stack_vm;             /*   112     8 */
        long unsigned int          data_vm;              /*   120     8 */

        /* size: 128, cachelines: 2, members: 19 */
        /* sum members: 127, holes: 1, sum holes: 1 */
};
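
The size change can be sanity-checked with a few lines of userspace C; the
structs below are only stand-ins that mirror the member sizes reported above
(the names are illustrative, not the kernel definitions):

#include <stdbool.h>
#include <stdio.h>

/* Mirrors the "After" layout: the bools fill the hole after the int. */
struct after_layout {
	void *ptrs[5];				/* vmi, vma, prev, next, uf */
	unsigned long start, end, unmap_start, unmap_end;
	int vma_count;
	bool unlock, clear_ptes, closed_vm_ops;
	unsigned long counters[6];		/* nr_pages .. data_vm */
};

/* Mirrors the "Before" layout: the bools trail the unsigned longs. */
struct before_layout {
	void *ptrs[5];
	unsigned long start, end, unmap_start, unmap_end;
	int vma_count;
	unsigned long counters[6];
	bool unlock, clear_ptes, closed_vm_ops;
};

int main(void)
{
	/* On an LP64 target this should print "before: 136 after: 128". */
	printf("before: %zu after: %zu\n",
	       sizeof(struct before_layout), sizeof(struct after_layout));
	return 0;
}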

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/vma.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/vma.h b/mm/vma.h
index 6900e95bcaac..85616faa4490 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -40,15 +40,16 @@ struct vma_munmap_struct {
 	unsigned long unmap_start;      /* Unmap PTE start */
 	unsigned long unmap_end;        /* Unmap PTE end */
 	int vma_count;                  /* Number of vmas that will be removed */
+	bool unlock;                    /* Unlock after the munmap */
+	bool clear_ptes;                /* If there are outstanding PTE to be cleared */
+	bool closed_vm_ops;		/* call_mmap() was encountered, so vmas may be closed */
+	/* 1 byte hole */
 	unsigned long nr_pages;         /* Number of pages being removed */
 	unsigned long locked_vm;        /* Number of locked pages */
 	unsigned long nr_accounted;     /* Number of VM_ACCOUNT pages */
 	unsigned long exec_vm;
 	unsigned long stack_vm;
 	unsigned long data_vm;
-	bool unlock;                    /* Unlock after the munmap */
-	bool clear_ptes;                /* If there are outstanding PTE to be cleared */
-	bool closed_vm_ops;		/* call_mmap() was encountered, so vmas may be closed */
 };
 
 #ifdef CONFIG_DEBUG_VM_MAPLE_TREE
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v8 00/21] Avoid MAP_FIXED gap exposure
  2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (20 preceding siblings ...)
  2024-08-30  4:01 ` [PATCH v8 21/21] mm/vma.h: Optimise vma_munmap_struct Liam R. Howlett
@ 2024-08-30 16:05 ` Jeff Xu
  2024-08-30 17:07   ` Liam R. Howlett
  21 siblings, 1 reply; 36+ messages in thread
From: Jeff Xu @ 2024-08-30 16:05 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
	Lorenzo Stoakes, Matthew Wilcox, Vlastimil Babka, sidhartha.kumar,
	Bert Karwatzki, Jiri Olsa, Kees Cook, Paul E . McKenney

Hi Liam

On Thu, Aug 29, 2024 at 9:01 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> Changes since v7:
>
> This is all the patches I've sent for v7 fixups plus the return code for
> mseal().  The incorrect return code was introduced in an earlier patch
> and then modified (still incorrectly) later, so this version will
> hopefully bisect cleanly.
>
> - Fixed return type of vms_gather_munmap_vmas() to -ENOMEM or -EPERM
> - Passed through error returned from vms_gather_munmap_vmas() in
>   mmap_region() - Thanks Jeff

I would think it is cleaner to fix the original commit directly
instead of as part of a large patch series, no?  If that is not possible,
maybe mm-unstable should apply my fix first and then you can refactor
based on that?

-Jeff


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v8 00/21] Avoid MAP_FIXED gap exposure
  2024-08-30 16:05 ` [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Jeff Xu
@ 2024-08-30 17:07   ` Liam R. Howlett
  0 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-08-30 17:07 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
	Lorenzo Stoakes, Matthew Wilcox, Vlastimil Babka, sidhartha.kumar,
	Bert Karwatzki, Jiri Olsa, Kees Cook, Paul E . McKenney

* Jeff Xu <jeffxu@chromium.org> [240830 12:06]:
> Hi Liam
> 
> On Thu, Aug 29, 2024 at 9:01 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > Changes since v7:
> >
> > This is all the patches I've sent for v7 fixups plus the return code for
> > mseal().  The incorrect return code was introduced in an earlier patch
> > and then modified (still incorrectly) later, so this version will
> > hopefully bisect cleanly.
> >
> > - Fixed return type of vms_gather_munmap_vmas() to -ENOMEM or -EPERM
> > - Passed through error returned from vms_gather_munmap_vmas() in
> >   mmap_region() - Thanks Jeff
> 
> I would think it is cleaner to fix the original commit directly
> instead of as part of a large patch series, no?  If that is not possible,
> maybe mm-unstable should apply my fix first and then you can refactor
> based on that?

No, it is not.  These patches are not upstream, so the Fixes: line will
be stale before it is even available.  No one is affected except the
mm-unstable branch and linux-next.  Patches to commits that are not
upstream can change, and almost always do with fixes like this.

What I have done here is fix the series so that each patch keeps passing
the mseal() tests.  That way, if there is an issue with mseal(), it can be
found with git bisect without landing on a problem that has already been
fixed.

If your fix was put into mm-unstable, all linux-next branches would have
an intermittent bug present on bisection, and waiting to respin the
patches means they miss out on testing time.

This is why we squash fixes into patches during development, or ask akpm
to do so by replying to the broken patch.  Andrew prefers that large patch
sets not be resent because something may be missed, so smaller changes are
sent with a request to squash them on his side.

In this case, there would have been 3 or 4 patches that needed to be
updated (two changed, two with merge issues I believe), so I sent a v8
because the alternative would have been confusing.

Thanks,
Liam


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure
  2024-08-30  4:00 ` [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure Liam R. Howlett
@ 2024-09-03  3:07   ` Pengfei Xu
  2024-09-03 11:00     ` Lorenzo Stoakes
  0 siblings, 1 reply; 36+ messages in thread
From: Pengfei Xu @ 2024-09-03  3:07 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
	Lorenzo Stoakes, Matthew Wilcox, Vlastimil Babka, sidhartha.kumar,
	Bert Karwatzki, Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu,
	syzkaller-bugs

Hi Liam R. Howlett,

Greetings!

There is a WARNING in __split_vma in next-20240902 in a syzkaller fuzzing test.
Bisected and found the first bad commit:
"
3483c95414f9 mm: change failure of MAP_FIXED to restoring the gap on failure
"
It's the same as the patch below.
After reverting the above commit on top of next-20240902, this issue was gone.

All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240903_092137___split_vma
Syzkaller repro code: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.c
Syzkaller repro syscall steps: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.prog
Syzkaller report: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.report
Kconfig(make olddefconfig): https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/kconfig_origin
Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/bisect_info.log
bzImage: https://github.com/xupengfe/syzkaller_logs/raw/main/240903_092137___split_vma/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0.tar.gz
Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log

And "KASAN: slab-use-after-free Read in acct_collect" also pointed to the
same commit, all detailed info:
https://github.com/xupengfe/syzkaller_logs/tree/main/240903_090000_acct_collect

"
[   19.953726] cgroup: Unknown subsys name 'net'
[   20.045121] cgroup: Unknown subsys name 'rlimit'
[   20.138332] ------------[ cut here ]------------
[   20.138634] WARNING: CPU: 1 PID: 732 at include/linux/maple_tree.h:733 __split_vma+0x4d7/0x1020
[   20.139075] Modules linked in:
[   20.139245] CPU: 1 UID: 0 PID: 732 Comm: repro Not tainted 6.11.0-rc6-next-20240902-ecc768a84f0b #1
[   20.139779] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   20.140337] RIP: 0010:__split_vma+0x4d7/0x1020
[   20.140572] Code: 89 ee 48 8b 40 10 48 89 c7 48 89 85 00 ff ff ff e8 8e 61 a7 ff 48 8b 85 00 ff ff ff 4c 39 e8 0f 83 ea fd ff ff e8 b9 5e a7 ff <0f> 0b e9 de fd ff ff 48 8b 85 30 ff ff ff 48 83 c0 10 48 89 85 18
[   20.141476] RSP: 0018:ffff8880217379a0 EFLAGS: 00010293
[   20.141749] RAX: 0000000000000000 RBX: ffff8880132351e0 RCX: ffffffff81bf6117
[   20.142106] RDX: ffff888012c30000 RSI: ffffffff81bf6187 RDI: 0000000000000006
[   20.142457] RBP: ffff888021737aa0 R08: 0000000000000001 R09: ffffed100263d3cd
[   20.142814] R10: 0000000020ff9000 R11: 0000000000000001 R12: ffff888021737e40
[   20.143173] R13: 0000000020ff9000 R14: 0000000020ffc000 R15: ffff888013235d20
[   20.143529] FS:  00007eff937f9740(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000
[   20.144308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   20.144600] CR2: 0000000020000040 CR3: 000000001f464003 CR4: 0000000000770ef0
[   20.144958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   20.145313] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[   20.145665] PKRU: 55555554
[   20.145809] Call Trace:
[   20.145940]  <TASK>
[   20.146056]  ? show_regs+0x6d/0x80
[   20.146247]  ? __warn+0xf3/0x380
[   20.146431]  ? report_bug+0x25e/0x4b0
[   20.146650]  ? __split_vma+0x4d7/0x1020
[   20.146864]  ? report_bug+0x2cb/0x4b0
[   20.147070]  ? __split_vma+0x4d7/0x1020
[   20.147281]  ? __split_vma+0x4d8/0x1020
[   20.147492]  ? handle_bug+0xf1/0x190
[   20.147756]  ? exc_invalid_op+0x3c/0x80
[   20.147971]  ? asm_exc_invalid_op+0x1f/0x30
[   20.148208]  ? __split_vma+0x467/0x1020
[   20.148417]  ? __split_vma+0x4d7/0x1020
[   20.148628]  ? __split_vma+0x4d7/0x1020
[   20.148845]  ? __pfx___split_vma+0x10/0x10
[   20.149068]  ? __kasan_check_read+0x15/0x20
[   20.149300]  ? mark_lock.part.0+0xf3/0x17b0
[   20.149535]  ? __kasan_check_read+0x15/0x20
[   20.149769]  vms_gather_munmap_vmas+0x178/0xf70
[   20.150024]  do_vmi_align_munmap+0x26e/0x640
[   20.150257]  ? __pfx___lock_acquire+0x10/0x10
[   20.150500]  ? __pfx_do_vmi_align_munmap+0x10/0x10
[   20.150758]  ? __this_cpu_preempt_check+0x21/0x30
[   20.151018]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[   20.151308]  ? mtree_range_walk+0x728/0xb70
[   20.151602]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[   20.151891]  ? mas_walk+0x6a7/0x8b0
[   20.152096]  do_vmi_munmap+0x202/0x400
[   20.152307]  __vm_munmap+0x182/0x380
[   20.152509]  ? __pfx___vm_munmap+0x10/0x10
[   20.152729]  ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
[   20.153061]  ? lockdep_hardirqs_on+0x89/0x110
[   20.153299]  ? trace_hardirqs_on+0x51/0x60
[   20.153533]  ? __audit_syscall_entry+0x39c/0x500
[   20.153790]  __x64_sys_munmap+0x62/0x90
[   20.154001]  x64_sys_call+0x198f/0x2140
[   20.154212]  do_syscall_64+0x6d/0x140
[   20.154414]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   20.154683] RIP: 0033:0x7eff9343ee5d
[   20.154885] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[   20.155872] RSP: 002b:00007ffe28711628 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
[   20.156267] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff9343ee5d
[   20.156635] RDX: ffffffffffffff80 RSI: 0000000000001000 RDI: 0000000020ffc000
[   20.157002] RBP: 00007ffe28711640 R08: 0000000000000000 R09: 00007ffe28711640
[   20.157369] R10: 0000000000000003 R11: 0000000000000206 R12: 00007ffe287117d8
[   20.157737] R13: 00000000004027cc R14: 0000000000404e08 R15: 00007eff93844000
[   20.158122]  </TASK>
[   20.158245] irq event stamp: 2019
[   20.158426] hardirqs last  enabled at (2027): [<ffffffff814629e4>] console_unlock+0x224/0x240
[   20.158869] hardirqs last disabled at (2034): [<ffffffff814629c9>] console_unlock+0x209/0x240
[   20.159306] softirqs last  enabled at (1954): [<ffffffff81289969>] __irq_exit_rcu+0xa9/0x120
[   20.159797] softirqs last disabled at (2051): [<ffffffff81289969>] __irq_exit_rcu+0xa9/0x120
[   20.160233] ---[ end trace 0000000000000000 ]---
"

I hope it's helpful.

Thanks!

---

If you don't need the following environment to reproduce the problem, or if you
already have a working reproduction environment, please ignore the information below.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You can use the command below to log in; there is no password for root.
ssh -p 10023 root@localhost

After logging in to the vm (virtual machine) successfully, you can transfer the
reproducer binary to the vm as shown below and reproduce the problem in the vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for the target kernel:
Please use the target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           // x should be no more than the number of cpus your pc has

Fill the bzImage file into the above start3.sh to load the target kernel in the vm.

Tips:
If you already have qemu-system-x86_64, please ignore the info below.
If you want to install the qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install

Best Regards,
Thanks!


On 2024-08-30 at 00:00:55 -0400, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> 
> Prior to call_mmap(), the vmas that will be replaced need to clear the
> way for what may happen in the call_mmap().  This clean up work includes
> clearing the ptes and calling the close() vm_ops.  Some users do more
> setup than can be restored by calling the vm_ops open() function.  It is
> safer to store the gap in the vma tree in these cases.
> 
> That is to say that the failure scenario that existed before the
> MAP_FIXED gap exposure is restored as it is safer than trying to undo a
> partial mapping.
> 
> Since abort_munmap_vmas() is only reattaching vmas with this change, the
> function is renamed to reattach_vmas().
> 
> There is also a secondary failure that may occur if there is not enough
> memory to store the gap.  In this case, the vmas are reattached and
> resources freed.  If the system cannot complete the call_mmap() and
> fails to allocate with GFP_KERNEL, then the system will print a warning
> about the failure.
> 
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  mm/mmap.c |  3 +--
>  mm/vma.c  |  4 +--
>  mm/vma.h  | 80 ++++++++++++++++++++++++++++++++++++++++---------------
>  3 files changed, 61 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 405e0432c78e..e1e5c78b6c3c 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1623,8 +1623,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  		vm_unacct_memory(charged);
>  
>  abort_munmap:
> -	if (vms.nr_pages)
> -		abort_munmap_vmas(&mas_detach, vms.closed_vm_ops);
> +	vms_abort_munmap_vmas(&vms, &mas_detach);
>  gather_failed:
>  	validate_mm(mm);
>  	return error;
> diff --git a/mm/vma.c b/mm/vma.c
> index 648c58da8ad4..d2d71d659d1e 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -878,7 +878,7 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
>  munmap_gather_failed:
>  end_split_failed:
>  modify_vma_failed:
> -	abort_munmap_vmas(mas_detach, /* closed = */ false);
> +	reattach_vmas(mas_detach);
>  start_split_failed:
>  map_count_exceeded:
>  	return error;
> @@ -923,7 +923,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	return 0;
>  
>  clear_tree_failed:
> -	abort_munmap_vmas(&mas_detach, /* closed = */ false);
> +	reattach_vmas(&mas_detach);
>  gather_failed:
>  	validate_mm(mm);
>  	return error;
> diff --git a/mm/vma.h b/mm/vma.h
> index 64b44f5a0a11..b2306d13d456 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -82,6 +82,22 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	       unsigned long start, unsigned long end, pgoff_t pgoff);
>  
> +static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> +			struct vm_area_struct *vma, gfp_t gfp)
> +
> +{
> +	if (vmi->mas.status != ma_start &&
> +	    ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
> +		vma_iter_invalidate(vmi);
> +
> +	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> +	mas_store_gfp(&vmi->mas, vma, gfp);
> +	if (unlikely(mas_is_err(&vmi->mas)))
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
>  #ifdef CONFIG_MMU
>  /*
>   * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
> @@ -129,24 +145,60 @@ void vms_clean_up_area(struct vma_munmap_struct *vms,
>  		struct ma_state *mas_detach, bool mm_wr_locked);
>  
>  /*
> - * abort_munmap_vmas - Undo any munmap work and free resources
> + * reattach_vmas() - Undo any munmap work and free resources
> + * @mas_detach: The maple state with the detached maple tree
>   *
>   * Reattach any detached vmas and free up the maple tree used to track the vmas.
>   */
> -static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
> +static inline void reattach_vmas(struct ma_state *mas_detach)
>  {
>  	struct vm_area_struct *vma;
>  
>  	mas_set(mas_detach, 0);
> -	mas_for_each(mas_detach, vma, ULONG_MAX) {
> +	mas_for_each(mas_detach, vma, ULONG_MAX)
>  		vma_mark_detached(vma, false);
> -		if (closed && vma->vm_ops && vma->vm_ops->open)
> -			vma->vm_ops->open(vma);
> -	}
>  
>  	__mt_destroy(mas_detach->tree);
>  }
>  
> +/*
> + * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap()
> + * operation.
> + * @vms: The vma unmap structure
> + * @mas_detach: The maple state with the detached maple tree
> + *
> + * Reattach any detached vmas, free up the maple tree used to track the vmas.
> + * If that's not possible because the ptes are cleared (and vm_ops->closed() may
> + * have been called), then a NULL is written over the vmas and the vmas are
> + * removed (munmap() completed).
> + */
> +static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
> +		struct ma_state *mas_detach)
> +{
> +	struct ma_state *mas = &vms->vmi->mas;
> +	if (!vms->nr_pages)
> +		return;
> +
> +	if (vms->clear_ptes)
> +		return reattach_vmas(mas_detach);
> +
> +	/*
> +	 * Aborting cannot just call the vm_ops open() because they are often
> +	 * not symmetrical and state data has been lost.  Resort to the old
> +	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
> +	 */
> +	mas_set_range(mas, vms->start, vms->end);
> +	if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) {
> +		pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
> +			     current->comm, current->pid);
> +		/* Leaving vmas detached and in-tree may hamper recovery */
> +		reattach_vmas(mas_detach);
> +	} else {
> +		/* Clean up the insertion of the unfortunate gap */
> +		vms_complete_munmap_vmas(vms, mas_detach);
> +	}
> +}
> +
>  int
>  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  		    struct mm_struct *mm, unsigned long start,
> @@ -299,22 +351,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
>  	return mas_prev(&vmi->mas, min);
>  }
>  
> -static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> -			struct vm_area_struct *vma, gfp_t gfp)
> -{
> -	if (vmi->mas.status != ma_start &&
> -	    ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
> -		vma_iter_invalidate(vmi);
> -
> -	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> -	mas_store_gfp(&vmi->mas, vma, gfp);
> -	if (unlikely(mas_is_err(&vmi->mas)))
> -		return -ENOMEM;
> -
> -	return 0;
> -}
> -
> -
>  /*
>   * These three helpers classifies VMAs for virtual memory accounting.
>   */
> -- 
> 2.43.0
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure
  2024-09-03  3:07   ` Pengfei Xu
@ 2024-09-03 11:00     ` Lorenzo Stoakes
  2024-09-03 12:27       ` Lorenzo Stoakes
  0 siblings, 1 reply; 36+ messages in thread
From: Lorenzo Stoakes @ 2024-09-03 11:00 UTC (permalink / raw)
  To: Pengfei Xu
  Cc: Liam R. Howlett, Andrew Morton, linux-mm, linux-kernel,
	Suren Baghdasaryan, Matthew Wilcox, Vlastimil Babka,
	sidhartha.kumar, Bert Karwatzki, Jiri Olsa, Kees Cook,
	Paul E . McKenney, Jeff Xu, syzkaller-bugs

On Tue, Sep 03, 2024 at 11:07:38AM GMT, Pengfei Xu wrote:
> Hi Liam R. Howlett,
>
> Greetings!
>
> There is a WARNING in __split_vma in next-20240902 in a syzkaller fuzzing test.
> Bisected and found the first bad commit:
> "
> 3483c95414f9 mm: change failure of MAP_FIXED to restoring the gap on failure
> "
> It's the same as the patch below.
> After reverting the above commit on top of next-20240902, this issue was gone.
>
> All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240903_092137___split_vma
> Syzkaller repro code: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.c
> Syzkaller repro syscall steps: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.prog
> Syzkaller report: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.report
> Kconfig(make olddefconfig): https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/kconfig_origin
> Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/bisect_info.log
> bzImage: https://github.com/xupengfe/syzkaller_logs/raw/main/240903_092137___split_vma/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0.tar.gz
> Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log
>
> And "KASAN: slab-use-after-free Read in acct_collect" also pointed to the
> same commit, all detailed info:
> https://github.com/xupengfe/syzkaller_logs/tree/main/240903_090000_acct_collect
>
> "

Thanks for the report! Looking into it.

> [   19.953726] cgroup: Unknown subsys name 'net'
> [   20.045121] cgroup: Unknown subsys name 'rlimit'
> [   20.138332] ------------[ cut here ]------------
> [   20.138634] WARNING: CPU: 1 PID: 732 at include/linux/maple_tree.h:733 __split_vma+0x4d7/0x1020
> [   20.139075] Modules linked in:
> [   20.139245] CPU: 1 UID: 0 PID: 732 Comm: repro Not tainted 6.11.0-rc6-next-20240902-ecc768a84f0b #1
> [   20.139779] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [   20.140337] RIP: 0010:__split_vma+0x4d7/0x1020
> [   20.140572] Code: 89 ee 48 8b 40 10 48 89 c7 48 89 85 00 ff ff ff e8 8e 61 a7 ff 48 8b 85 00 ff ff ff 4c 39 e8 0f 83 ea fd ff ff e8 b9 5e a7 ff <0f> 0b e9 de fd ff ff 48 8b 85 30 ff ff ff 48 83 c0 10 48 89 85 18
> [   20.141476] RSP: 0018:ffff8880217379a0 EFLAGS: 00010293
> [   20.141749] RAX: 0000000000000000 RBX: ffff8880132351e0 RCX: ffffffff81bf6117
> [   20.142106] RDX: ffff888012c30000 RSI: ffffffff81bf6187 RDI: 0000000000000006
> [   20.142457] RBP: ffff888021737aa0 R08: 0000000000000001 R09: ffffed100263d3cd
> [   20.142814] R10: 0000000020ff9000 R11: 0000000000000001 R12: ffff888021737e40
> [   20.143173] R13: 0000000020ff9000 R14: 0000000020ffc000 R15: ffff888013235d20
> [   20.143529] FS:  00007eff937f9740(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000
> [   20.144308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   20.144600] CR2: 0000000020000040 CR3: 000000001f464003 CR4: 0000000000770ef0
> [   20.144958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   20.145313] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [   20.145665] PKRU: 55555554
> [   20.145809] Call Trace:
> [   20.145940]  <TASK>
> [   20.146056]  ? show_regs+0x6d/0x80
> [   20.146247]  ? __warn+0xf3/0x380
> [   20.146431]  ? report_bug+0x25e/0x4b0
> [   20.146650]  ? __split_vma+0x4d7/0x1020

Have repro'd locally. This is, unsurprisingly, on this line (even if trace above
doesn't decode to it unfortunately):

	vma_iter_config(vmi, new->vm_start, new->vm_end);

The VMA in question spans 0x20ff9000, 0x21000000 so is 7 pages in size.

At the point of invoking vma_iter_config(), the vma iterator points at
0x20ff9001, but we try to position it to 0x20ff9000.

It seems the issue is that in do_vmi_munmap(), after vma_find() is called, we
find a VMA at 0x20ff9000, but the VMI is positioned to 0x20ff9001...!

Perhaps maple tree corruption in a previous call somehow?


Interestingly, I can only repro this if I clear the qemu image each time; I'm
guessing this is somehow tied to the instantiation of the cgroup setup or some
such?

Am continuing the investigation.


> [   20.146864]  ? report_bug+0x2cb/0x4b0
> [   20.147070]  ? __split_vma+0x4d7/0x1020
> [   20.147281]  ? __split_vma+0x4d8/0x1020
> [   20.147492]  ? handle_bug+0xf1/0x190
> [   20.147756]  ? exc_invalid_op+0x3c/0x80
> [   20.147971]  ? asm_exc_invalid_op+0x1f/0x30
> [   20.148208]  ? __split_vma+0x467/0x1020
> [   20.148417]  ? __split_vma+0x4d7/0x1020
> [   20.148628]  ? __split_vma+0x4d7/0x1020
> [   20.148845]  ? __pfx___split_vma+0x10/0x10
> [   20.149068]  ? __kasan_check_read+0x15/0x20
> [   20.149300]  ? mark_lock.part.0+0xf3/0x17b0
> [   20.149535]  ? __kasan_check_read+0x15/0x20
> [   20.149769]  vms_gather_munmap_vmas+0x178/0xf70
> [   20.150024]  do_vmi_align_munmap+0x26e/0x640
> [   20.150257]  ? __pfx___lock_acquire+0x10/0x10
> [   20.150500]  ? __pfx_do_vmi_align_munmap+0x10/0x10
> [   20.150758]  ? __this_cpu_preempt_check+0x21/0x30
> [   20.151018]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [   20.151308]  ? mtree_range_walk+0x728/0xb70
> [   20.151602]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [   20.151891]  ? mas_walk+0x6a7/0x8b0
> [   20.152096]  do_vmi_munmap+0x202/0x400
> [   20.152307]  __vm_munmap+0x182/0x380
> [   20.152509]  ? __pfx___vm_munmap+0x10/0x10
> [   20.152729]  ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
> [   20.153061]  ? lockdep_hardirqs_on+0x89/0x110
> [   20.153299]  ? trace_hardirqs_on+0x51/0x60
> [   20.153533]  ? __audit_syscall_entry+0x39c/0x500
> [   20.153790]  __x64_sys_munmap+0x62/0x90
> [   20.154001]  x64_sys_call+0x198f/0x2140
> [   20.154212]  do_syscall_64+0x6d/0x140
> [   20.154414]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   20.154683] RIP: 0033:0x7eff9343ee5d
> [   20.154885] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
> [   20.155872] RSP: 002b:00007ffe28711628 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
> [   20.156267] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff9343ee5d
> [   20.156635] RDX: ffffffffffffff80 RSI: 0000000000001000 RDI: 0000000020ffc000
> [   20.157002] RBP: 00007ffe28711640 R08: 0000000000000000 R09: 00007ffe28711640
> [   20.157369] R10: 0000000000000003 R11: 0000000000000206 R12: 00007ffe287117d8
> [   20.157737] R13: 00000000004027cc R14: 0000000000404e08 R15: 00007eff93844000
> [   20.158122]  </TASK>
> [   20.158245] irq event stamp: 2019
> [   20.158426] hardirqs last  enabled at (2027): [<ffffffff814629e4>] console_unlock+0x224/0x240
> [   20.158869] hardirqs last disabled at (2034): [<ffffffff814629c9>] console_unlock+0x209/0x240
> [   20.159306] softirqs last  enabled at (1954): [<ffffffff81289969>] __irq_exit_rcu+0xa9/0x120
> [   20.159797] softirqs last disabled at (2051): [<ffffffff81289969>] __irq_exit_rcu+0xa9/0x120
> [   20.160233] ---[ end trace 0000000000000000 ]---
> "
>
> I hope it's helpful.
>
> Thanks!
>
> ---
>
> If you don't need the following environment to reproduce the problem, or if you
> already have a working reproduction environment, please ignore the information below.
>
> How to reproduce:
> git clone https://gitlab.com/xupengfe/repro_vm_env.git
> cd repro_vm_env
> tar -xvf repro_vm_env.tar.gz
> cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
>   // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
>   // You could change the bzImage_xxx as you want
>   // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
> You can use the command below to log in; there is no password for root.
> ssh -p 10023 root@localhost
>
> After logging in to the vm (virtual machine) successfully, you can transfer the
> reproducer binary to the vm as shown below and reproduce the problem in the vm:
> gcc -pthread -o repro repro.c
> scp -P 10023 repro root@localhost:/root/
>
> Get the bzImage for the target kernel:
> Please use the target kconfig and copy it to kernel_src/.config
> make olddefconfig
> make -jx bzImage           // x should be no more than the number of cpus your pc has
>
> Fill the bzImage file into the above start3.sh to load the target kernel in the vm.
>
> Tips:
> If you already have qemu-system-x86_64, please ignore the info below.
> If you want to install the qemu v7.1.0 version:
> git clone https://github.com/qemu/qemu.git
> cd qemu
> git checkout -f v7.1.0
> mkdir build
> cd build
> yum install -y ninja-build.x86_64
> yum -y install libslirp-devel.x86_64
> ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
> make
> make install
>
> Best Regards,
> Thanks!
>
>
> On 2024-08-30 at 00:00:55 -0400, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Prior to call_mmap(), the vmas that will be replaced need to clear the
> > way for what may happen in the call_mmap().  This clean up work includes
> > clearing the ptes and calling the close() vm_ops.  Some users do more
> > setup than can be restored by calling the vm_ops open() function.  It is
> > safer to store the gap in the vma tree in these cases.
> >
> > That is to say that the failure scenario that existed before the
> > MAP_FIXED gap exposure is restored as it is safer than trying to undo a
> > partial mapping.
> >
> > Since abort_munmap_vmas() is only reattaching vmas with this change, the
> > function is renamed to reattach_vmas().
> >
> > There is also a secondary failure that may occur if there is not enough
> > memory to store the gap.  In this case, the vmas are reattached and
> > resources freed.  If the system cannot complete the call_mmap() and
> > fails to allocate with GFP_KERNEL, then the system will print a warning
> > about the failure.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  mm/mmap.c |  3 +--
> >  mm/vma.c  |  4 +--
> >  mm/vma.h  | 80 ++++++++++++++++++++++++++++++++++++++++---------------
> >  3 files changed, 61 insertions(+), 26 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 405e0432c78e..e1e5c78b6c3c 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1623,8 +1623,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  		vm_unacct_memory(charged);
> >
> >  abort_munmap:
> > -	if (vms.nr_pages)
> > -		abort_munmap_vmas(&mas_detach, vms.closed_vm_ops);
> > +	vms_abort_munmap_vmas(&vms, &mas_detach);
> >  gather_failed:
> >  	validate_mm(mm);
> >  	return error;
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 648c58da8ad4..d2d71d659d1e 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -878,7 +878,7 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> >  munmap_gather_failed:
> >  end_split_failed:
> >  modify_vma_failed:
> > -	abort_munmap_vmas(mas_detach, /* closed = */ false);
> > +	reattach_vmas(mas_detach);
> >  start_split_failed:
> >  map_count_exceeded:
> >  	return error;
> > @@ -923,7 +923,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  	return 0;
> >
> >  clear_tree_failed:
> > -	abort_munmap_vmas(&mas_detach, /* closed = */ false);
> > +	reattach_vmas(&mas_detach);
> >  gather_failed:
> >  	validate_mm(mm);
> >  	return error;
> > diff --git a/mm/vma.h b/mm/vma.h
> > index 64b44f5a0a11..b2306d13d456 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -82,6 +82,22 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  	       unsigned long start, unsigned long end, pgoff_t pgoff);
> >
> > +static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> > +			struct vm_area_struct *vma, gfp_t gfp)
> > +
> > +{
> > +	if (vmi->mas.status != ma_start &&
> > +	    ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
> > +		vma_iter_invalidate(vmi);
> > +
> > +	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> > +	mas_store_gfp(&vmi->mas, vma, gfp);
> > +	if (unlikely(mas_is_err(&vmi->mas)))
> > +		return -ENOMEM;
> > +
> > +	return 0;
> > +}
> > +
> >  #ifdef CONFIG_MMU
> >  /*
> >   * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
> > @@ -129,24 +145,60 @@ void vms_clean_up_area(struct vma_munmap_struct *vms,
> >  		struct ma_state *mas_detach, bool mm_wr_locked);
> >
> >  /*
> > - * abort_munmap_vmas - Undo any munmap work and free resources
> > + * reattach_vmas() - Undo any munmap work and free resources
> > + * @mas_detach: The maple state with the detached maple tree
> >   *
> >   * Reattach any detached vmas and free up the maple tree used to track the vmas.
> >   */
> > -static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
> > +static inline void reattach_vmas(struct ma_state *mas_detach)
> >  {
> >  	struct vm_area_struct *vma;
> >
> >  	mas_set(mas_detach, 0);
> > -	mas_for_each(mas_detach, vma, ULONG_MAX) {
> > +	mas_for_each(mas_detach, vma, ULONG_MAX)
> >  		vma_mark_detached(vma, false);
> > -		if (closed && vma->vm_ops && vma->vm_ops->open)
> > -			vma->vm_ops->open(vma);
> > -	}
> >
> >  	__mt_destroy(mas_detach->tree);
> >  }
> >
> > +/*
> > + * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap()
> > + * operation.
> > + * @vms: The vma unmap structure
> > + * @mas_detach: The maple state with the detached maple tree
> > + *
> > + * Reattach any detached vmas, free up the maple tree used to track the vmas.
> > + * If that's not possible because the ptes are cleared (and vm_ops->closed() may
> > + * have been called), then a NULL is written over the vmas and the vmas are
> > + * removed (munmap() completed).
> > + */
> > +static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
> > +		struct ma_state *mas_detach)
> > +{
> > +	struct ma_state *mas = &vms->vmi->mas;
> > +	if (!vms->nr_pages)
> > +		return;
> > +
> > +	if (vms->clear_ptes)
> > +		return reattach_vmas(mas_detach);
> > +
> > +	/*
> > +	 * Aborting cannot just call the vm_ops open() because they are often
> > +	 * not symmetrical and state data has been lost.  Resort to the old
> > +	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
> > +	 */
> > +	mas_set_range(mas, vms->start, vms->end);
> > +	if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) {
> > +		pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
> > +			     current->comm, current->pid);
> > +		/* Leaving vmas detached and in-tree may hamper recovery */
> > +		reattach_vmas(mas_detach);
> > +	} else {
> > +		/* Clean up the insertion of the unfortunate gap */
> > +		vms_complete_munmap_vmas(vms, mas_detach);
> > +	}
> > +}
> > +
> >  int
> >  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  		    struct mm_struct *mm, unsigned long start,
> > @@ -299,22 +351,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
> >  	return mas_prev(&vmi->mas, min);
> >  }
> >
> > -static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> > -			struct vm_area_struct *vma, gfp_t gfp)
> > -{
> > -	if (vmi->mas.status != ma_start &&
> > -	    ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
> > -		vma_iter_invalidate(vmi);
> > -
> > -	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> > -	mas_store_gfp(&vmi->mas, vma, gfp);
> > -	if (unlikely(mas_is_err(&vmi->mas)))
> > -		return -ENOMEM;
> > -
> > -	return 0;
> > -}
> > -
> > -
> >  /*
> >   * These three helpers classifies VMAs for virtual memory accounting.
> >   */
> > --
> > 2.43.0
> >


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure
  2024-09-03 11:00     ` Lorenzo Stoakes
@ 2024-09-03 12:27       ` Lorenzo Stoakes
  2024-09-03 16:03         ` Liam R. Howlett
  0 siblings, 1 reply; 36+ messages in thread
From: Lorenzo Stoakes @ 2024-09-03 12:27 UTC (permalink / raw)
  To: Pengfei Xu, Andrew Morton
  Cc: Liam R. Howlett, linux-mm, linux-kernel, Suren Baghdasaryan,
	Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, syzkaller-bugs

Hi Andrew - TL;DR of this is - please apply the fix patch attached below to
fix a problem in this series, thanks! :)

On Tue, Sep 03, 2024 at 12:00:04PM GMT, Lorenzo Stoakes wrote:
> On Tue, Sep 03, 2024 at 11:07:38AM GMT, Pengfei Xu wrote:
> > Hi Liam R. Howlett,
> >
> > Greetings!
> >
> > There is a WARNING in __split_vma in next-20240902 in a syzkaller fuzzing test.
> > Bisected and found the first bad commit:
> > "
> > 3483c95414f9 mm: change failure of MAP_FIXED to restoring the gap on failure
> > "
> > It's the same as the patch below.
> > After reverting the above commit on top of next-20240902, this issue was gone.
> >
> > All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240903_092137___split_vma
> > Syzkaller repro code: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.c
> > Syzkaller repro syscall steps: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.prog
> > Syzkaller report: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.report
> > Kconfig(make olddefconfig): https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/kconfig_origin
> > Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/bisect_info.log
> > bzImage: https://github.com/xupengfe/syzkaller_logs/raw/main/240903_092137___split_vma/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0.tar.gz
> > Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log
> >
> > And "KASAN: slab-use-after-free Read in acct_collect" also pointed to the
> > same commit, all detailed info:
> > https://github.com/xupengfe/syzkaller_logs/tree/main/240903_090000_acct_collect
> >
> > "
>
> Thanks for the report! Looking into it.
>
> > [   19.953726] cgroup: Unknown subsys name 'net'
> > [   20.045121] cgroup: Unknown subsys name 'rlimit'
> > [   20.138332] ------------[ cut here ]------------
> > [   20.138634] WARNING: CPU: 1 PID: 732 at include/linux/maple_tree.h:733 __split_vma+0x4d7/0x1020
> > [   20.139075] Modules linked in:
> > [   20.139245] CPU: 1 UID: 0 PID: 732 Comm: repro Not tainted 6.11.0-rc6-next-20240902-ecc768a84f0b #1
> > [   20.139779] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > [   20.140337] RIP: 0010:__split_vma+0x4d7/0x1020
> > [   20.140572] Code: 89 ee 48 8b 40 10 48 89 c7 48 89 85 00 ff ff ff e8 8e 61 a7 ff 48 8b 85 00 ff ff ff 4c 39 e8 0f 83 ea fd ff ff e8 b9 5e a7 ff <0f> 0b e9 de fd ff ff 48 8b 85 30 ff ff ff 48 83 c0 10 48 89 85 18
> > [   20.141476] RSP: 0018:ffff8880217379a0 EFLAGS: 00010293
> > [   20.141749] RAX: 0000000000000000 RBX: ffff8880132351e0 RCX: ffffffff81bf6117
> > [   20.142106] RDX: ffff888012c30000 RSI: ffffffff81bf6187 RDI: 0000000000000006
> > [   20.142457] RBP: ffff888021737aa0 R08: 0000000000000001 R09: ffffed100263d3cd
> > [   20.142814] R10: 0000000020ff9000 R11: 0000000000000001 R12: ffff888021737e40
> > [   20.143173] R13: 0000000020ff9000 R14: 0000000020ffc000 R15: ffff888013235d20
> > [   20.143529] FS:  00007eff937f9740(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000
> > [   20.144308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   20.144600] CR2: 0000000020000040 CR3: 000000001f464003 CR4: 0000000000770ef0
> > [   20.144958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [   20.145313] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> > [   20.145665] PKRU: 55555554
> > [   20.145809] Call Trace:
> > [   20.145940]  <TASK>
> > [   20.146056]  ? show_regs+0x6d/0x80
> > [   20.146247]  ? __warn+0xf3/0x380
> > [   20.146431]  ? report_bug+0x25e/0x4b0
> > [   20.146650]  ? __split_vma+0x4d7/0x1020
>
> Have repro'd locally. This is, unsurprisingly, on this line (even if trace above
> doesn't decode to it unfortunately):
>
> 	vma_iter_config(vmi, new->vm_start, new->vm_end);
>
> The VMA in question spans 0x20ff9000, 0x21000000 so is 7 pages in size.
>
> At the point of invoking vma_iter_config(), the vma iterator points at
> 0x20ff9001, but we try to position it to 0x20ff9000.
>
> It seems the issue is that in do_vmi_munmap(), after vma_find() is called, we
> find a VMA at 0x20ff9000, but the VMI is positioned to 0x20ff9001...!
>
> Perhaps maple tree corruption in a previous call somehow?
>
>
> Interestingly, I can only repro this if I clear the qemu image each time; I'm
> guessing this is somehow tied to the instantiation of the cgroup setup or some
> such?
>
> Am continuing the investigation.
>

[snip]

OK I turned on CONFIG_DEBUG_VM_MAPLE_TREE and am hitting
VM_WARN_ON_ONCE_MM(vma->vm_start != vmi_start, mm) after gather_failed is hit in
mmap_region() as a result of call_mmap() returning an error.

This is invoking kernfs_fop_mmap(), which returns -ENODEV because the
KERNFS_HAS_MMAP flag has not been set for the cgroup file being mapped.

This results in mmap_region() jumping to unmap_and_free_vma, which unmaps the
page tables in the region and goes on to abort the unmap operation.

The validate_mm() that fails is called by vms_complete_munmap_vmas() which was
invoked by vms_abort_munmap_vmas().

The tree is then corrupted:

vma ffff888013414d20 start 0000000020ff9000 end 0000000021000000 mm ffff88800d06ae40
prot 25 anon_vma ffff8880132cc660 vm_ops 0000000000000000
pgoff 20ff9 file 0000000000000000 private_data 0000000000000000
flags: 0x8100077(read|write|exec|mayread|maywrite|mayexec|account|softdirty)
tree range: ffff888013414d20 start 20ff9001 end 20ffffff

Incorrectly starting at the off-by-one 0x20ff9001 rather than 0x20ff9000. This
is a very telling off-by... the programmer's favourite off-by-1 :), which then
made me think of how mas operations have an _inclusive_ end while VMA ones have
an _exclusive_ one.

And so I tracked down the cause to vms_abort_munmap_vmas(), which was invoking
mas_set_range() using vms->end (exclusive) as if it were inclusive, and which
thus resulted in 0x20ff9000 being wrongly cleared.

The solution is simply to subtract one from it, as done in the attached
fix-patch.
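
To spell out the two conventions being mixed up (a sketch only; mas_set_range()
is used here exactly as in the fix below, and vma_iter_store_gfp() elsewhere in
the series already does the same subtraction):

	/*
	 * VMA ranges have an exclusive end:        [vm_start, vm_end)
	 * Maple tree ranges have an inclusive end: [index, last]
	 *
	 * so a store covering a VMA-style range must use end - 1:
	 */
	mas_set_range(mas, vms->start, vms->end - 1);	/* the intended range */
	/*
	 * ...whereas passing vms->end as the last index also clears the next
	 * VMA's first slot, which is exactly the corruption seen above.
	 */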

I confirmed this fixed the issue as I was able to set up a reliable repro
locally.

Thanks for the report! Great find.

----8<----
From 3e7decc5390b0edc462afa74794a8208e25e50f2 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Tue, 3 Sep 2024 13:20:34 +0100
Subject: [PATCH] mm: fix off-by-one error in vms_abort_munmap_vmas()

Maple tree ranges have an inclusive end, VMAs do not, so we must subtract
one from the VMA-specific end value when using a mas_...() function.

We failed to do so in vms_abort_munmap_vmas() which resulted in a store
overlapping the intended range by one byte, and thus corrupting the maple
tree.

Fix this by subtracting one from the vms->end value passed into mas_set_range().

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/vma.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vma.h b/mm/vma.h
index 370d3246f147..819f994cf727 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -240,7 +240,7 @@ static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
 	 * not symmetrical and state data has been lost.  Resort to the old
 	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
 	 */
-	mas_set_range(mas, vms->start, vms->end);
+	mas_set_range(mas, vms->start, vms->end - 1);
 	if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) {
 		pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
 			     current->comm, current->pid);
--
2.46.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure
  2024-09-03 12:27       ` Lorenzo Stoakes
@ 2024-09-03 16:03         ` Liam R. Howlett
  0 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-09-03 16:03 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Pengfei Xu, Andrew Morton, linux-mm, linux-kernel,
	Suren Baghdasaryan, Matthew Wilcox, Vlastimil Babka,
	sidhartha.kumar, Bert Karwatzki, Jiri Olsa, Kees Cook,
	Paul E . McKenney, Jeff Xu, syzkaller-bugs

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240903 08:27]:
> Hi Andrew - TL;DR of this is - please apply the fix patch attached below to
> fix a problem in this series, thanks! :)

Oh yes.  I should have caught this, thanks Lorenzo.

Cheers,
Liam


> ----8<----
> From 3e7decc5390b0edc462afa74794a8208e25e50f2 Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Date: Tue, 3 Sep 2024 13:20:34 +0100
> Subject: [PATCH] mm: fix off-by-one error in vms_abort_munmap_vmas()
> 
> Maple tree ranges have an inclusive end, VMAs do not, so we must subtract
> one from the VMA-specific end value when using a mas_...() function.
> 
> We failed to do so in vms_abort_munmap_vmas() which resulted in a store
> overlapping the intended range by one byte, and thus corrupting the maple
> tree.
> 
> Fix this by subtracting one from the vms->end value passed into mas_set_range().
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  mm/vma.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vma.h b/mm/vma.h
> index 370d3246f147..819f994cf727 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -240,7 +240,7 @@ static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
>  	 * not symmetrical and state data has been lost.  Resort to the old
>  	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
>  	 */
> -	mas_set_range(mas, vms->start, vms->end);
> +	mas_set_range(mas, vms->start, vms->end - 1);
>  	if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) {
>  		pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
>  			     current->comm, current->pid);
> --
> 2.46.0


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
  2024-08-30  4:00 ` [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
@ 2024-10-07 19:05   ` Jann Horn
  2024-10-07 20:31     ` Liam R. Howlett
  0 siblings, 1 reply; 36+ messages in thread
From: Jann Horn @ 2024-10-07 19:05 UTC (permalink / raw)
  To: Liam R. Howlett, Andrew Morton
  Cc: Lorenzo Stoakes, Linux-MM, kernel list, Suren Baghdasaryan,
	Matthew Wilcox, Vlastimil Babka, Sidhartha Kumar, Bert Karwatzki,
	Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu, Seth Jenkins

[-- Attachment #1: Type: text/plain, Size: 8006 bytes --]

On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> Instead of zeroing the vma tree and then overwriting the area, let the
> area be overwritten and then clean up the gathered vmas using
> vms_complete_munmap_vmas().
>
> To ensure locking is downgraded correctly, the mm is set regardless of
> MAP_FIXED or not (NULL vma).
>
> If a driver is mapping over an existing vma, then clear the ptes before
> the call_mmap() invocation.  This is done using the vms_clean_up_area()
> helper.  If there is a close vm_ops, that must also be called to ensure
> any cleanup is done before mapping over the area.  This also means that
> calling open has been added to the abort of an unmap operation, for now.

As currently implemented, this is not a valid optimization because it
violates the (unwritten?) rule that you must not call free_pgd_range()
on a region in the page tables which can concurrently be walked. A
region in the page tables can be concurrently walked if it overlaps a
VMA which is linked into rmaps which are not write-locked.

On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
the new mapping is created by expanding an adjacent VMA, the following
race with an ftruncate() is possible (because page tables for the old
mapping are removed while the new VMA in the same location is already
fully set up and linked into the rmap):


task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
========================     ==================
mmap_region
  vma_merge_new_range
    vma_expand
      commit_merge
        vma_prepare
          [take rmap locks]
        vma_set_range
          [expand adjacent mapping]
        vma_complete
          [drop rmap locks]
  vms_complete_munmap_vmas
    vms_clear_ptes
      unmap_vmas
        [removes ptes]
      free_pgtables
        [unlinks old vma from rmap]
                             unmap_mapping_range
                               unmap_mapping_pages
                                 i_mmap_lock_read
                                 unmap_mapping_range_tree
                                   [loop]
                                     unmap_mapping_range_vma
                                       zap_page_range_single
                                         unmap_single_vma
                                           unmap_page_range
                                             zap_p4d_range
                                               zap_pud_range
                                                 zap_pmd_range
                                                   [looks up pmd entry]
        free_pgd_range
          [frees pmd]
                                                   [UAF pmd entry access]

To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
to widen the race windows, then build and run the attached reproducer
mmap-fixed-race.c.

Under a kernel with KASAN, you should ideally get a KASAN splat like this:

[   90.012655][ T1113]
==================================================================
[   90.013937][ T1113] BUG: KASAN: use-after-free in
unmap_page_range+0x2d4e/0x3a80
[   90.015136][ T1113] Read of size 8 at addr ffff888006a3b000 by task
SLOWME2/1113
[   90.016322][ T1113]
[   90.016695][ T1113] CPU: 12 UID: 1000 PID: 1113 Comm: SLOWME2
Tainted: G        W          6.12.0-rc2-dirty #500
[   90.018355][ T1113] Tainted: [W]=WARN
[   90.018959][ T1113] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   90.020598][ T1113] Call Trace:
[   90.021126][ T1113]  <TASK>
[   90.021590][ T1113]  dump_stack_lvl+0x53/0x70
[   90.022307][ T1113]  print_report+0xce/0x670
[   90.023008][ T1113]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[   90.023942][ T1113]  ? unmap_page_range+0x2d4e/0x3a80
[   90.024763][ T1113]  kasan_report+0xe2/0x120
[   90.025468][ T1113]  ? unmap_page_range+0x2d4e/0x3a80
[   90.026293][ T1113]  unmap_page_range+0x2d4e/0x3a80
[   90.027087][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.027855][ T1113]  ? next_uptodate_folio+0x148/0x890
[   90.029299][ T1113]  ? set_pte_range+0x265/0x6c0
[   90.030058][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.030826][ T1113]  ? page_table_check_set.part.0+0x2ab/0x3e0
[   90.031773][ T1113]  ? __pfx_unmap_page_range+0x10/0x10
[   90.032622][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.033394][ T1113]  ? unmap_single_vma+0xc6/0x2c0
[   90.034211][ T1113]  zap_page_range_single+0x28a/0x4b0
[   90.035052][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.035821][ T1113]  ? __pfx_zap_page_range_single+0x10/0x10
[   90.036739][ T1113]  ? __pte_offset_map+0x1d/0x250
[   90.037528][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.038295][ T1113]  ? do_fault+0x6c4/0x1270
[   90.038999][ T1113]  ? __pfx___handle_mm_fault+0x10/0x10
[   90.039862][ T1113]  unmap_mapping_range+0x1b2/0x240
[   90.040671][ T1113]  ? __pfx_unmap_mapping_range+0x10/0x10
[   90.041563][ T1113]  ? setattr_prepare+0xed/0x7e0
[   90.042330][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.043097][ T1113]  ? current_time+0x88/0x200
[   90.043826][ T1113]  shmem_setattr+0x880/0xea0
[   90.044556][ T1113]  notify_change+0x42b/0xea0
[   90.045913][ T1113]  ? do_truncate+0x10b/0x1b0
[   90.046641][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.047407][ T1113]  do_truncate+0x10b/0x1b0
[   90.048107][ T1113]  ? __pfx_do_truncate+0x10/0x10
[   90.048890][ T1113]  ? __set_task_comm+0x53/0x140
[   90.049668][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.050472][ T1113]  ? __do_sys_prctl+0x71e/0x11f0
[   90.051262][ T1113]  do_ftruncate+0x334/0x470
[   90.051976][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.052745][ T1113]  do_sys_ftruncate+0x3d/0x80
[   90.053493][ T1113]  do_syscall_64+0x4b/0x110
[   90.054209][ T1113]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   90.055142][ T1113] RIP: 0033:0x7f89cbf2bf47
[   90.055844][ T1113] Code: 77 01 c3 48 8b 15 b9 1e 0d 00 f7 d8 64 89
02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 b8 4d
00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 89 1e 0d 00 f7 d8
64 89 02 b8
[   90.058924][ T1113] RSP: 002b:00007fff96968aa8 EFLAGS: 00000213
ORIG_RAX: 000000000000004d
[   90.060248][ T1113] RAX: ffffffffffffffda RBX: 00007fff96968c28
RCX: 00007f89cbf2bf47
[   90.061507][ T1113] RDX: 0000000000000000 RSI: 0000000000001000
RDI: 0000000000000003
[   90.063471][ T1113] RBP: 00007fff96968b10 R08: 0000000000000000
R09: 00007f89cbe29740
[   90.064727][ T1113] R10: 00007f89cbefb443 R11: 0000000000000213
R12: 0000000000000000
[   90.065990][ T1113] R13: 00007fff96968c38 R14: 000055ce5a58add8
R15: 00007f89cc05f020
[   90.067291][ T1113]  </TASK>
[   90.067772][ T1113]
[   90.068147][ T1113] The buggy address belongs to the physical page:
[   90.069168][ T1113] page: refcount:0 mapcount:0
mapping:0000000000000000 index:0xffff888006a3bf50 pfn:0x6a3b
[   90.070741][ T1113] flags: 0x100000000000000(node=0|zone=1)
[   90.071649][ T1113] raw: 0100000000000000 ffffea0000111748 ffffea000020c488 0000000000000000
[   90.073009][ T1113] raw: ffff888006a3bf50 0000000000000000 00000000ffffffff 0000000000000000
[   90.074368][ T1113] page dumped because: kasan: bad access detected
[   90.075382][ T1113]
[   90.075751][ T1113] Memory state around the buggy address:
[   90.076636][ T1113]  ffff888006a3af00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   90.077905][ T1113]  ffff888006a3af80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   90.079874][ T1113] >ffff888006a3b000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   90.081173][ T1113]                    ^
[   90.081821][ T1113]  ffff888006a3b080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   90.083134][ T1113]  ffff888006a3b100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   90.084416][ T1113]
==================================================================

[-- Attachment #2: mmap-vs-truncate-racewiden.diff --]
[-- Type: text/x-patch, Size: 1141 bytes --]

diff --git a/mm/memory.c b/mm/memory.c
index 2366578015ad..b95a43221058 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -78,6 +78,7 @@
 #include <linux/ptrace.h>
 #include <linux/vmalloc.h>
 #include <linux/sched/sysctl.h>
+#include <linux/delay.h>
 
 #include <trace/events/kmem.h>
 
@@ -410,6 +411,13 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 				unlink_file_vma_batch_add(&vb, vma);
 			}
 			unlink_file_vma_batch_final(&vb);
+
+			if (strcmp(current->comm, "SLOWME1") == 0) {
+				pr_warn("%s: starting delay\n", __func__);
+				mdelay(2000);
+				pr_warn("%s: ending delay\n", __func__);
+			}
+
 			free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
 		}
@@ -1711,6 +1719,13 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	unsigned long next;
 
 	pmd = pmd_offset(pud, addr);
+
+	if (strcmp(current->comm, "SLOWME2") == 0) {
+		pr_warn("%s: starting delay\n", __func__);
+		mdelay(2000);
+		pr_warn("%s: ending delay\n", __func__);
+	}
+
 	do {
 		next = pmd_addr_end(addr, end);
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {

[-- Attachment #3: mmap-fixed-race.c --]
[-- Type: text/x-csrc, Size: 1518 bytes --]

#define _GNU_SOURCE
#include <err.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <sys/mman.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

int main(void) {
  int fd = SYSCHK(memfd_create("foo", 0));
  SYSCHK(ftruncate(fd, 0x2000));
  char *map1 = SYSCHK(mmap((void*)(0x40000000-0x1000), 0x2000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED_NOREPLACE, fd, 0));
  ((volatile char *)map1)[0] = 'A';
  ((volatile char *)map1)[0x1000] = 'B';

  // Remap the second page read-only at file offset 0: the different
  // protection and offset make it its own VMA next to the RW one.
  char *map2 = map1 + 0x1000;
  SYSCHK(mmap(map2, 0x1000, PROT_READ, MAP_SHARED|MAP_FIXED, fd, 0));
  ((volatile char *)map2)[0];
  //SYSCHK(mprotect(map2, 0x1000, PROT_READ));

  /*
  int pipefds[2];
  SYSCHK(pipe(pipefds));
  */

  // Child: drop its copy of the mappings, then shrink the file to one page.
  // The ftruncate() does an rmap walk (unmap_mapping_range()) over the
  // parent's mapping of the truncated range; "SLOWME2" arms the mdelay() in
  // zap_pmd_range() from the race-widening patch so that walk lingers.
  if (SYSCHK(fork()) == 0) {
    SYSCHK(munmap(map1, 0x2000));
    sleep(1);
    SYSCHK(prctl(PR_SET_NAME, "SLOWME2"));
    SYSCHK(ftruncate(fd, 0x1000)); // rmap walk
    SYSCHK(prctl(PR_SET_NAME, "task2"));
    exit(0);
  }

  //system("cat /proc/$PPID/maps");
  // Parent: "SLOWME1" arms the mdelay() before free_pgd_range() in
  // free_pgtables(), then mmap(MAP_FIXED) over the second page at file
  // offset 0x1000 so the new mapping merges with the adjacent RW VMA and
  // the old VMA's page tables are freed while the child may be walking them.
  SYSCHK(prctl(PR_SET_NAME, "SLOWME1"));
  SYSCHK(mmap(map2, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0x1000));
  SYSCHK(prctl(PR_SET_NAME, "task1"));
  //system("cat /proc/$PPID/maps");

  /*
  for (int i=0; i<8; i++) {
    char buf[0x1000];
    memset(buf, 'A', 0x1000);
    SYSCHK(write(pipefds[1], buf, 0x1000));
  }
  */

  int wstatus;
  SYSCHK(wait(&wstatus));
  return 0;
}

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
  2024-10-07 19:05   ` [BUG] page table UAF, " Jann Horn
@ 2024-10-07 20:31     ` Liam R. Howlett
  2024-10-07 21:31       ` Jann Horn
  0 siblings, 1 reply; 36+ messages in thread
From: Liam R. Howlett @ 2024-10-07 20:31 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Lorenzo Stoakes, Linux-MM, kernel list,
	Suren Baghdasaryan, Matthew Wilcox, Vlastimil Babka,
	Sidhartha Kumar, Bert Karwatzki, Jiri Olsa, Kees Cook,
	Paul E . McKenney, Jeff Xu, Seth Jenkins

* Jann Horn <jannh@google.com> [241007 15:06]:
> On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > Instead of zeroing the vma tree and then overwriting the area, let the
> > area be overwritten and then clean up the gathered vmas using
> > vms_complete_munmap_vmas().
> >
> > To ensure locking is downgraded correctly, the mm is set regardless of
> > MAP_FIXED or not (NULL vma).
> >
> > If a driver is mapping over an existing vma, then clear the ptes before
> > the call_mmap() invocation.  This is done using the vms_clean_up_area()
> > helper.  If there is a close vm_ops, that must also be called to ensure
> > any cleanup is done before mapping over the area.  This also means that
> > calling open has been added to the abort of an unmap operation, for now.
> 
> As currently implemented, this is not a valid optimization because it
> violates the (unwritten?) rule that you must not call free_pgd_range()
> on a region in the page tables which can concurrently be walked. A
> region in the page tables can be concurrently walked if it overlaps a
> VMA which is linked into rmaps which are not write-locked.

Just for clarity, this is the rmap write lock.

> 
> On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
> the new mapping is created by expanding an adjacent VMA, the following
> race with an ftruncate() is possible (because page tables for the old
> mapping are removed while the new VMA in the same location is already
> fully set up and linked into the rmap):
> 
> 
> task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
> ========================     ==================
> mmap_region
>   vma_merge_new_range
>     vma_expand
>       commit_merge
>         vma_prepare
>           [take rmap locks]
>         vma_set_range
>           [expand adjacent mapping]
>         vma_complete
>           [drop rmap locks]
>   vms_complete_munmap_vmas
>     vms_clear_ptes
>       unmap_vmas
>         [removes ptes]
>       free_pgtables
>         [unlinks old vma from rmap]
>                              unmap_mapping_range
>                                unmap_mapping_pages
>                                  i_mmap_lock_read
>                                  unmap_mapping_range_tree
>                                    [loop]
>                                      unmap_mapping_range_vma
>                                        zap_page_range_single
>                                          unmap_single_vma
>                                            unmap_page_range
>                                              zap_p4d_range
>                                                zap_pud_range
>                                                  zap_pmd_range
>                                                    [looks up pmd entry]
>         free_pgd_range
>           [frees pmd]
>                                                    [UAF pmd entry access]
> 
> To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> to widen the race windows, then build and run the attached reproducer
> mmap-fixed-race.c.
> 
> Under a kernel with KASAN, you should ideally get a KASAN splat like this:

Thanks for all the work you did finding the root cause here, I
appreciate it.

I think the correct fix is to take the rmap lock on free_pgtables, when
necessary.  There are a few code paths (error recovery) that are not
regularly run that will also need to change.
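
For illustration, the file-backed side of that could look roughly like the
sketch below.  The wrapper name is made up, and the trailing free_pgtables()
arguments are written from memory of the current signature (only the first
two are visible in the race-widening diff above), so treat this as showing
where the i_mmap lock would sit rather than as the actual fix:

static void vms_free_pgtables_locked(struct mmu_gather *tlb,
				     struct ma_state *mas,
				     struct vm_area_struct *vma,
				     unsigned long floor, unsigned long ceiling,
				     bool mm_wr_locked)
{
	/* Hypothetical wrapper, only to show the locking placement. */
	struct address_space *mapping =
		vma->vm_file ? vma->vm_file->f_mapping : NULL;

	if (mapping)
		i_mmap_lock_write(mapping);	/* block concurrent rmap walks */

	free_pgtables(tlb, mas, vma, floor, ceiling, mm_wr_locked);

	if (mapping)
		i_mmap_unlock_write(mapping);
}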

Regards,
Liam



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
  2024-10-07 20:31     ` Liam R. Howlett
@ 2024-10-07 21:31       ` Jann Horn
  2024-10-08  1:50         ` Liam R. Howlett
  0 siblings, 1 reply; 36+ messages in thread
From: Jann Horn @ 2024-10-07 21:31 UTC (permalink / raw)
  To: Liam R. Howlett, Jann Horn, Andrew Morton, Lorenzo Stoakes,
	Linux-MM, kernel list, Suren Baghdasaryan, Matthew Wilcox,
	Vlastimil Babka, Sidhartha Kumar, Bert Karwatzki, Jiri Olsa,
	Kees Cook, Paul E . McKenney, Jeff Xu, Seth Jenkins

On Mon, Oct 7, 2024 at 10:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> * Jann Horn <jannh@google.com> [241007 15:06]:
> > On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > Instead of zeroing the vma tree and then overwriting the area, let the
> > > area be overwritten and then clean up the gathered vmas using
> > > vms_complete_munmap_vmas().
> > >
> > > To ensure locking is downgraded correctly, the mm is set regardless of
> > > MAP_FIXED or not (NULL vma).
> > >
> > > If a driver is mapping over an existing vma, then clear the ptes before
> > > the call_mmap() invocation.  This is done using the vms_clean_up_area()
> > > helper.  If there is a close vm_ops, that must also be called to ensure
> > > any cleanup is done before mapping over the area.  This also means that
> > > calling open has been added to the abort of an unmap operation, for now.
> >
> > As currently implemented, this is not a valid optimization because it
> > violates the (unwritten?) rule that you must not call free_pgd_range()
> > on a region in the page tables which can concurrently be walked. A
> > region in the page tables can be concurrently walked if it overlaps a
> > VMA which is linked into rmaps which are not write-locked.
>
> Just for clarity, this is the rmap write lock.

Ah, yes.

> > On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
> > the new mapping is created by expanding an adjacent VMA, the following
> > race with an ftruncate() is possible (because page tables for the old
> > mapping are removed while the new VMA in the same location is already
> > fully set up and linked into the rmap):
> >
> >
> > task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
> > ========================     ==================
> > mmap_region
> >   vma_merge_new_range
> >     vma_expand
> >       commit_merge
> >         vma_prepare
> >           [take rmap locks]
> >         vma_set_range
> >           [expand adjacent mapping]
> >         vma_complete
> >           [drop rmap locks]
> >   vms_complete_munmap_vmas
> >     vms_clear_ptes
> >       unmap_vmas
> >         [removes ptes]
> >       free_pgtables
> >         [unlinks old vma from rmap]
> >                              unmap_mapping_range
> >                                unmap_mapping_pages
> >                                  i_mmap_lock_read
> >                                  unmap_mapping_range_tree
> >                                    [loop]
> >                                      unmap_mapping_range_vma
> >                                        zap_page_range_single
> >                                          unmap_single_vma
> >                                            unmap_page_range
> >                                              zap_p4d_range
> >                                                zap_pud_range
> >                                                  zap_pmd_range
> >                                                    [looks up pmd entry]
> >         free_pgd_range
> >           [frees pmd]
> >                                                    [UAF pmd entry access]
> >
> > To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> > to widen the race windows, then build and run the attached reproducer
> > mmap-fixed-race.c.
> >
> > Under a kernel with KASAN, you should ideally get a KASAN splat like this:
>
> Thanks for all the work you did finding the root cause here, I
> appreciate it.

Ah, this is not a bug I ran into while testing, it's a bug I found
while reading the patch. It's much easier to explain the issue and
come up with a nice reproducer this way than when you start out from a
crash. :P

> I think the correct fix is to take the rmap lock on free_pgtables, when
> necessary.  There are a few code paths (error recovery) that are not
> regularly run that will also need to change.

Hmm, yes, I guess that might work. Though I think there might be more
races: One related aspect of this optimization that is unintuitive to
me is that, directly after vma_merge_new_range(), a concurrent rmap
walk could probably be walking the newly-extended VMA but still
observe PTEs belonging to the previous VMA. I don't know how robust
the various rmap walks are to things like encountering pfnmap PTEs in
non-pfnmap VMAs, or hugetlb PUD entries in non-hugetlb VMAs. For
example, page_vma_mapped_walk() looks like, if you called it on a page
table range with huge PUD entries, but with a VMA without VM_HUGETLB,
something might go wrong on the "pmd_offset(pud, pvmw->address)" call,
and a 1G hugepage might get misinterpreted as a page table? But I
haven't experimentally verified that.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
  2024-10-07 21:31       ` Jann Horn
@ 2024-10-08  1:50         ` Liam R. Howlett
  2024-10-08 17:15           ` Jann Horn
  0 siblings, 1 reply; 36+ messages in thread
From: Liam R. Howlett @ 2024-10-08  1:50 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Lorenzo Stoakes, Linux-MM, kernel list,
	Suren Baghdasaryan, Matthew Wilcox, Vlastimil Babka,
	Sidhartha Kumar, Bert Karwatzki, Jiri Olsa, Kees Cook,
	Paul E . McKenney, Jeff Xu, Seth Jenkins

* Jann Horn <jannh@google.com> [241007 17:31]:
> On Mon, Oct 7, 2024 at 10:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > * Jann Horn <jannh@google.com> [241007 15:06]:
> > > On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > Instead of zeroing the vma tree and then overwriting the area, let the
> > > > area be overwritten and then clean up the gathered vmas using
> > > > vms_complete_munmap_vmas().
> > > >
> > > > To ensure locking is downgraded correctly, the mm is set regardless of
> > > > MAP_FIXED or not (NULL vma).
> > > >
> > > > If a driver is mapping over an existing vma, then clear the ptes before
> > > > the call_mmap() invocation.  This is done using the vms_clean_up_area()
> > > > helper.  If there is a close vm_ops, that must also be called to ensure
> > > > any cleanup is done before mapping over the area.  This also means that
> > > > calling open has been added to the abort of an unmap operation, for now.
> > >
> > > As currently implemented, this is not a valid optimization because it
> > > violates the (unwritten?) rule that you must not call free_pgd_range()
> > > on a region in the page tables which can concurrently be walked. A
> > > region in the page tables can be concurrently walked if it overlaps a
> > > VMA which is linked into rmaps which are not write-locked.
> >
> > Just for clarity, this is the rmap write lock.
> 
> Ah, yes.
> 
> > > On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
> > > the new mapping is created by expanding an adjacent VMA, the following
> > > race with an ftruncate() is possible (because page tables for the old
> > > mapping are removed while the new VMA in the same location is already
> > > fully set up and linked into the rmap):
> > >
> > >
> > > task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
> > > ========================     ==================
> > > mmap_region
> > >   vma_merge_new_range
> > >     vma_expand
> > >       commit_merge
> > >         vma_prepare
> > >           [take rmap locks]
> > >         vma_set_range
> > >           [expand adjacent mapping]
> > >         vma_complete
> > >           [drop rmap locks]
> > >   vms_complete_munmap_vmas
> > >     vms_clear_ptes
> > >       unmap_vmas
> > >         [removes ptes]
> > >       free_pgtables
> > >         [unlinks old vma from rmap]
> > >                              unmap_mapping_range
> > >                                unmap_mapping_pages
> > >                                  i_mmap_lock_read
> > >                                  unmap_mapping_range_tree
> > >                                    [loop]
> > >                                      unmap_mapping_range_vma
> > >                                        zap_page_range_single
> > >                                          unmap_single_vma
> > >                                            unmap_page_range
> > >                                              zap_p4d_range
> > >                                                zap_pud_range
> > >                                                  zap_pmd_range
> > >                                                    [looks up pmd entry]
> > >         free_pgd_range
> > >           [frees pmd]
> > >                                                    [UAF pmd entry access]
> > >
> > > To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> > > to widen the race windows, then build and run the attached reproducer
> > > mmap-fixed-race.c.
> > >
> > > Under a kernel with KASAN, you should ideally get a KASAN splat like this:
> >
> > Thanks for all the work you did finding the root cause here, I
> > appreciate it.
> 
> Ah, this is not a bug I ran into while testing, it's a bug I found
> while reading the patch. It's much easier to explain the issue and
> come up with a nice reproducer this way than when you start out from a
> crash. :P
> 
> > I think the correct fix is to take the rmap lock on free_pgtables, when
> > necessary.  There are a few code paths (error recovery) that are not
> > regularly run that will also need to change.
> 
> Hmm, yes, I guess that might work. Though I think there might be more
> races: One related aspect of this optimization that is unintuitive to
> me is that, directly after vma_merge_new_range(), a concurrent rmap
> walk could probably be walking the newly-extended VMA but still
> observe PTEs belonging to the previous VMA. I don't know how robust
> the various rmap walks are to things like encountering pfnmap PTEs in
> non-pfnmap VMAs, or hugetlb PUD entries in non-hugetlb VMAs. For
> example, page_vma_mapped_walk() looks like, if you called it on a page
> table range with huge PUD entries, but with a VMA without VM_HUGETLB,
> something might go wrong on the "pmd_offset(pud, pvmw->address)" call,
> and a 1G hugepage might get misinterpreted as a page table? But I
> haven't experimentally verified that.

Yes, I am also concerned that reacquiring the lock will result in
another race.  I also don't think holding the lock for longer is a good
idea as it will most likely cause a regression by extending the lock for
the duration of the mmap() setup.  Although, maybe it would be fine if
we only keep it held if we are going to be removing a vma in the
MAP_FIXED case.

Another idea would be to change the pte to know if a vma is being
modified using the per-vma locking, but I'm not sure what action to take
if we detect the vma is being modified to avoid the issue.  This would
also need to be done to all walkers (or low enough in the stack).

By the way, this isn't an optimisation; this is to fix RCU walkers of
the vma tree seeing a hole between the underlying implementation of the
MAP_FIXED operations of munmap() and mmap().  This is needed for things
like the /proc/{pid}/maps rcu walker.  The page tables currently fall
back to the old way of locking if a hole is seen (and sane applications
shouldn't really be page faulting something being removed anyways..)
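
For reference, that fallback has the same shape as the per-VMA lock path in
the arch fault handlers: try the lockless lookup first, and retreat to the
mmap lock when it does not produce a usable vma.  A condensed sketch (the
function name is made up and most error handling is dropped):

static vm_fault_t fault_with_fallback(struct mm_struct *mm,
				      unsigned long address,
				      unsigned int flags,
				      struct pt_regs *regs)
{
	struct vm_area_struct *vma;
	vm_fault_t fault;

	/* Lockless attempt: RCU walk of the vma tree + per-VMA read lock. */
	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		fault = handle_mm_fault(vma, address,
					flags | FAULT_FLAG_VMA_LOCK, regs);
		if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
			vma_end_read(vma);
		if (!(fault & VM_FAULT_RETRY))
			return fault;
	}

	/* Fall back to the old way of locking. */
	mmap_read_lock(mm);
	vma = vma_lookup(mm, address);	/* stack expansion etc. omitted */
	fault = vma ? handle_mm_fault(vma, address, flags, regs)
		    : VM_FAULT_SIGSEGV;
	/* handle_mm_fault() drops mmap_lock itself on RETRY/COMPLETED. */
	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
		mmap_read_unlock(mm);
	return fault;
}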

Thanks,
Liam



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
  2024-10-08  1:50         ` Liam R. Howlett
@ 2024-10-08 17:15           ` Jann Horn
  2024-10-08 17:51             ` Suren Baghdasaryan
  2024-10-11 14:26             ` Liam R. Howlett
  0 siblings, 2 replies; 36+ messages in thread
From: Jann Horn @ 2024-10-08 17:15 UTC (permalink / raw)
  To: Liam R. Howlett, Jann Horn, Andrew Morton, Lorenzo Stoakes,
	Linux-MM, kernel list, Suren Baghdasaryan, Matthew Wilcox,
	Vlastimil Babka, Sidhartha Kumar, Bert Karwatzki, Jiri Olsa,
	Kees Cook, Paul E . McKenney, Jeff Xu, Seth Jenkins

On Tue, Oct 8, 2024 at 3:51 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> * Jann Horn <jannh@google.com> [241007 17:31]:
> > On Mon, Oct 7, 2024 at 10:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > * Jann Horn <jannh@google.com> [241007 15:06]:
> > > > On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > > Instead of zeroing the vma tree and then overwriting the area, let the
> > > > > area be overwritten and then clean up the gathered vmas using
> > > > > vms_complete_munmap_vmas().
> > > > >
> > > > > To ensure locking is downgraded correctly, the mm is set regardless of
> > > > > MAP_FIXED or not (NULL vma).
> > > > >
> > > > > If a driver is mapping over an existing vma, then clear the ptes before
> > > > > the call_mmap() invocation.  This is done using the vms_clean_up_area()
> > > > > helper.  If there is a close vm_ops, that must also be called to ensure
> > > > > any cleanup is done before mapping over the area.  This also means that
> > > > > calling open has been added to the abort of an unmap operation, for now.
> > > >
> > > > As currently implemented, this is not a valid optimization because it
> > > > violates the (unwritten?) rule that you must not call free_pgd_range()
> > > > on a region in the page tables which can concurrently be walked. A
> > > > region in the page tables can be concurrently walked if it overlaps a
> > > > VMA which is linked into rmaps which are not write-locked.
> > >
> > > Just for clarity, this is the rmap write lock.
> >
> > Ah, yes.
> >
> > > > On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
> > > > the new mapping is created by expanding an adjacent VMA, the following
> > > > race with an ftruncate() is possible (because page tables for the old
> > > > mapping are removed while the new VMA in the same location is already
> > > > fully set up and linked into the rmap):
> > > >
> > > >
> > > > task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
> > > > ========================     ==================
> > > > mmap_region
> > > >   vma_merge_new_range
> > > >     vma_expand
> > > >       commit_merge
> > > >         vma_prepare
> > > >           [take rmap locks]
> > > >         vma_set_range
> > > >           [expand adjacent mapping]
> > > >         vma_complete
> > > >           [drop rmap locks]
> > > >   vms_complete_munmap_vmas
> > > >     vms_clear_ptes
> > > >       unmap_vmas
> > > >         [removes ptes]
> > > >       free_pgtables
> > > >         [unlinks old vma from rmap]
> > > >                              unmap_mapping_range
> > > >                                unmap_mapping_pages
> > > >                                  i_mmap_lock_read
> > > >                                  unmap_mapping_range_tree
> > > >                                    [loop]
> > > >                                      unmap_mapping_range_vma
> > > >                                        zap_page_range_single
> > > >                                          unmap_single_vma
> > > >                                            unmap_page_range
> > > >                                              zap_p4d_range
> > > >                                                zap_pud_range
> > > >                                                  zap_pmd_range
> > > >                                                    [looks up pmd entry]
> > > >         free_pgd_range
> > > >           [frees pmd]
> > > >                                                    [UAF pmd entry access]
> > > >
> > > > To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> > > > to widen the race windows, then build and run the attached reproducer
> > > > mmap-fixed-race.c.
> > > >
> > > > Under a kernel with KASAN, you should ideally get a KASAN splat like this:
> > >
> > > Thanks for all the work you did finding the root cause here, I
> > > appreciate it.
> >
> > Ah, this is not a bug I ran into while testing, it's a bug I found
> > while reading the patch. It's much easier to explain the issue and
> > come up with a nice reproducer this way than when you start out from a
> > crash. :P
> >
> > > I think the correct fix is to take the rmap lock on free_pgtables, when
> > > necessary.  There are a few code paths (error recovery) that are not
> > > regularly run that will also need to change.
> >
> > Hmm, yes, I guess that might work. Though I think there might be more
> > races: One related aspect of this optimization that is unintuitive to
> > me is that, directly after vma_merge_new_range(), a concurrent rmap
> > walk could probably be walking the newly-extended VMA but still
> > observe PTEs belonging to the previous VMA. I don't know how robust
> > the various rmap walks are to things like encountering pfnmap PTEs in
> > non-pfnmap VMAs, or hugetlb PUD entries in non-hugetlb VMAs. For
> > example, page_vma_mapped_walk() looks like, if you called it on a page
> > table range with huge PUD entries, but with a VMA without VM_HUGETLB,
> > something might go wrong on the "pmd_offset(pud, pvmw->address)" call,
> > and a 1G hugepage might get misinterpreted as a page table? But I
> > haven't experimentally verified that.
>
> Yes, I am also concerned that reacquiring the lock will result in
> another race.  I also don't think holding the lock for longer is a good
> idea as it will most likely cause a regression by extending the lock for
> the duration of the mmap() setup.  Although, maybe it would be fine if
> we only keep it held if we are going to be removing a vma in the
> MAP_FIXED case.

I guess you could have a separate seqcount on the mm for "there might
be temporary holes in the tree", or temporarily store some markers in
the maple tree that say "there is nothing here right now, but if you
want to see a full picture while iterating, you have to try again
later"?

Or you could basically unmap the VMA while it is still in the VMA tree
but is already locked and marked as detached? So first you do
unmap_vmas() and free_pgtables() (which clears the PTEs, removes the
rmap links, and deletes page tables), then prepare the new VMAs, and
then replace the old VMA's entries in the VMA tree with the new
entries? I guess in the end the result would semantically be pretty
similar to having markers in the maple tree.

> Another idea would be to change the pte to know if a vma is being
> modified using the per-vma locking, but I'm not sure what action to take
> if we detect the vma is being modified to avoid the issue.  This would
> also need to be done to all walkers (or low enough in the stack).

Sorry, can you clarify what pte the "change the pte" is referring to?

> By the way, this isn't an optimisation; this is to fix RCU walkers of
> the vma tree seeing a hole between the underlying implementation of the
> MAP_FIXED operations of munmap() and mmap().  This is needed for things
> like the /proc/{pid}/maps rcu walker.  The page tables currently fall
> back to the old way of locking if a hole is seen (and sane applications
> shouldn't really be page faulting something being removed anyways..)

Is that code in a tree somewhere?

What locking will those RCU walkers use when accessing VMAs? I guess
they probably anyway have to take the VMA locks to ensure they see
consistent state, though I guess with enough effort you could avoid it
(seqlock-style) on some fastpaths when the vma is not concurrently
modified and the fastpath doesn't need access to the VMA's file?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
  2024-10-08 17:15           ` Jann Horn
@ 2024-10-08 17:51             ` Suren Baghdasaryan
  2024-10-08 18:06               ` Jann Horn
  2024-10-11 14:26             ` Liam R. Howlett
  1 sibling, 1 reply; 36+ messages in thread
From: Suren Baghdasaryan @ 2024-10-08 17:51 UTC (permalink / raw)
  To: Jann Horn
  Cc: Liam R. Howlett, Andrew Morton, Lorenzo Stoakes, Linux-MM,
	kernel list, Matthew Wilcox, Vlastimil Babka, Sidhartha Kumar,
	Bert Karwatzki, Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu,
	Seth Jenkins

On Tue, Oct 8, 2024 at 10:16 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Oct 8, 2024 at 3:51 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > * Jann Horn <jannh@google.com> [241007 17:31]:
> > > On Mon, Oct 7, 2024 at 10:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > * Jann Horn <jannh@google.com> [241007 15:06]:
> > > > > On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > > > Instead of zeroing the vma tree and then overwriting the area, let the
> > > > > > area be overwritten and then clean up the gathered vmas using
> > > > > > vms_complete_munmap_vmas().
> > > > > >
> > > > > > To ensure locking is downgraded correctly, the mm is set regardless of
> > > > > > MAP_FIXED or not (NULL vma).
> > > > > >
> > > > > > If a driver is mapping over an existing vma, then clear the ptes before
> > > > > > the call_mmap() invocation.  This is done using the vms_clean_up_area()
> > > > > > helper.  If there is a close vm_ops, that must also be called to ensure
> > > > > > any cleanup is done before mapping over the area.  This also means that
> > > > > > calling open has been added to the abort of an unmap operation, for now.
> > > > >
> > > > > As currently implemented, this is not a valid optimization because it
> > > > > violates the (unwritten?) rule that you must not call free_pgd_range()
> > > > > on a region in the page tables which can concurrently be walked. A
> > > > > region in the page tables can be concurrently walked if it overlaps a
> > > > > VMA which is linked into rmaps which are not write-locked.
> > > >
> > > > Just for clarity, this is the rmap write lock.
> > >
> > > Ah, yes.
> > >
> > > > > On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
> > > > > the new mapping is created by expanding an adjacent VMA, the following
> > > > > race with an ftruncate() is possible (because page tables for the old
> > > > > mapping are removed while the new VMA in the same location is already
> > > > > fully set up and linked into the rmap):
> > > > >
> > > > >
> > > > > task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
> > > > > ========================     ==================
> > > > > mmap_region
> > > > >   vma_merge_new_range
> > > > >     vma_expand
> > > > >       commit_merge
> > > > >         vma_prepare
> > > > >           [take rmap locks]
> > > > >         vma_set_range
> > > > >           [expand adjacent mapping]
> > > > >         vma_complete
> > > > >           [drop rmap locks]
> > > > >   vms_complete_munmap_vmas
> > > > >     vms_clear_ptes
> > > > >       unmap_vmas
> > > > >         [removes ptes]
> > > > >       free_pgtables
> > > > >         [unlinks old vma from rmap]
> > > > >                              unmap_mapping_range
> > > > >                                unmap_mapping_pages
> > > > >                                  i_mmap_lock_read
> > > > >                                  unmap_mapping_range_tree
> > > > >                                    [loop]
> > > > >                                      unmap_mapping_range_vma
> > > > >                                        zap_page_range_single
> > > > >                                          unmap_single_vma
> > > > >                                            unmap_page_range
> > > > >                                              zap_p4d_range
> > > > >                                                zap_pud_range
> > > > >                                                  zap_pmd_range
> > > > >                                                    [looks up pmd entry]
> > > > >         free_pgd_range
> > > > >           [frees pmd]
> > > > >                                                    [UAF pmd entry access]
> > > > >
> > > > > To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> > > > > to widen the race windows, then build and run the attached reproducer
> > > > > mmap-fixed-race.c.
> > > > >
> > > > > Under a kernel with KASAN, you should ideally get a KASAN splat like this:
> > > >
> > > > Thanks for all the work you did finding the root cause here, I
> > > > appreciate it.
> > >
> > > Ah, this is not a bug I ran into while testing, it's a bug I found
> > > while reading the patch. It's much easier to explain the issue and
> > > come up with a nice reproducer this way than when you start out from a
> > > crash. :P
> > >
> > > > I think the correct fix is to take the rmap lock on free_pgtables, when
> > > > necessary.  There are a few code paths (error recovery) that are not
> > > > regularly run that will also need to change.
> > >
> > > Hmm, yes, I guess that might work. Though I think there might be more
> > > races: One related aspect of this optimization that is unintuitive to
> > > me is that, directly after vma_merge_new_range(), a concurrent rmap
> > > walk could probably be walking the newly-extended VMA but still
> > > observe PTEs belonging to the previous VMA. I don't know how robust
> > > the various rmap walks are to things like encountering pfnmap PTEs in
> > > non-pfnmap VMAs, or hugetlb PUD entries in non-hugetlb VMAs. For
> > > example, page_vma_mapped_walk() looks like, if you called it on a page
> > > table range with huge PUD entries, but with a VMA without VM_HUGETLB,
> > > something might go wrong on the "pmd_offset(pud, pvmw->address)" call,
> > > and a 1G hugepage might get misinterpreted as a page table? But I
> > > haven't experimentally verified that.
> >
> > Yes, I am also concerned that reacquiring the lock will result in
> > another race.  I also don't think holding the lock for longer is a good
> > idea as it will most likely cause a regression by extending the lock for
> > the duration of the mmap() setup.  Although, maybe it would be fine if
> > we only keep it held if we are going to be removing a vma in the
> > MAP_FIXED case.
>
> I guess you could have a separate seqcount on the mm for "there might
> be temporary holes in the tree", or temporarily store some markers in
> the maple tree that say "there is nothing here right now, but if you
> want to see a full picture while iterating, you have to try again
> later"?
>
> Or you could basically unmap the VMA while it is still in the VMA tree
> but is already locked and marked as detached? So first you do
> unmap_vmas() and free_pgtables() (which clears the PTEs, removes the
> rmap links, and deletes page tables), then prepare the new VMAs, and
> then replace the old VMA's entries in the VMA tree with the new
> entries? I guess in the end the result would semantically be pretty
> similar to having markers in the maple tree.
>
> > Another idea would be to change the pte to know if a vma is being
> > modified using the per-vma locking, but I'm not sure what action to take
> > if we detect the vma is being modified to avoid the issue.  This would
> > also need to be done to all walkers (or low enough in the stack).
>
> Sorry, can you clarify what pte the "change the pte" is referring to?
>
> > By the way, this isn't an optimisation; this is to fix RCU walkers of
> > the vma tree seeing a hole between the underlying implementation of the
> > MAP_FIXED operations of munmap() and mmap().  This is needed for things
> > like the /proc/{pid}/maps rcu walker.  The page tables currently fall
> > back to the old way of locking if a hole is seen (and sane applications
> > shouldn't really be page faulting something being removed anyways..)
>
> Is that code in a tree somewhere?
>
> What locking will those RCU walkers use when accessing VMAs? I guess
> they probably anyway have to take the VMA locks to ensure they see
> consistent state, though I guess with enough effort you could avoid it
> (seqlock-style) on some fastpaths when the vma is not concurrently
> modified and the fastpath doesn't need access to the VMA's file?

Sorry, it's not posted upstream yet but yes, the idea is to walk the
tree under RCU and detect concurrent changes using seq-counters. A
prototype was posted here:
https://lore.kernel.org/all/20240123231014.3801041-3-surenb@google.com/
but it had some issues I'm yet to resolve.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
  2024-10-08 17:51             ` Suren Baghdasaryan
@ 2024-10-08 18:06               ` Jann Horn
  0 siblings, 0 replies; 36+ messages in thread
From: Jann Horn @ 2024-10-08 18:06 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Andrew Morton, Lorenzo Stoakes, Linux-MM,
	kernel list, Matthew Wilcox, Vlastimil Babka, Sidhartha Kumar,
	Bert Karwatzki, Jiri Olsa, Kees Cook, Paul E . McKenney, Jeff Xu,
	Seth Jenkins

On Tue, Oct 8, 2024 at 7:52 PM Suren Baghdasaryan <surenb@google.com> wrote:
> On Tue, Oct 8, 2024 at 10:16 AM Jann Horn <jannh@google.com> wrote:
> > Is that code in a tree somewhere?
> >
> > What locking will those RCU walkers use when accessing VMAs? I guess
> > they probably anyway have to take the VMA locks to ensure they see
> > consistent state, though I guess with enough effort you could avoid it
> > (seqlock-style) on some fastpaths when the vma is not concurrently
> > modified and the fastpath doesn't need access to the VMA's file?
>
> Sorry, it's not posted upstream yet but yes, the idea is to walk the
> tree under RCU and detect concurrent changes using seq-counters. A
> prototype was posted here:
> https://lore.kernel.org/all/20240123231014.3801041-3-surenb@google.com/
> but it had some issues I'm yet to resolve.

Ah, thanks for the pointer.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
  2024-10-08 17:15           ` Jann Horn
  2024-10-08 17:51             ` Suren Baghdasaryan
@ 2024-10-11 14:26             ` Liam R. Howlett
  1 sibling, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2024-10-11 14:26 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Lorenzo Stoakes, Linux-MM, kernel list,
	Suren Baghdasaryan, Matthew Wilcox, Vlastimil Babka,
	Sidhartha Kumar, Bert Karwatzki, Jiri Olsa, Kees Cook,
	Paul E . McKenney, Jeff Xu, Seth Jenkins

* Jann Horn <jannh@google.com> [241008 13:16]:
...
> > > > >
> > > > > task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
> > > > > ========================     ==================
> > > > > mmap_region
> > > > >   vma_merge_new_range
> > > > >     vma_expand
> > > > >       commit_merge
> > > > >         vma_prepare
> > > > >           [take rmap locks]
> > > > >         vma_set_range
> > > > >           [expand adjacent mapping]
> > > > >         vma_complete
> > > > >           [drop rmap locks]
> > > > >   vms_complete_munmap_vmas
> > > > >     vms_clear_ptes
> > > > >       unmap_vmas
> > > > >         [removes ptes]
> > > > >       free_pgtables
> > > > >         [unlinks old vma from rmap]
> > > > >                              unmap_mapping_range
> > > > >                                unmap_mapping_pages
> > > > >                                  i_mmap_lock_read
> > > > >                                  unmap_mapping_range_tree
> > > > >                                    [loop]
> > > > >                                      unmap_mapping_range_vma
> > > > >                                        zap_page_range_single
> > > > >                                          unmap_single_vma
> > > > >                                            unmap_page_range
> > > > >                                              zap_p4d_range
> > > > >                                                zap_pud_range
> > > > >                                                  zap_pmd_range
> > > > >                                                    [looks up pmd entry]
> > > > >         free_pgd_range
> > > > >           [frees pmd]
> > > > >                                                    [UAF pmd entry access]
> > > > >
> > > > > To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> > > > > to widen the race windows, then build and run the attached reproducer
> > > > > mmap-fixed-race.c.
> > > > >
> > > > > Under a kernel with KASAN, you should ideally get a KASAN splat like this:
> > > >
...

> 
> Or you could basically unmap the VMA while it is still in the VMA tree
> but is already locked and marked as detached? So first you do
> unmap_vmas() and free_pgtables() (which clears the PTEs, removes the
> rmap links, and deletes page tables), then prepare the new VMAs, and
> then replace the old VMA's entries in the VMA tree with the new
> entries? I guess in the end the result would semantically be pretty
> similar to having markers in the maple tree.
> 

After trying a few other methods, I ended up doing something like you
said above.  I already had to do this if call_mmap() was to be used, so
the code change isn't that large.  Doing it unconditionally on MAP_FIXED
seems like the safest plan.

The other methods were unsuccessful due to the locking order that exists
in fsreclaim and other areas.

Basically, the vma tree will not see a gap, but the rmap will see a gap.
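
As a rough outline of the resulting MAP_FIXED ordering (the helper names
are the ones from this series, but the exact call sites here are a sketch,
not the final patch):

	/*
	 * vms_gather_munmap_vmas()   - split and collect the vmas being replaced
	 * vms_clean_up_area()        - clear ptes, free page tables, call ->close();
	 *                              the old vmas stay in the vma tree (locked and
	 *                              marked detached), so only the rmap sees a gap
	 * vma_merge_new_range()/...  - build or expand the replacement vma and
	 *                              store it over the old range in the tree
	 * vms_complete_munmap_vmas() - finally drop and free the gathered vmas
	 */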

Unfortunately this expands the number of failure paths that cannot be
undone with my design, though there are still fewer than existed before.
Most errors will, sadly, generate the historic vma gap.

Thanks,
Liam


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2024-10-11 14:26 UTC | newest]

Thread overview: 36+ messages
2024-08-30  4:00 [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 01/21] mm/vma: Correctly position vma_iterator in __split_vma() Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 02/21] mm/vma: Introduce abort_munmap_vmas() Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 03/21] mm/vma: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 04/21] mm/vma: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 05/21] mm/vma: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 06/21] mm/vma: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 07/21] mm/vma: Extract validate_mm() from vma_complete() Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 08/21] mm/vma: Inline munmap operation in mmap_region() Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 09/21] mm/vma: Expand mmap_region() munmap call Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 10/21] mm/vma: Support vma == NULL in init_vma_munmap() Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 11/21] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 12/21] mm/vma: Track start and end for munmap in vma_munmap_struct Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 13/21] mm: Clean up unmap_region() argument list Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
2024-10-07 19:05   ` [BUG] page table UAF, " Jann Horn
2024-10-07 20:31     ` Liam R. Howlett
2024-10-07 21:31       ` Jann Horn
2024-10-08  1:50         ` Liam R. Howlett
2024-10-08 17:15           ` Jann Horn
2024-10-08 17:51             ` Suren Baghdasaryan
2024-10-08 18:06               ` Jann Horn
2024-10-11 14:26             ` Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure Liam R. Howlett
2024-09-03  3:07   ` Pengfei Xu
2024-09-03 11:00     ` Lorenzo Stoakes
2024-09-03 12:27       ` Lorenzo Stoakes
2024-09-03 16:03         ` Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 16/21] mm/mmap: Use PHYS_PFN in mmap_region() Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 17/21] mm/mmap: Use vms accounted pages " Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 18/21] ipc/shm, mm: Drop do_vma_munmap() Liam R. Howlett
2024-08-30  4:00 ` [PATCH v8 19/21] mm: Move may_expand_vm() check in mmap_region() Liam R. Howlett
2024-08-30  4:01 ` [PATCH v8 20/21] mm/vma: Drop incorrect comment from vms_gather_munmap_vmas() Liam R. Howlett
2024-08-30  4:01 ` [PATCH v8 21/21] mm/vma.h: Optimise vma_munmap_struct Liam R. Howlett
2024-08-30 16:05 ` [PATCH v8 00/21] Avoid MAP_FIXED gap exposure Jeff Xu
2024-08-30 17:07   ` Liam R. Howlett
