* [PATCH v4 00/21] Avoid MAP_FIXED gap exposure
@ 2024-07-10 19:22 Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 01/21] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
` (20 more replies)
0 siblings, 21 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
It is now possible to walk the vma tree using the rcu read lock, and
doing so is beneficial for reducing lock contention. Doing so while a
MAP_FIXED mapping is executing means that a reader may see a gap in the
vma tree that should never logically exist - and that does not exist
when the mmap lock is taken in read mode. The temporal gap exists
because mmap_region() calls munmap() prior to installing the new
mapping.
This patch set stops rcu readers from seeing the temporal gap by
splitting up the munmap() function into two parts. The first part
prepares the vma tree for modifications by doing the necessary splits
and tracks the vmas marked for removal in a side tree. The second part
completes the munmapping of the vmas after the vma tree has been
overwritten (either by a MAP_FIXED replacement vma or by a NULL in the
munmap() case).
Please note that rcu walkers will still be able to see a temporary state
of split vmas that may be in the process of being removed, but the
temporal gap will not be exposed. vma_start_write() is called on both
parts of the split vma, so this state is detectable.
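As a rough sketch of the resulting flow (function names as introduced
later in the series; error handling, locking and accounting elided),
both munmap() and the MAP_FIXED path in mmap_region() end up doing
approximately the following:

	init_vma_munmap(&vms, &vmi, vma, start, end, uf, unlock);

	/* Part 1: split the boundary vmas and gather everything that
	 * will be removed into a side (detached) maple tree. */
	error = vms_gather_munmap_vmas(&vms, &mas_detach);
	if (error)
		return error;

	/* The vma tree is then overwritten: cleared for munmap(), or
	 * replaced with the new vma for MAP_FIXED, so rcu readers never
	 * observe a gap that should not exist. */

	/* Part 2: clear the ptes, update the counters and free the
	 * detached vmas. */
	vms_complete_munmap_vmas(&vms, &mas_detach);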
RFC: https://lore.kernel.org/linux-mm/20240531163217.1584450-1-Liam.Howlett@oracle.com/
v1: https://lore.kernel.org/linux-mm/20240611180200.711239-1-Liam.Howlett@oracle.com/
v2: https://lore.kernel.org/all/20240625191145.3382793-1-Liam.Howlett@oracle.com/
v3: https://lore.kernel.org/linux-mm/20240704182718.2653918-1-Liam.Howlett@oracle.com/
Changes since v3:
- Completely removing arch_unmap() from the kernel. PPC doesn't need
it and no one else uses it.
- Relocated checks for mseal'ed vmas so it is only checked when
necessary.
- Remove do_vma_munmap() and use do_vmi_align_munmap() in its place
- Added inclusive/exclusive comments for start/end of munmap
- Added comments for unmap_start/unmap_end to specify it is for PTEs
- Renamed "cleared_ptes" to "clear_ptes" and reversed the logic so that
it is now a flag to indicate that the ptes need to be cleared vs it
was done.
- Set the "clear_ptes" flag after a successful vms_gather_munmap_vmas()
- Rename vms_complete_pte_clear() to vms_clear_ptes() since it may
happen before the completion of the vms in the case of a driver
mmap'ing in mmap_region().
- Fixed comment around vms_clear_ptes() in mmap_region().
- Call init_vma_munmap() unconditionally in the mmap_region() case so
that all defaults are set in the struct, which means
init_vma_munmap() must support a NULL vma.
- Use ULONG_MAX as the limit in abort_munmap_vmas() for clarity
- Added a comment highlighting that the free_pgtables() call may use a
different start/end based on if there was a prev/next vma
- Removed incorrect comment about VM_ACCOUNT and mremap's move_vma()
- Relocated the mas_store_gfp() call in vms_gather_munmap_vmas() so that
it is clear that the accounting is okay.
- Skip validate_mm() in do_vmi_align_munmap() on gather failure as
vms_gather_munmap_vmas() already validates.
- Added R-b from Lorenzo, Suren, and Kees - Thanks!
Liam R. Howlett (21):
mm/mmap: Correctly position vma_iterator in __split_vma()
mm/mmap: Introduce abort_munmap_vmas()
mm/mmap: Introduce vmi_complete_munmap_vmas()
mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap()
mm/mmap: Introduce vma_munmap_struct for use in munmap operations
mm/mmap: Change munmap to use vma_munmap_struct() for accounting and
surrounding vmas
mm/mmap: Extract validate_mm() from vma_complete()
mm/mmap: Inline munmap operation in mmap_region()
mm/mmap: Expand mmap_region() munmap call
mm/mmap: Support vma == NULL in init_vma_munmap()
mm/mmap: Reposition vma iterator in mmap_region()
mm/mmap: Track start and end of munmap in vma_munmap_struct
mm/mmap: Clean up unmap_region() argument list
mm/mmap: Avoid zeroing vma tree in mmap_region()
mm/mmap: Use PHYS_PFN in mmap_region()
mm/mmap: Use vms accounted pages in mmap_region()
mm/mmap: Drop arch_unmap() call from all archs
mm/mmap: Move can_modify_mm() check down the stack
ipc/shm, mm: Drop do_vma_munmap()
mm/mmap: Move may_expand_vm() check in mmap_region()
mm/mmap: Drop incorrect comment from vms_gather_munmap_vmas()
arch/powerpc/include/asm/mmu_context.h | 9 -
arch/x86/include/asm/mmu_context.h | 5 -
include/asm-generic/mm_hooks.h | 11 +-
include/linux/mm.h | 6 +-
ipc/shm.c | 8 +-
mm/internal.h | 25 ++
mm/mmap.c | 545 ++++++++++++++-----------
7 files changed, 345 insertions(+), 264 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v4 01/21] mm/mmap: Correctly position vma_iterator in __split_vma()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 02/21] mm/mmap: Introduce abort_munmap_vmas() Liam R. Howlett
` (19 subsequent siblings)
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
The vma iterator may be left pointing to the newly created vma. This
happens when inserting the new vma at the end of the old vma
(!new_below).
The incorrect position in the vma iterator is not exposed currently
since the vma iterator is repositioned in the munmap path and is not
reused in any of the other paths.
This has limited impact in the current code, but is required for future
changes.
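As a rough illustration of the intended post-condition (derived from
the diff below, illustration only):

	__split_vma(vmi, vma, addr, new_below);
	/*
	 * new_below != 0:  [ new | original ] - vma_next() steps the
	 *                  iterator forward onto the original vma.
	 * new_below == 0:  [ original | new ] - vma_prev() now steps the
	 *                  iterator back onto the original vma.
	 * Either way the iterator is left pointing at the original,
	 * now-trimmed vma, matching the updated function comment.
	 */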
Fixes: b2b3b886738f ("mm: don't use __vma_adjust() in __split_vma()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
mm/mmap.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index e42d89f98071..28a46d9ddde0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2414,7 +2414,7 @@ static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
/*
* __split_vma() bypasses sysctl_max_map_count checking. We use this where it
* has already been checked or doesn't make sense to fail.
- * VMA Iterator will point to the end VMA.
+ * VMA Iterator will point to the original vma.
*/
static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long addr, int new_below)
@@ -2483,6 +2483,9 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
/* Success. */
if (new_below)
vma_next(vmi);
+ else
+ vma_prev(vmi);
+
return 0;
out_free_mpol:
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v4 02/21] mm/mmap: Introduce abort_munmap_vmas()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 01/21] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 03/21] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
` (18 subsequent siblings)
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, Lorenzo Stoakes
Extract the cleanup of failed munmap() operations from
do_vmi_align_munmap(). This simplifies later patches in the series.
It is worth noting that the mas_for_each() loop now has a different
upper limit. This should not change the number of vmas visited for
reattaching to the main vma tree (mm_mt), as all vmas are reattached in
both scenarios.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
mm/mmap.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 28a46d9ddde0..babfa50f1411 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2586,6 +2586,22 @@ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi,
vma->vm_userfaultfd_ctx, anon_vma_name(vma));
}
+/*
+ * abort_munmap_vmas - Undo any munmap work and free resources
+ *
+ * Reattach any detached vmas and free up the maple tree used to track the vmas.
+ */
+static inline void abort_munmap_vmas(struct ma_state *mas_detach)
+{
+ struct vm_area_struct *vma;
+
+ mas_set(mas_detach, 0);
+ mas_for_each(mas_detach, vma, ULONG_MAX)
+ vma_mark_detached(vma, false);
+
+ __mt_destroy(mas_detach->tree);
+}
+
/*
* do_vmi_align_munmap() - munmap the aligned region from @start to @end.
* @vmi: The vma iterator
@@ -2740,11 +2756,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
userfaultfd_error:
munmap_gather_failed:
end_split_failed:
- mas_set(&mas_detach, 0);
- mas_for_each(&mas_detach, next, end)
- vma_mark_detached(next, false);
-
- __mt_destroy(&mt_detach);
+ abort_munmap_vmas(&mas_detach);
start_split_failed:
map_count_exceeded:
validate_mm(mm);
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v4 03/21] mm/mmap: Introduce vmi_complete_munmap_vmas()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 01/21] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 02/21] mm/mmap: Introduce abort_munmap_vmas() Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 04/21] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
` (17 subsequent siblings)
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, Lorenzo Stoakes
Extract all necessary operations that need to be completed after the vma
maple tree is updated from a munmap() operation. Extracting this makes
the later patch in the series easier to understand.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
mm/mmap.c | 81 +++++++++++++++++++++++++++++++++++++------------------
1 file changed, 55 insertions(+), 26 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index babfa50f1411..bd3378935c70 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2602,6 +2602,58 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
__mt_destroy(mas_detach->tree);
}
+/*
+ * vmi_complete_munmap_vmas() - Finish the munmap() operation
+ * @vmi: The vma iterator
+ * @vma: The first vma to be munmapped
+ * @mm: The mm struct
+ * @start: The start address
+ * @end: The end address
+ * @unlock: Unlock the mm or not
+ * @mas_detach: the maple state of the detached vma maple tree
+ * @locked_vm: The locked_vm count in the detached vmas
+ *
+ * This function updates the mm_struct, unmaps the region, frees the resources
+ * used for the munmap() and may downgrade the lock - if requested. Everything
+ * needed to be done once the vma maple tree is updated.
+ */
+static void
+vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
+ struct mm_struct *mm, unsigned long start,
+ unsigned long end, bool unlock, struct ma_state *mas_detach,
+ unsigned long locked_vm)
+{
+ struct vm_area_struct *prev, *next;
+ int count;
+
+ count = mas_detach->index + 1;
+ mm->map_count -= count;
+ mm->locked_vm -= locked_vm;
+ if (unlock)
+ mmap_write_downgrade(mm);
+
+ prev = vma_iter_prev_range(vmi);
+ next = vma_next(vmi);
+ if (next)
+ vma_iter_prev_range(vmi);
+
+ /*
+ * We can free page tables without write-locking mmap_lock because VMAs
+ * were isolated before we downgraded mmap_lock.
+ */
+ mas_set(mas_detach, 1);
+ unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
+ !unlock);
+ /* Statistics and freeing VMAs */
+ mas_set(mas_detach, 0);
+ remove_mt(mm, mas_detach);
+ validate_mm(mm);
+ if (unlock)
+ mmap_read_unlock(mm);
+
+ __mt_destroy(mas_detach->tree);
+}
+
/*
* do_vmi_align_munmap() - munmap the aligned region from @start to @end.
* @vmi: The vma iterator
@@ -2621,7 +2673,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct mm_struct *mm, unsigned long start,
unsigned long end, struct list_head *uf, bool unlock)
{
- struct vm_area_struct *prev, *next = NULL;
+ struct vm_area_struct *next = NULL;
struct maple_tree mt_detach;
int count = 0;
int error = -ENOMEM;
@@ -2725,31 +2777,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
goto clear_tree_failed;
/* Point of no return */
- mm->locked_vm -= locked_vm;
- mm->map_count -= count;
- if (unlock)
- mmap_write_downgrade(mm);
-
- prev = vma_iter_prev_range(vmi);
- next = vma_next(vmi);
- if (next)
- vma_iter_prev_range(vmi);
-
- /*
- * We can free page tables without write-locking mmap_lock because VMAs
- * were isolated before we downgraded mmap_lock.
- */
- mas_set(&mas_detach, 1);
- unmap_region(mm, &mas_detach, vma, prev, next, start, end, count,
- !unlock);
- /* Statistics and freeing VMAs */
- mas_set(&mas_detach, 0);
- remove_mt(mm, &mas_detach);
- validate_mm(mm);
- if (unlock)
- mmap_read_unlock(mm);
-
- __mt_destroy(&mt_detach);
+ vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
+ locked_vm);
return 0;
clear_tree_failed:
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v4 04/21] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (2 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 03/21] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 05/21] mm/mmap: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
` (16 subsequent siblings)
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, Lorenzo Stoakes
Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a
detached maple tree for removal later. Part of the gathering is the
splitting of vmas that span the boundary.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 80 ++++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 58 insertions(+), 22 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index bd3378935c70..0d03fcf2ac0b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2655,32 +2655,30 @@ vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
}
/*
- * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
+ * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
+ * for removal at a later date. Handles splitting first and last if necessary
+ * and marking the vmas as isolated.
+ *
* @vmi: The vma iterator
* @vma: The starting vm_area_struct
* @mm: The mm_struct
* @start: The aligned start address to munmap.
* @end: The aligned end address to munmap.
* @uf: The userfaultfd list_head
- * @unlock: Set to true to drop the mmap_lock. unlocking only happens on
- * success.
+ * @mas_detach: The maple state tracking the detached tree
+ * @locked_vm: a pointer to store the VM_LOCKED pages count.
*
- * Return: 0 on success and drops the lock if so directed, error and leaves the
- * lock held otherwise.
+ * Return: 0 on success
*/
static int
-do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct mm_struct *mm, unsigned long start,
- unsigned long end, struct list_head *uf, bool unlock)
+ unsigned long end, struct list_head *uf,
+ struct ma_state *mas_detach, unsigned long *locked_vm)
{
struct vm_area_struct *next = NULL;
- struct maple_tree mt_detach;
int count = 0;
int error = -ENOMEM;
- unsigned long locked_vm = 0;
- MA_STATE(mas_detach, &mt_detach, 0, 0);
- mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
- mt_on_stack(mt_detach);
/*
* If we need to split any vma, do it now to save pain later.
@@ -2719,15 +2717,15 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
goto end_split_failed;
}
vma_start_write(next);
- mas_set(&mas_detach, count);
- error = mas_store_gfp(&mas_detach, next, GFP_KERNEL);
+ mas_set(mas_detach, count++);
+ error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
if (error)
goto munmap_gather_failed;
+
vma_mark_detached(next, true);
if (next->vm_flags & VM_LOCKED)
- locked_vm += vma_pages(next);
+ *locked_vm += vma_pages(next);
- count++;
if (unlikely(uf)) {
/*
* If userfaultfd_unmap_prep returns an error the vmas
@@ -2752,7 +2750,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
{
- MA_STATE(test, &mt_detach, 0, 0);
+ MA_STATE(test, mas_detach->tree, 0, 0);
struct vm_area_struct *vma_mas, *vma_test;
int test_count = 0;
@@ -2772,6 +2770,48 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
while (vma_iter_addr(vmi) > start)
vma_iter_prev_range(vmi);
+ return 0;
+
+userfaultfd_error:
+munmap_gather_failed:
+end_split_failed:
+ abort_munmap_vmas(mas_detach);
+start_split_failed:
+map_count_exceeded:
+ return error;
+}
+
+/*
+ * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
+ * @vmi: The vma iterator
+ * @vma: The starting vm_area_struct
+ * @mm: The mm_struct
+ * @start: The aligned start address to munmap.
+ * @end: The aligned end address to munmap.
+ * @uf: The userfaultfd list_head
+ * @unlock: Set to true to drop the mmap_lock. unlocking only happens on
+ * success.
+ *
+ * Return: 0 on success and drops the lock if so directed, error and leaves the
+ * lock held otherwise.
+ */
+static int
+do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+ struct mm_struct *mm, unsigned long start,
+ unsigned long end, struct list_head *uf, bool unlock)
+{
+ struct maple_tree mt_detach;
+ MA_STATE(mas_detach, &mt_detach, 0, 0);
+ mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+ mt_on_stack(mt_detach);
+ int error;
+ unsigned long locked_vm = 0;
+
+ error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
+ &mas_detach, &locked_vm);
+ if (error)
+ goto gather_failed;
+
error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
if (error)
goto clear_tree_failed;
@@ -2782,12 +2822,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
return 0;
clear_tree_failed:
-userfaultfd_error:
-munmap_gather_failed:
-end_split_failed:
abort_munmap_vmas(&mas_detach);
-start_split_failed:
-map_count_exceeded:
+gather_failed:
validate_mm(mm);
return error;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v4 05/21] mm/mmap: Introduce vma_munmap_struct for use in munmap operations
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (3 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 04/21] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 06/21] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
` (15 subsequent siblings)
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, Lorenzo Stoakes
Use a structure to pass along all the necessary information and counters
involved in removing vmas from the mm_struct.
Update the vmi_ function names to vms_ to indicate the change in the
type of the first argument.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/internal.h | 16 ++++++
mm/mmap.c | 138 ++++++++++++++++++++++++++------------------------
2 files changed, 89 insertions(+), 65 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 2ea9a88dcb95..43b3c99c77ba 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1481,6 +1481,22 @@ struct vma_prepare {
struct vm_area_struct *remove2;
};
+/*
+ * vma munmap operation
+ */
+struct vma_munmap_struct {
+ struct vma_iterator *vmi;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma; /* The first vma to munmap */
+ struct list_head *uf; /* Userfaultfd list_head */
+ unsigned long start; /* Aligned start addr (inclusive) */
+ unsigned long end; /* Aligned end addr (exclusive) */
+ int vma_count; /* Number of vmas that will be removed */
+ unsigned long nr_pages; /* Number of pages being removed */
+ unsigned long locked_vm; /* Number of locked pages */
+ bool unlock; /* Unlock after the munmap */
+};
+
void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid);
diff --git a/mm/mmap.c b/mm/mmap.c
index 0d03fcf2ac0b..1ed0720c38c5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -500,6 +500,31 @@ static inline void init_vma_prep(struct vma_prepare *vp,
init_multi_vma_prep(vp, vma, NULL, NULL, NULL);
}
+/*
+ * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
+ * @vms: The vma munmap struct
+ * @vmi: The vma iterator
+ * @vma: The first vm_area_struct to munmap
+ * @start: The aligned start address to munmap
+ * @end: The aligned end address to munmap
+ * @uf: The userfaultfd list_head
+ * @unlock: Unlock after the operation. Only unlocked on success
+ */
+static inline void init_vma_munmap(struct vma_munmap_struct *vms,
+ struct vma_iterator *vmi, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end, struct list_head *uf,
+ bool unlock)
+{
+ vms->vmi = vmi;
+ vms->vma = vma;
+ vms->mm = vma->vm_mm;
+ vms->start = start;
+ vms->end = end;
+ vms->unlock = unlock;
+ vms->uf = uf;
+ vms->vma_count = 0;
+ vms->nr_pages = vms->locked_vm = 0;
+}
/*
* vma_prepare() - Helper function for handling locking VMAs prior to altering
@@ -2603,81 +2628,63 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
}
/*
- * vmi_complete_munmap_vmas() - Finish the munmap() operation
- * @vmi: The vma iterator
- * @vma: The first vma to be munmapped
- * @mm: The mm struct
- * @start: The start address
- * @end: The end address
- * @unlock: Unlock the mm or not
- * @mas_detach: the maple state of the detached vma maple tree
- * @locked_vm: The locked_vm count in the detached vmas
+ * vms_complete_munmap_vmas() - Finish the munmap() operation
+ * @vms: The vma munmap struct
+ * @mas_detach: The maple state of the detached vmas
*
- * This function updates the mm_struct, unmaps the region, frees the resources
+ * This updates the mm_struct, unmaps the region, frees the resources
* used for the munmap() and may downgrade the lock - if requested. Everything
* needed to be done once the vma maple tree is updated.
*/
-static void
-vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
- struct mm_struct *mm, unsigned long start,
- unsigned long end, bool unlock, struct ma_state *mas_detach,
- unsigned long locked_vm)
+
+static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach)
{
struct vm_area_struct *prev, *next;
- int count;
+ struct mm_struct *mm;
- count = mas_detach->index + 1;
- mm->map_count -= count;
- mm->locked_vm -= locked_vm;
- if (unlock)
+ mm = vms->mm;
+ mm->map_count -= vms->vma_count;
+ mm->locked_vm -= vms->locked_vm;
+ if (vms->unlock)
mmap_write_downgrade(mm);
- prev = vma_iter_prev_range(vmi);
- next = vma_next(vmi);
+ prev = vma_iter_prev_range(vms->vmi);
+ next = vma_next(vms->vmi);
if (next)
- vma_iter_prev_range(vmi);
+ vma_iter_prev_range(vms->vmi);
/*
* We can free page tables without write-locking mmap_lock because VMAs
* were isolated before we downgraded mmap_lock.
*/
mas_set(mas_detach, 1);
- unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
- !unlock);
+ unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
+ vms->vma_count, !vms->unlock);
/* Statistics and freeing VMAs */
mas_set(mas_detach, 0);
remove_mt(mm, mas_detach);
validate_mm(mm);
- if (unlock)
+ if (vms->unlock)
mmap_read_unlock(mm);
__mt_destroy(mas_detach->tree);
}
/*
- * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
+ * vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
* for removal at a later date. Handles splitting first and last if necessary
* and marking the vmas as isolated.
*
- * @vmi: The vma iterator
- * @vma: The starting vm_area_struct
- * @mm: The mm_struct
- * @start: The aligned start address to munmap.
- * @end: The aligned end address to munmap.
- * @uf: The userfaultfd list_head
+ * @vms: The vma munmap struct
* @mas_detach: The maple state tracking the detached tree
- * @locked_vm: a pointer to store the VM_LOCKED pages count.
*
* Return: 0 on success
*/
-static int
-vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
- struct mm_struct *mm, unsigned long start,
- unsigned long end, struct list_head *uf,
- struct ma_state *mas_detach, unsigned long *locked_vm)
+static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach)
{
struct vm_area_struct *next = NULL;
- int count = 0;
int error = -ENOMEM;
/*
@@ -2689,17 +2696,18 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
*/
/* Does it split the first one? */
- if (start > vma->vm_start) {
+ if (vms->start > vms->vma->vm_start) {
/*
* Make sure that map_count on return from munmap() will
* not exceed its limit; but let map_count go just above
* its limit temporarily, to help free resources as expected.
*/
- if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
+ if (vms->end < vms->vma->vm_end &&
+ vms->mm->map_count >= sysctl_max_map_count)
goto map_count_exceeded;
- error = __split_vma(vmi, vma, start, 1);
+ error = __split_vma(vms->vmi, vms->vma, vms->start, 1);
if (error)
goto start_split_failed;
}
@@ -2708,25 +2716,25 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
* Detach a range of VMAs from the mm. Using next as a temp variable as
* it is always overwritten.
*/
- next = vma;
+ next = vms->vma;
do {
/* Does it split the end? */
- if (next->vm_end > end) {
- error = __split_vma(vmi, next, end, 0);
+ if (next->vm_end > vms->end) {
+ error = __split_vma(vms->vmi, next, vms->end, 0);
if (error)
goto end_split_failed;
}
vma_start_write(next);
- mas_set(mas_detach, count++);
+ mas_set(mas_detach, vms->vma_count++);
error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
if (error)
goto munmap_gather_failed;
vma_mark_detached(next, true);
if (next->vm_flags & VM_LOCKED)
- *locked_vm += vma_pages(next);
+ vms->locked_vm += vma_pages(next);
- if (unlikely(uf)) {
+ if (unlikely(vms->uf)) {
/*
* If userfaultfd_unmap_prep returns an error the vmas
* will remain split, but userland will get a
@@ -2736,16 +2744,17 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
* split, despite we could. This is unlikely enough
* failure that it's not worth optimizing it for.
*/
- error = userfaultfd_unmap_prep(next, start, end, uf);
+ error = userfaultfd_unmap_prep(next, vms->start,
+ vms->end, vms->uf);
if (error)
goto userfaultfd_error;
}
#ifdef CONFIG_DEBUG_VM_MAPLE_TREE
- BUG_ON(next->vm_start < start);
- BUG_ON(next->vm_start > end);
+ BUG_ON(next->vm_start < vms->start);
+ BUG_ON(next->vm_start > vms->end);
#endif
- } for_each_vma_range(*vmi, next, end);
+ } for_each_vma_range(*(vms->vmi), next, vms->end);
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
@@ -2754,21 +2763,21 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct vm_area_struct *vma_mas, *vma_test;
int test_count = 0;
- vma_iter_set(vmi, start);
+ vma_iter_set(vms->vmi, vms->start);
rcu_read_lock();
- vma_test = mas_find(&test, count - 1);
- for_each_vma_range(*vmi, vma_mas, end) {
+ vma_test = mas_find(&test, vms->vma_count - 1);
+ for_each_vma_range(*(vms->vmi), vma_mas, vms->end) {
BUG_ON(vma_mas != vma_test);
test_count++;
- vma_test = mas_next(&test, count - 1);
+ vma_test = mas_next(&test, vms->vma_count - 1);
}
rcu_read_unlock();
- BUG_ON(count != test_count);
+ BUG_ON(vms->vma_count != test_count);
}
#endif
- while (vma_iter_addr(vmi) > start)
- vma_iter_prev_range(vmi);
+ while (vma_iter_addr(vms->vmi) > vms->start)
+ vma_iter_prev_range(vms->vmi);
return 0;
@@ -2804,11 +2813,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
MA_STATE(mas_detach, &mt_detach, 0, 0);
mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
mt_on_stack(mt_detach);
+ struct vma_munmap_struct vms;
int error;
- unsigned long locked_vm = 0;
- error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
- &mas_detach, &locked_vm);
+ init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
+ error = vms_gather_munmap_vmas(&vms, &mas_detach);
if (error)
goto gather_failed;
@@ -2817,8 +2826,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
goto clear_tree_failed;
/* Point of no return */
- vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
- locked_vm);
+ vms_complete_munmap_vmas(&vms, &mas_detach);
return 0;
clear_tree_failed:
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v4 06/21] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (4 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 05/21] mm/mmap: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 07/21] mm/mmap: Extract validate_mm() from vma_complete() Liam R. Howlett
` (14 subsequent siblings)
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, Lorenzo Stoakes
Clean up the code by changing the munmap operation to use a structure
for the accounting and munmap variables.
Since remove_mt() is only called in one location and its contents will
be reduced to almost nothing, the remains of the function can be added
to vms_complete_munmap_vmas().
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
mm/internal.h | 6 ++++
mm/mmap.c | 80 +++++++++++++++++++++++++--------------------------
2 files changed, 46 insertions(+), 40 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 43b3c99c77ba..a22547125c13 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1488,12 +1488,18 @@ struct vma_munmap_struct {
struct vma_iterator *vmi;
struct mm_struct *mm;
struct vm_area_struct *vma; /* The first vma to munmap */
+ struct vm_area_struct *prev; /* vma before the munmap area */
+ struct vm_area_struct *next; /* vma after the munmap area */
struct list_head *uf; /* Userfaultfd list_head */
unsigned long start; /* Aligned start addr (inclusive) */
unsigned long end; /* Aligned end addr (exclusive) */
int vma_count; /* Number of vmas that will be removed */
unsigned long nr_pages; /* Number of pages being removed */
unsigned long locked_vm; /* Number of locked pages */
+ unsigned long nr_accounted; /* Number of VM_ACCOUNT pages */
+ unsigned long exec_vm;
+ unsigned long stack_vm;
+ unsigned long data_vm;
bool unlock; /* Unlock after the munmap */
};
diff --git a/mm/mmap.c b/mm/mmap.c
index 1ed0720c38c5..62ff7aa10004 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -523,7 +523,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
vms->unlock = unlock;
vms->uf = uf;
vms->vma_count = 0;
- vms->nr_pages = vms->locked_vm = 0;
+ vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
+ vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
}
/*
@@ -2388,30 +2389,6 @@ struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
return vma;
}
-/*
- * Ok - we have the memory areas we should free on a maple tree so release them,
- * and do the vma updates.
- *
- * Called with the mm semaphore held.
- */
-static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
-{
- unsigned long nr_accounted = 0;
- struct vm_area_struct *vma;
-
- /* Update high watermark before we lower total_vm */
- update_hiwater_vm(mm);
- mas_for_each(mas, vma, ULONG_MAX) {
- long nrpages = vma_pages(vma);
-
- if (vma->vm_flags & VM_ACCOUNT)
- nr_accounted += nrpages;
- vm_stat_account(mm, vma->vm_flags, -nrpages);
- remove_vma(vma, false);
- }
- vm_unacct_memory(nr_accounted);
-}
-
/*
* Get rid of page table information in the indicated region.
*
@@ -2632,15 +2609,14 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
* @vms: The vma munmap struct
* @mas_detach: The maple state of the detached vmas
*
- * This updates the mm_struct, unmaps the region, frees the resources
+ * This function updates the mm_struct, unmaps the region, frees the resources
* used for the munmap() and may downgrade the lock - if requested. Everything
* needed to be done once the vma maple tree is updated.
*/
-
static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
struct ma_state *mas_detach)
{
- struct vm_area_struct *prev, *next;
+ struct vm_area_struct *vma;
struct mm_struct *mm;
mm = vms->mm;
@@ -2649,21 +2625,26 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
if (vms->unlock)
mmap_write_downgrade(mm);
- prev = vma_iter_prev_range(vms->vmi);
- next = vma_next(vms->vmi);
- if (next)
- vma_iter_prev_range(vms->vmi);
-
/*
* We can free page tables without write-locking mmap_lock because VMAs
* were isolated before we downgraded mmap_lock.
*/
mas_set(mas_detach, 1);
- unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
- vms->vma_count, !vms->unlock);
- /* Statistics and freeing VMAs */
+ unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
+ vms->start, vms->end, vms->vma_count, !vms->unlock);
+ /* Update high watermark before we lower total_vm */
+ update_hiwater_vm(mm);
+ /* Stat accounting */
+ WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) - vms->nr_pages);
+ mm->exec_vm -= vms->exec_vm;
+ mm->stack_vm -= vms->stack_vm;
+ mm->data_vm -= vms->data_vm;
+ /* Remove and clean up vmas */
mas_set(mas_detach, 0);
- remove_mt(mm, mas_detach);
+ mas_for_each(mas_detach, vma, ULONG_MAX)
+ remove_vma(vma, false);
+
+ vm_unacct_memory(vms->nr_accounted);
validate_mm(mm);
if (vms->unlock)
mmap_read_unlock(mm);
@@ -2711,13 +2692,14 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
if (error)
goto start_split_failed;
}
+ vms->prev = vma_prev(vms->vmi);
/*
* Detach a range of VMAs from the mm. Using next as a temp variable as
* it is always overwritten.
*/
- next = vms->vma;
- do {
+ for_each_vma_range(*(vms->vmi), next, vms->end) {
+ long nrpages;
/* Does it split the end? */
if (next->vm_end > vms->end) {
error = __split_vma(vms->vmi, next, vms->end, 0);
@@ -2731,6 +2713,22 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
goto munmap_gather_failed;
vma_mark_detached(next, true);
+ nrpages = vma_pages(next);
+
+ vms->nr_pages += nrpages;
+ if (next->vm_flags & VM_LOCKED)
+ vms->locked_vm += nrpages;
+
+ if (next->vm_flags & VM_ACCOUNT)
+ vms->nr_accounted += nrpages;
+
+ if (is_exec_mapping(next->vm_flags))
+ vms->exec_vm += nrpages;
+ else if (is_stack_mapping(next->vm_flags))
+ vms->stack_vm += nrpages;
+ else if (is_data_mapping(next->vm_flags))
+ vms->data_vm += nrpages;
+
if (next->vm_flags & VM_LOCKED)
vms->locked_vm += vma_pages(next);
@@ -2754,7 +2752,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
BUG_ON(next->vm_start < vms->start);
BUG_ON(next->vm_start > vms->end);
#endif
- } for_each_vma_range(*(vms->vmi), next, vms->end);
+ }
+
+ vms->next = vma_next(vms->vmi);
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v4 07/21] mm/mmap: Extract validate_mm() from vma_complete()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (5 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 06/21] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 08/21] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
` (13 subsequent siblings)
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, Lorenzo Stoakes
vma_complete() will need to be called at a point where it is unsafe to
call validate_mm(). Extract the validate_mm() call out to all callers
now so that only one location needs to be modified in the next change.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 62ff7aa10004..1c9016fb6b5c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -646,7 +646,6 @@ static inline void vma_complete(struct vma_prepare *vp,
}
if (vp->insert && vp->file)
uprobe_mmap(vp->insert);
- validate_mm(mm);
}
/*
@@ -734,6 +733,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma_iter_store(vmi, vma);
vma_complete(&vp, vmi, vma->vm_mm);
+ validate_mm(vma->vm_mm);
return 0;
nomem:
@@ -775,6 +775,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma_iter_clear(vmi);
vma_set_range(vma, start, end, pgoff);
vma_complete(&vp, vmi, vma->vm_mm);
+ validate_mm(vma->vm_mm);
return 0;
}
@@ -1103,6 +1104,7 @@ static struct vm_area_struct
}
vma_complete(&vp, vmi, mm);
+ validate_mm(mm);
khugepaged_enter_vma(res, vm_flags);
return res;
@@ -2481,6 +2483,7 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
/* vma_complete stores the new vma */
vma_complete(&vp, vmi, vma->vm_mm);
+ validate_mm(vma->vm_mm);
/* Success. */
if (new_below)
@@ -3354,6 +3357,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma_iter_store(vmi, vma);
vma_complete(&vp, vmi, mm);
+ validate_mm(mm);
khugepaged_enter_vma(vma, flags);
goto out;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v4 08/21] mm/mmap: Inline munmap operation in mmap_region()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (6 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 07/21] mm/mmap: Extract validate_mm() from vma_complete() Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 09/21] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
` (12 subsequent siblings)
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, Lorenzo Stoakes
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
mmap_region() is already passed a sanitized addr and len, so change the
do_vmi_munmap() call to do_vmi_align_munmap() and inline the other
checks.
The inlining of the function and checks is an intermediate step in the
series so future patches are easier to follow.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 1c9016fb6b5c..49b3ab406353 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2938,12 +2938,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
return -ENOMEM;
}
- /* Unmap any existing mapping in the area */
- error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
- if (error == -EPERM)
- return error;
- else if (error)
- return -ENOMEM;
+
+ if (unlikely(!can_modify_mm(mm, addr, end)))
+ return -EPERM;
+
+ /* arch_unmap() might do unmaps itself. */
+ arch_unmap(mm, addr, end);
+
+ /* Find the first overlapping VMA */
+ vma = vma_find(&vmi, end);
+ if (vma) {
+ /* Unmap any existing mapping in the area */
+ if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
+ return -ENOMEM;
+ vma = NULL;
+ }
/*
* Private writable mapping: check memory availability
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v4 09/21] mm/mmap: Expand mmap_region() munmap call
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (7 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 08/21] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-11 14:16 ` Lorenzo Stoakes
2024-07-10 19:22 ` [PATCH v4 10/21] mm/mmap: Support vma == NULL in init_vma_munmap() Liam R. Howlett
` (11 subsequent siblings)
20 siblings, 1 reply; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Open code the do_vmi_align_munmap() call so that it can be broken up
later in the series.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/mmap.c | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 49b3ab406353..a1544a68558e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2916,6 +2916,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
struct vm_area_struct *next, *prev, *merge;
pgoff_t pglen = len >> PAGE_SHIFT;
unsigned long charged = 0;
+ struct vma_munmap_struct vms;
+ struct ma_state mas_detach;
+ struct maple_tree mt_detach;
unsigned long end = addr + len;
unsigned long merge_start = addr, merge_end = end;
bool writable_file_mapping = false;
@@ -2948,10 +2951,27 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/* Find the first overlapping VMA */
vma = vma_find(&vmi, end);
if (vma) {
- /* Unmap any existing mapping in the area */
- if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
+ mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+ mt_on_stack(mt_detach);
+ mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
+ init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
+ /* Prepare to unmap any existing mapping in the area */
+ if (vms_gather_munmap_vmas(&vms, &mas_detach))
+ return -ENOMEM;
+
+ /* Remove any existing mappings from the vma tree */
+ if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
return -ENOMEM;
+
+ /* Unmap any existing mapping in the area */
+ vms_complete_munmap_vmas(&vms, &mas_detach);
+ next = vms.next;
+ prev = vms.prev;
+ vma_prev(&vmi);
vma = NULL;
+ } else {
+ next = vma_next(&vmi);
+ prev = vma_prev(&vmi);
}
/*
@@ -2964,8 +2984,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vm_flags |= VM_ACCOUNT;
}
- next = vma_next(&vmi);
- prev = vma_prev(&vmi);
if (vm_flags & VM_SPECIAL) {
if (prev)
vma_iter_next_range(&vmi);
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v4 10/21] mm/mmap: Support vma == NULL in init_vma_munmap()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (8 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 09/21] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-11 14:28 ` Lorenzo Stoakes
2024-07-10 19:22 ` [PATCH v4 11/21] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
` (10 subsequent siblings)
20 siblings, 1 reply; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Adding support for a NULL vma means that init_vma_munmap() can always
be used to initialize the vma_munmap_struct, which makes the later call
to vms_complete_munmap_vmas() less error-prone.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/mmap.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index a1544a68558e..e2e6b3202c25 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -516,10 +516,12 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
bool unlock)
{
vms->vmi = vmi;
- vms->vma = vma;
- vms->mm = vma->vm_mm;
- vms->start = start;
- vms->end = end;
+ if (vma) {
+ vms->vma = vma;
+ vms->mm = vma->vm_mm;
+ vms->start = start;
+ vms->end = end;
+ }
vms->unlock = unlock;
vms->uf = uf;
vms->vma_count = 0;
@@ -2950,11 +2952,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/* Find the first overlapping VMA */
vma = vma_find(&vmi, end);
+ init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
if (vma) {
mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
mt_on_stack(mt_detach);
mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
- init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
/* Prepare to unmap any existing mapping in the area */
if (vms_gather_munmap_vmas(&vms, &mas_detach))
return -ENOMEM;
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v4 11/21] mm/mmap: Reposition vma iterator in mmap_region()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (9 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 10/21] mm/mmap: Support vma == NULL in init_vma_munmap() Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 12/21] mm/mmap: Track start and end of munmap in vma_munmap_struct Liam R. Howlett
` (9 subsequent siblings)
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, Lorenzo Stoakes
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Instead of moving (or leaving) the vma iterator pointing at the previous
vma, leave it pointing at the insert location. Pointing the vma
iterator at the insert location allows for a cleaner walk of the vma
tree for MAP_FIXED and the no expansion cases.
The vma_prev() call in the case of merging the previous vma is
equivalent to vma_iter_prev_range(), since the vma iterator will be
pointing to the location just before the previous vma.
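A rough picture of the positioning being described (illustration only,
not taken from the patch):

	/*
	 *            addr                 end
	 *   [ prev ] |  <mapping range>    | [ next ]
	 *
	 * The iterator is now left at the insert location (addr) rather
	 * than on prev, so the MAP_FIXED and no-expansion cases can walk
	 * the tree from there directly; when merging with prev,
	 * vma_prev() steps back onto prev, which from this position is
	 * equivalent to vma_iter_prev_range().
	 */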
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 38 ++++++++++++++++++++++----------------
1 file changed, 22 insertions(+), 16 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index e2e6b3202c25..83cfe575c114 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2959,21 +2959,22 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
/* Prepare to unmap any existing mapping in the area */
if (vms_gather_munmap_vmas(&vms, &mas_detach))
- return -ENOMEM;
+ goto gather_failed;
/* Remove any existing mappings from the vma tree */
if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
- return -ENOMEM;
+ goto clear_tree_failed;
/* Unmap any existing mapping in the area */
vms_complete_munmap_vmas(&vms, &mas_detach);
next = vms.next;
prev = vms.prev;
- vma_prev(&vmi);
vma = NULL;
} else {
next = vma_next(&vmi);
prev = vma_prev(&vmi);
+ if (prev)
+ vma_iter_next_range(&vmi);
}
/*
@@ -2986,11 +2987,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vm_flags |= VM_ACCOUNT;
}
- if (vm_flags & VM_SPECIAL) {
- if (prev)
- vma_iter_next_range(&vmi);
+ if (vm_flags & VM_SPECIAL)
goto cannot_expand;
- }
/* Attempt to expand an old mapping */
/* Check next */
@@ -3011,19 +3009,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
merge_start = prev->vm_start;
vma = prev;
vm_pgoff = prev->vm_pgoff;
- } else if (prev) {
- vma_iter_next_range(&vmi);
+ vma_prev(&vmi); /* Equivalent to going to the previous range */
}
- /* Actually expand, if possible */
- if (vma &&
- !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
- khugepaged_enter_vma(vma, vm_flags);
- goto expanded;
+ if (vma) {
+ /* Actually expand, if possible */
+ if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
+ khugepaged_enter_vma(vma, vm_flags);
+ goto expanded;
+ }
+
+ /* If the expand fails, then reposition the vma iterator */
+ if (unlikely(vma == prev))
+ vma_iter_set(&vmi, addr);
}
- if (vma == prev)
- vma_iter_set(&vmi, addr);
cannot_expand:
/*
@@ -3184,6 +3184,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vm_unacct_memory(charged);
validate_mm(mm);
return error;
+
+clear_tree_failed:
+ abort_munmap_vmas(&mas_detach);
+gather_failed:
+ validate_mm(mm);
+ return -ENOMEM;
}
static int __vm_munmap(unsigned long start, size_t len, bool unlock)
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v4 12/21] mm/mmap: Track start and end of munmap in vma_munmap_struct
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (10 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 11/21] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 13/21] mm/mmap: Clean up unmap_region() argument list Liam R. Howlett
` (8 subsequent siblings)
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, Lorenzo Stoakes
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Set the start and end addresses for the munmap when prev and next are
gathered. This is needed to avoid incorrect addresses being used in
vms_complete_munmap_vmas() if the prev/next vmas are expanded.
Add a new helper, vms_complete_pte_clear(), which is needed later and
avoids growing the argument list of unmap_region() beyond the nine
arguments it already has.
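Condensed from the diff below (an illustration of the recorded
boundaries, not additional behaviour), the range later handed to
free_pgtables() is:

	/* Defaults span the whole user address space... */
	vms->unmap_start = FIRST_USER_ADDRESS;
	vms->unmap_end = USER_PGTABLES_CEILING;
	...
	/* ...and are narrowed to the edges of the neighbouring vmas when
	 * they exist. */
	if (vms->prev)
		vms->unmap_start = vms->prev->vm_end;
	if (vms->next)
		vms->unmap_end = vms->next->vm_start;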
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/internal.h | 2 ++
mm/mmap.c | 35 ++++++++++++++++++++++++++++-------
2 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index a22547125c13..11e90c6e5a3e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1493,6 +1493,8 @@ struct vma_munmap_struct {
struct list_head *uf; /* Userfaultfd list_head */
unsigned long start; /* Aligned start addr (inclusive) */
unsigned long end; /* Aligned end addr (exclusive) */
+ unsigned long unmap_start; /* Unmap PTE start */
+ unsigned long unmap_end; /* Unmap PTE end */
int vma_count; /* Number of vmas that will be removed */
unsigned long nr_pages; /* Number of pages being removed */
unsigned long locked_vm; /* Number of locked pages */
diff --git a/mm/mmap.c b/mm/mmap.c
index 83cfe575c114..12a5ca6c979f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -527,6 +527,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
vms->vma_count = 0;
vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
+ vms->unmap_start = FIRST_USER_ADDRESS;
+ vms->unmap_end = USER_PGTABLES_CEILING;
}
/*
@@ -2609,6 +2611,27 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
__mt_destroy(mas_detach->tree);
}
+
+static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach, bool mm_wr_locked)
+{
+ struct mmu_gather tlb;
+
+ /*
+ * We can free page tables without write-locking mmap_lock because VMAs
+ * were isolated before we downgraded mmap_lock.
+ */
+ mas_set(mas_detach, 1);
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, vms->mm);
+ update_hiwater_rss(vms->mm);
+ unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked);
+ mas_set(mas_detach, 1);
+ /* start and end may be different if there is no prev or next vma. */
+ free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
+ tlb_finish_mmu(&tlb);
+}
+
/*
* vms_complete_munmap_vmas() - Finish the munmap() operation
* @vms: The vma munmap struct
@@ -2630,13 +2653,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
if (vms->unlock)
mmap_write_downgrade(mm);
- /*
- * We can free page tables without write-locking mmap_lock because VMAs
- * were isolated before we downgraded mmap_lock.
- */
- mas_set(mas_detach, 1);
- unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
- vms->start, vms->end, vms->vma_count, !vms->unlock);
+ vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
/* Update high watermark before we lower total_vm */
update_hiwater_vm(mm);
/* Stat accounting */
@@ -2698,6 +2715,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
goto start_split_failed;
}
vms->prev = vma_prev(vms->vmi);
+ if (vms->prev)
+ vms->unmap_start = vms->prev->vm_end;
/*
* Detach a range of VMAs from the mm. Using next as a temp variable as
@@ -2760,6 +2779,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
}
vms->next = vma_next(vms->vmi);
+ if (vms->next)
+ vms->unmap_end = vms->next->vm_start;
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v4 13/21] mm/mmap: Clean up unmap_region() argument list
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (11 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 12/21] mm/mmap: Track start and end of munmap in vma_munmap_struct Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
` (7 subsequent siblings)
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, Lorenzo Stoakes
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
With the only caller of unmap_region() being the error path of
mmap_region(), the argument list can be significantly reduced.
There is also no need to forward declare the static function any
longer.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 25 +++++++++----------------
1 file changed, 9 insertions(+), 16 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 12a5ca6c979f..870c2d04ad6b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -76,11 +76,6 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
static bool ignore_rlimit_data;
core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
-static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
- struct vm_area_struct *vma, struct vm_area_struct *prev,
- struct vm_area_struct *next, unsigned long start,
- unsigned long end, unsigned long tree_end, bool mm_wr_locked);
-
static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
{
return pgprot_modify(oldprot, vm_get_page_prot(vm_flags));
@@ -2400,22 +2395,21 @@ struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
*
* Called with the mm semaphore held.
*/
-static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
- struct vm_area_struct *vma, struct vm_area_struct *prev,
- struct vm_area_struct *next, unsigned long start,
- unsigned long end, unsigned long tree_end, bool mm_wr_locked)
+static void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
+ struct vm_area_struct *prev, struct vm_area_struct *next)
{
+ struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
- unsigned long mt_start = mas->index;
lru_add_drain();
tlb_gather_mmu(&tlb, mm);
update_hiwater_rss(mm);
- unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked);
- mas_set(mas, mt_start);
+ unmap_vmas(&tlb, mas, vma, vma->vm_start, vma->vm_end, vma->vm_end,
+ /* mm_wr_locked = */ true);
+ mas_set(mas, vma->vm_end);
free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
- next ? next->vm_start : USER_PGTABLES_CEILING,
- mm_wr_locked);
+ next ? next->vm_start : USER_PGTABLES_CEILING,
+ /* mm_wr_locked = */ true);
tlb_finish_mmu(&tlb);
}
@@ -3193,8 +3187,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma_iter_set(&vmi, vma->vm_end);
/* Undo any partial mapping done by a device driver. */
- unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start,
- vma->vm_end, vma->vm_end, true);
+ unmap_region(&vmi.mas, vma, prev, next);
}
if (writable_file_mapping)
mapping_unmap_writable(file->f_mapping);
--
2.43.0
* [PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (12 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 13/21] mm/mmap: Clean up unmap_region() argument list Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-11 15:25 ` Lorenzo Stoakes
2024-07-16 12:46 ` kernel test robot
2024-07-10 19:22 ` [PATCH v4 15/21] mm/mmap: Use PHYS_PFN " Liam R. Howlett
` (6 subsequent siblings)
20 siblings, 2 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Instead of zeroing the vma tree and then overwriting the area, let the
area be overwritten and then clean up the gathered vmas using
vms_complete_munmap_vmas().
If a driver is mapping over an existing vma, then clear the ptes before
the call_mmap() invocation. This is done using the vms_clear_ptes()
helper.
Temporarily keep track of the number of pages that will be removed and
reduce the charged amount.
This also drops the validate_mm() call in the vma_expand() function.
It is necessary to drop the validate as it would fail since the mm
map_count would be incorrect during a vma expansion, prior to the
cleanup from vms_complete_munmap_vmas().
Clean up the error handling of vms_gather_munmap_vmas() by calling
the verification within the function.
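For readers following the hunks below, the reordered mmap_region() flow ends
up roughly as this sketch (heavily simplified, with error handling and the
merge logic elided; the diff is authoritative):

    vma = vma_find(&vmi, end);
    init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
    if (vma) {
        /* Split and isolate the old vmas, but leave them in the tree */
        if (vms_gather_munmap_vmas(&vms, &mas_detach))
            return -ENOMEM;
    }
    ...
    if (file) {
        /* call_mmap() may populate the range, so clear the old PTEs first */
        vms_clear_ptes(&vms, &mas_detach, /* mm_wr_locked = */ true);
        error = call_mmap(file, vma);
        ...
    }
    /* The new vma now overwrites the old range in the vma tree */
    ...
    /* Only once the replacement is in place are the old vmas torn down */
    if (vms.nr_pages)
        vms_complete_munmap_vmas(&vms, &mas_detach);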
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/internal.h | 1 +
mm/mmap.c | 80 +++++++++++++++++++++++++++------------------------
2 files changed, 44 insertions(+), 37 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 11e90c6e5a3e..dd4eede1be0f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1503,6 +1503,7 @@ struct vma_munmap_struct {
unsigned long stack_vm;
unsigned long data_vm;
bool unlock; /* Unlock after the munmap */
+ bool clear_ptes; /* If there are outstanding PTE to be cleared */
};
void __meminit __init_single_page(struct page *page, unsigned long pfn,
diff --git a/mm/mmap.c b/mm/mmap.c
index 870c2d04ad6b..58cf42e22bfe 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -401,17 +401,21 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
}
static unsigned long count_vma_pages_range(struct mm_struct *mm,
- unsigned long addr, unsigned long end)
+ unsigned long addr, unsigned long end,
+ unsigned long *nr_accounted)
{
VMA_ITERATOR(vmi, mm, addr);
struct vm_area_struct *vma;
unsigned long nr_pages = 0;
+ *nr_accounted = 0;
for_each_vma_range(vmi, vma, end) {
unsigned long vm_start = max(addr, vma->vm_start);
unsigned long vm_end = min(end, vma->vm_end);
nr_pages += PHYS_PFN(vm_end - vm_start);
+ if (vma->vm_flags & VM_ACCOUNT)
+ *nr_accounted += PHYS_PFN(vm_end - vm_start);
}
return nr_pages;
@@ -524,6 +528,7 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
vms->unmap_start = FIRST_USER_ADDRESS;
vms->unmap_end = USER_PGTABLES_CEILING;
+ vms->clear_ptes = false; /* No PTEs to clear yet */
}
/*
@@ -732,7 +737,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma_iter_store(vmi, vma);
vma_complete(&vp, vmi, vma->vm_mm);
- validate_mm(vma->vm_mm);
return 0;
nomem:
@@ -2606,11 +2610,14 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
}
-static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
+static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
struct ma_state *mas_detach, bool mm_wr_locked)
{
struct mmu_gather tlb;
+ if (!vms->clear_ptes) /* Nothing to do */
+ return;
+
/*
* We can free page tables without write-locking mmap_lock because VMAs
* were isolated before we downgraded mmap_lock.
@@ -2624,6 +2631,7 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
/* start and end may be different if there is no prev or next vma. */
free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
tlb_finish_mmu(&tlb);
+ vms->clear_ptes = false;
}
/*
@@ -2647,7 +2655,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
if (vms->unlock)
mmap_write_downgrade(mm);
- vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
+ vms_clear_ptes(vms, mas_detach, !vms->unlock);
/* Update high watermark before we lower total_vm */
update_hiwater_vm(mm);
/* Stat accounting */
@@ -2799,6 +2807,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
while (vma_iter_addr(vms->vmi) > vms->start)
vma_iter_prev_range(vms->vmi);
+ /* There are now PTEs that need to be cleared */
+ vms->clear_ptes = true;
+
return 0;
userfaultfd_error:
@@ -2807,6 +2818,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
abort_munmap_vmas(mas_detach);
start_split_failed:
map_count_exceeded:
+ validate_mm(vms->mm);
return error;
}
@@ -2851,8 +2863,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
clear_tree_failed:
abort_munmap_vmas(&mas_detach);
-gather_failed:
validate_mm(mm);
+gather_failed:
return error;
}
@@ -2940,24 +2952,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long merge_start = addr, merge_end = end;
bool writable_file_mapping = false;
pgoff_t vm_pgoff;
- int error;
+ int error = -ENOMEM;
VMA_ITERATOR(vmi, mm, addr);
+ unsigned long nr_pages, nr_accounted;
- /* Check against address space limit. */
- if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
- unsigned long nr_pages;
-
- /*
- * MAP_FIXED may remove pages of mappings that intersects with
- * requested mapping. Account for the pages it would unmap.
- */
- nr_pages = count_vma_pages_range(mm, addr, end);
-
- if (!may_expand_vm(mm, vm_flags,
- (len >> PAGE_SHIFT) - nr_pages))
- return -ENOMEM;
- }
+ nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
+ /*
+ * Check against address space limit.
+ * MAP_FIXED may remove pages of mappings that intersects with requested
+ * mapping. Account for the pages it would unmap.
+ */
+ if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
+ return -ENOMEM;
if (unlikely(!can_modify_mm(mm, addr, end)))
return -EPERM;
@@ -2974,18 +2981,13 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
/* Prepare to unmap any existing mapping in the area */
if (vms_gather_munmap_vmas(&vms, &mas_detach))
- goto gather_failed;
-
- /* Remove any existing mappings from the vma tree */
- if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
- goto clear_tree_failed;
+ return -ENOMEM;
- /* Unmap any existing mapping in the area */
- vms_complete_munmap_vmas(&vms, &mas_detach);
next = vms.next;
prev = vms.prev;
vma = NULL;
} else {
+ /* Minimal setup of vms */
next = vma_next(&vmi);
prev = vma_prev(&vmi);
if (prev)
@@ -2997,8 +2999,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
*/
if (accountable_mapping(file, vm_flags)) {
charged = len >> PAGE_SHIFT;
+ charged -= nr_accounted;
if (security_vm_enough_memory_mm(mm, charged))
- return -ENOMEM;
+ goto abort_munmap;
+ vms.nr_accounted = 0;
vm_flags |= VM_ACCOUNT;
}
@@ -3047,10 +3051,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* not unmapped, but the maps are removed from the list.
*/
vma = vm_area_alloc(mm);
- if (!vma) {
- error = -ENOMEM;
+ if (!vma)
goto unacct_error;
- }
vma_iter_config(&vmi, addr, end);
vma_set_range(vma, addr, end, pgoff);
@@ -3059,6 +3061,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (file) {
vma->vm_file = get_file(file);
+ /* call_mmap() may map PTE, so ensure there are no existing PTEs */
+ vms_clear_ptes(&vms, &mas_detach, true);
error = call_mmap(file, vma);
if (error)
goto unmap_and_free_vma;
@@ -3149,6 +3153,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
expanded:
perf_event_mmap(vma);
+ /* Unmap any existing mapping in the area */
+ if (vms.nr_pages)
+ vms_complete_munmap_vmas(&vms, &mas_detach);
+
vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
@@ -3196,14 +3204,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
unacct_error:
if (charged)
vm_unacct_memory(charged);
- validate_mm(mm);
- return error;
-clear_tree_failed:
- abort_munmap_vmas(&mas_detach);
-gather_failed:
+abort_munmap:
+ if (vms.nr_pages)
+ abort_munmap_vmas(&mas_detach);
validate_mm(mm);
- return -ENOMEM;
+ return error;
}
static int __vm_munmap(unsigned long start, size_t len, bool unlock)
--
2.43.0
* [PATCH v4 15/21] mm/mmap: Use PHYS_PFN in mmap_region()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (13 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 16/21] mm/mmap: Use vms accounted pages " Liam R. Howlett
` (5 subsequent siblings)
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, Lorenzo Stoakes
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Instead of shifting the length by PAGE_SHIFT, use PHYS_PFN. Also use the
existing local variable everywhere instead of only some of the time.
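For reference, PHYS_PFN(x) is essentially (x) >> PAGE_SHIFT (pfn.h just adds
an unsigned long cast), so the two forms compute the same page count. A
minimal illustration, not a hunk from this patch:

    pgoff_t pglen_old = len >> PAGE_SHIFT;  /* open-coded shift */
    pgoff_t pglen_new = PHYS_PFN(len);      /* same value via the helper */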
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
mm/mmap.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 58cf42e22bfe..e13f032fa69e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2943,7 +2943,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
struct vm_area_struct *next, *prev, *merge;
- pgoff_t pglen = len >> PAGE_SHIFT;
+ pgoff_t pglen = PHYS_PFN(len);
unsigned long charged = 0;
struct vma_munmap_struct vms;
struct ma_state mas_detach;
@@ -2963,7 +2963,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* MAP_FIXED may remove pages of mappings that intersects with requested
* mapping. Account for the pages it would unmap.
*/
- if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
+ if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
return -ENOMEM;
if (unlikely(!can_modify_mm(mm, addr, end)))
@@ -2998,7 +2998,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* Private writable mapping: check memory availability
*/
if (accountable_mapping(file, vm_flags)) {
- charged = len >> PAGE_SHIFT;
+ charged = pglen;
charged -= nr_accounted;
if (security_vm_enough_memory_mm(mm, charged))
goto abort_munmap;
@@ -3157,14 +3157,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (vms.nr_pages)
vms_complete_munmap_vmas(&vms, &mas_detach);
- vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
+ vm_stat_account(mm, vm_flags, pglen);
if (vm_flags & VM_LOCKED) {
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm))
vm_flags_clear(vma, VM_LOCKED_MASK);
else
- mm->locked_vm += (len >> PAGE_SHIFT);
+ mm->locked_vm += pglen;
}
if (file)
--
2.43.0
* [PATCH v4 16/21] mm/mmap: Use vms accounted pages in mmap_region()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (14 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 15/21] mm/mmap: Use PHYS_PFN " Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs Liam R. Howlett
` (4 subsequent siblings)
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, linux-security-module, Lorenzo Stoakes
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Change from nr_pages variable to vms.nr_accounted for the charged pages
calculation. This is necessary for a future patch.
This also avoids checking security_vm_enough_memory_mm() if the amount
of memory won't change.
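As a hypothetical worked example of the saved check: a private writable
MAP_FIXED mapping of 16 pages placed exactly over an existing 16-page
VM_ACCOUNT mapping ends up with

    charged = pglen - vms.nr_accounted;  /* 16 - 16 == 0 */

so security_vm_enough_memory_mm() is not called at all; only a non-zero
delta is ever charged.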
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Kees Cook <kees@kernel.org>
Cc: linux-security-module@vger.kernel.org
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
mm/mmap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index e13f032fa69e..d5bd404893a8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2987,7 +2987,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
prev = vms.prev;
vma = NULL;
} else {
- /* Minimal setup of vms */
next = vma_next(&vmi);
prev = vma_prev(&vmi);
if (prev)
@@ -2999,9 +2998,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
*/
if (accountable_mapping(file, vm_flags)) {
charged = pglen;
- charged -= nr_accounted;
- if (security_vm_enough_memory_mm(mm, charged))
+ charged -= vms.nr_accounted;
+ if (charged && security_vm_enough_memory_mm(mm, charged))
goto abort_munmap;
+
vms.nr_accounted = 0;
vm_flags |= VM_ACCOUNT;
}
--
2.43.0
* [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (15 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 16/21] mm/mmap: Use vms accounted pages " Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:27 ` Dave Hansen
2024-07-10 21:02 ` LEROY Christophe
2024-07-10 19:22 ` [PATCH v4 18/21] mm/mmap: Move can_modify_mm() check down the stack Liam R. Howlett
` (3 subsequent siblings)
20 siblings, 2 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, Dave Hansen, LEROY Christophe, linuxppc-dev,
Dmitry Safonov, Michael Ellerman
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
The arch_unmap call was previously moved above the rbtree modifications
in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap()
corruption"). The move was motivated by an issue with calling
arch_unmap() after the rbtree was modified.
Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c
("x86/mpx: remove MPX from arch/x86"), so the motivation for calling
arch_unmap() prior to modifying the vma tree no longer exists
(regardless of rbtree or maple tree implementations).
Furthermore, the powerpc implementation is also no longer needed as per
[1] and [2]. So the arch_unmap() function can be completely removed.
Link: https://lore.kernel.org/lkml/20210611180242.711399-1-dima@arista.com/
Link: https://github.com/linuxppc/issues/issues/241
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: LEROY Christophe <christophe.leroy2@cs-soprasteria.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Dmitry Safonov <dima@arista.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
---
arch/powerpc/include/asm/mmu_context.h | 9 ---------
arch/x86/include/asm/mmu_context.h | 5 -----
include/asm-generic/mm_hooks.h | 11 +++--------
mm/mmap.c | 12 ++----------
4 files changed, 5 insertions(+), 32 deletions(-)
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 37bffa0f7918..a334a1368848 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -260,15 +260,6 @@ static inline void enter_lazy_tlb(struct mm_struct *mm,
extern void arch_exit_mmap(struct mm_struct *mm);
-static inline void arch_unmap(struct mm_struct *mm,
- unsigned long start, unsigned long end)
-{
- unsigned long vdso_base = (unsigned long)mm->context.vdso;
-
- if (start <= vdso_base && vdso_base < end)
- mm->context.vdso = NULL;
-}
-
#ifdef CONFIG_PPC_MEM_KEYS
bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write,
bool execute, bool foreign);
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 8dac45a2c7fc..80f2a3187aa6 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -232,11 +232,6 @@ static inline bool is_64bit_mm(struct mm_struct *mm)
}
#endif
-static inline void arch_unmap(struct mm_struct *mm, unsigned long start,
- unsigned long end)
-{
-}
-
/*
* We only want to enforce protection keys on the current process
* because we effectively have no access to PKRU for other
diff --git a/include/asm-generic/mm_hooks.h b/include/asm-generic/mm_hooks.h
index 4dbb177d1150..f7996376baf9 100644
--- a/include/asm-generic/mm_hooks.h
+++ b/include/asm-generic/mm_hooks.h
@@ -1,8 +1,8 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
- * Define generic no-op hooks for arch_dup_mmap, arch_exit_mmap
- * and arch_unmap to be included in asm-FOO/mmu_context.h for any
- * arch FOO which doesn't need to hook these.
+ * Define generic no-op hooks for arch_dup_mmap and arch_exit_mmap to be
+ * included in asm-FOO/mmu_context.h for any arch FOO which doesn't need to hook
+ * these.
*/
#ifndef _ASM_GENERIC_MM_HOOKS_H
#define _ASM_GENERIC_MM_HOOKS_H
@@ -17,11 +17,6 @@ static inline void arch_exit_mmap(struct mm_struct *mm)
{
}
-static inline void arch_unmap(struct mm_struct *mm,
- unsigned long start, unsigned long end)
-{
-}
-
static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
bool write, bool execute, bool foreign)
{
diff --git a/mm/mmap.c b/mm/mmap.c
index d5bd404893a8..df565f51971d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2652,6 +2652,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
mm = vms->mm;
mm->map_count -= vms->vma_count;
mm->locked_vm -= vms->locked_vm;
+
if (vms->unlock)
mmap_write_downgrade(mm);
@@ -2879,7 +2880,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
*
* This function takes a @mas that is either pointing to the previous VMA or set
* to MA_START and sets it up to remove the mapping(s). The @len will be
- * aligned and any arch_unmap work will be preformed.
+ * aligned.
*
* Return: 0 on success and drops the lock if so directed, error and leaves the
* lock held otherwise.
@@ -2899,16 +2900,12 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
return -EINVAL;
/*
- * Check if memory is sealed before arch_unmap.
* Prevent unmapping a sealed VMA.
* can_modify_mm assumes we have acquired the lock on MM.
*/
if (unlikely(!can_modify_mm(mm, start, end)))
return -EPERM;
- /* arch_unmap() might do unmaps itself. */
- arch_unmap(mm, start, end);
-
/* Find the first overlapping VMA */
vma = vma_find(vmi, end);
if (!vma) {
@@ -2969,9 +2966,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (unlikely(!can_modify_mm(mm, addr, end)))
return -EPERM;
- /* arch_unmap() might do unmaps itself. */
- arch_unmap(mm, addr, end);
-
/* Find the first overlapping VMA */
vma = vma_find(&vmi, end);
init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
@@ -3348,14 +3342,12 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
/*
- * Check if memory is sealed before arch_unmap.
* Prevent unmapping a sealed VMA.
* can_modify_mm assumes we have acquired the lock on MM.
*/
if (unlikely(!can_modify_mm(mm, start, end)))
return -EPERM;
- arch_unmap(mm, start, end);
return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
}
--
2.43.0
* [PATCH v4 18/21] mm/mmap: Move can_modify_mm() check down the stack
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (16 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-17 5:03 ` Jeff Xu
2024-07-10 19:22 ` [PATCH v4 19/21] ipc/shm, mm: Drop do_vma_munmap() Liam R. Howlett
` (2 subsequent siblings)
20 siblings, 1 reply; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett, Jeff Xu
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Now that there is no arch_unmap() call anymore, the check for mseal'ed vmas can be
moved lower as well. This has the benefit of only actually checking if
things are msealed when there is anything to check. That is, we know
there is at least one vma that is in the way and needs to be checked.
Only call the can_modify_mm() in do_vmi_align_munmap() and the MAP_FIXED
case of mmap_region().
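The MAP_FIXED side of mmap_region() then looks roughly like this
(simplified from the hunk below):

    vma = vma_find(&vmi, end);
    init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
    if (vma) {
        /* Only now do we know something is mapped in the way */
        if (unlikely(!can_modify_mm(mm, addr, end)))
            return -EPERM;
        ...
    }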
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Jeff Xu <jeffxu@chromium.org>
---
mm/mmap.c | 24 ++++++++----------------
1 file changed, 8 insertions(+), 16 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index df565f51971d..c343366b3ad2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2849,6 +2849,10 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct vma_munmap_struct vms;
int error;
+ /* Prevent unmapping a sealed VMA. */
+ if (unlikely(!can_modify_mm(mm, start, end)))
+ return -EPERM;
+
init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
error = vms_gather_munmap_vmas(&vms, &mas_detach);
if (error)
@@ -2899,13 +2903,6 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
if (end == start)
return -EINVAL;
- /*
- * Prevent unmapping a sealed VMA.
- * can_modify_mm assumes we have acquired the lock on MM.
- */
- if (unlikely(!can_modify_mm(mm, start, end)))
- return -EPERM;
-
/* Find the first overlapping VMA */
vma = vma_find(vmi, end);
if (!vma) {
@@ -2963,13 +2960,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
return -ENOMEM;
- if (unlikely(!can_modify_mm(mm, addr, end)))
- return -EPERM;
/* Find the first overlapping VMA */
vma = vma_find(&vmi, end);
init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
if (vma) {
+ /* Prevent unmapping a sealed VMA. */
+ if (unlikely(!can_modify_mm(mm, addr, end)))
+ return -EPERM;
+
mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
mt_on_stack(mt_detach);
mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
@@ -3341,13 +3340,6 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
- /*
- * Prevent unmapping a sealed VMA.
- * can_modify_mm assumes we have acquired the lock on MM.
- */
- if (unlikely(!can_modify_mm(mm, start, end)))
- return -EPERM;
-
return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
}
--
2.43.0
* [PATCH v4 19/21] ipc/shm, mm: Drop do_vma_munmap()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (17 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 18/21] mm/mmap: Move can_modify_mm() check down the stack Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 20/21] mm/mmap: Move may_expand_vm() check in mmap_region() Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 21/21] mm/mmap: Drop incorrect comment from vms_gather_munmap_vmas() Liam R. Howlett
20 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
The do_vma_munmap() wrapper existed for callers that didn't have a vma
iterator and needed to check the vma mseal status prior to calling the
underlying munmap(). All callers now use a vma iterator and since the
mseal check of can_modify_mm() has been moved to do_vmi_align_munmap()
and the vmas are aligned, this function can just be called instead.
do_vmi_align_munmap() can no longer be static as ipc/shm is using it and
it is exported via the mm.h header.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
include/linux/mm.h | 6 +++---
ipc/shm.c | 8 ++++----
mm/mmap.c | 33 +++++----------------------------
3 files changed, 12 insertions(+), 35 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e2140ea6ae98..40f1db1fb233 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3400,14 +3400,14 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
unsigned long start, size_t len, struct list_head *uf,
bool unlock);
+extern int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+ struct mm_struct *mm, unsigned long start,
+ unsigned long end, struct list_head *uf, bool unlock);
extern int do_munmap(struct mm_struct *, unsigned long, size_t,
struct list_head *uf);
extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
#ifdef CONFIG_MMU
-extern int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
- unsigned long start, unsigned long end,
- struct list_head *uf, bool unlock);
extern int __mm_populate(unsigned long addr, unsigned long len,
int ignore_errors);
static inline void mm_populate(unsigned long addr, unsigned long len)
diff --git a/ipc/shm.c b/ipc/shm.c
index 3e3071252dac..99564c870084 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1778,8 +1778,8 @@ long ksys_shmdt(char __user *shmaddr)
*/
file = vma->vm_file;
size = i_size_read(file_inode(vma->vm_file));
- do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end,
- NULL, false);
+ do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start,
+ vma->vm_end, NULL, false);
/*
* We discovered the size of the shm segment, so
* break out of here and fall through to the next
@@ -1803,8 +1803,8 @@ long ksys_shmdt(char __user *shmaddr)
if ((vma->vm_ops == &shm_vm_ops) &&
((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
(vma->vm_file == file)) {
- do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end,
- NULL, false);
+ do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start,
+ vma->vm_end, NULL, false);
}
vma = vma_next(&vmi);
diff --git a/mm/mmap.c b/mm/mmap.c
index c343366b3ad2..18c269bf8703 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -268,11 +268,12 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
goto out; /* mapping intersects with an existing non-brk vma. */
/*
* mm->brk must be protected by write mmap_lock.
- * do_vma_munmap() will drop the lock on success, so update it
- * before calling do_vma_munmap().
+ * do_vmi_align_munmap() will drop the lock on success, so
+ * update it before calling do_vma_munmap().
*/
mm->brk = brk;
- if (do_vma_munmap(&vmi, brkvma, newbrk, oldbrk, &uf, true))
+ if (do_vmi_align_munmap(&vmi, brkvma, mm, newbrk, oldbrk, &uf,
+ /* unlock = */ true))
goto out;
goto success_unlocked;
@@ -2837,7 +2838,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
* Return: 0 on success and drops the lock if so directed, error and leaves the
* lock held otherwise.
*/
-static int
+int
do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct mm_struct *mm, unsigned long start,
unsigned long end, struct list_head *uf, bool unlock)
@@ -3319,30 +3320,6 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
return ret;
}
-/*
- * do_vma_munmap() - Unmap a full or partial vma.
- * @vmi: The vma iterator pointing at the vma
- * @vma: The first vma to be munmapped
- * @start: the start of the address to unmap
- * @end: The end of the address to unmap
- * @uf: The userfaultfd list_head
- * @unlock: Drop the lock on success
- *
- * unmaps a VMA mapping when the vma iterator is already in position.
- * Does not handle alignment.
- *
- * Return: 0 on success drops the lock of so directed, error on failure and will
- * still hold the lock.
- */
-int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
- unsigned long start, unsigned long end, struct list_head *uf,
- bool unlock)
-{
- struct mm_struct *mm = vma->vm_mm;
-
- return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
-}
-
/*
* do_brk_flags() - Increase the brk vma if the flags match.
* @vmi: The vma iterator
--
2.43.0
* [PATCH v4 20/21] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (18 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 19/21] ipc/shm, mm: Drop do_vma_munmap() Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-11 15:38 ` Lorenzo Stoakes
2024-07-10 19:22 ` [PATCH v4 21/21] mm/mmap: Drop incorrect comment from vms_gather_munmap_vmas() Liam R. Howlett
20 siblings, 1 reply; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
call, so use it instead of looping over the vmas twice.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/mmap.c | 36 ++++--------------------------------
1 file changed, 4 insertions(+), 32 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 18c269bf8703..7a440e7da55a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -401,27 +401,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
}
-static unsigned long count_vma_pages_range(struct mm_struct *mm,
- unsigned long addr, unsigned long end,
- unsigned long *nr_accounted)
-{
- VMA_ITERATOR(vmi, mm, addr);
- struct vm_area_struct *vma;
- unsigned long nr_pages = 0;
-
- *nr_accounted = 0;
- for_each_vma_range(vmi, vma, end) {
- unsigned long vm_start = max(addr, vma->vm_start);
- unsigned long vm_end = min(end, vma->vm_end);
-
- nr_pages += PHYS_PFN(vm_end - vm_start);
- if (vma->vm_flags & VM_ACCOUNT)
- *nr_accounted += PHYS_PFN(vm_end - vm_start);
- }
-
- return nr_pages;
-}
-
static void __vma_link_file(struct vm_area_struct *vma,
struct address_space *mapping)
{
@@ -2949,17 +2928,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
pgoff_t vm_pgoff;
int error = -ENOMEM;
VMA_ITERATOR(vmi, mm, addr);
- unsigned long nr_pages, nr_accounted;
-
- nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
-
- /*
- * Check against address space limit.
- * MAP_FIXED may remove pages of mappings that intersects with requested
- * mapping. Account for the pages it would unmap.
- */
- if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
- return -ENOMEM;
/* Find the first overlapping VMA */
@@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma_iter_next_range(&vmi);
}
+ /* Check against address space limit. */
+ if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
+ goto abort_munmap;
+
/*
* Private writable mapping: check memory availability
*/
--
2.43.0
* [PATCH v4 21/21] mm/mmap: Drop incorrect comment from vms_gather_munmap_vmas()
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (19 preceding siblings ...)
2024-07-10 19:22 ` [PATCH v4 20/21] mm/mmap: Move may_expand_vm() check in mmap_region() Liam R. Howlett
@ 2024-07-10 19:22 ` Liam R. Howlett
2024-07-11 15:39 ` Lorenzo Stoakes
20 siblings, 1 reply; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 19:22 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
The comment has been outdated since 6b73cff239e52 ("mm: change munmap
splitting order and move_vma()"). move_vma() has since been altered to fix
the fragile state of the accounting.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/mmap.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 7a440e7da55a..ca3872e5fbd8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2675,13 +2675,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
/*
* If we need to split any vma, do it now to save pain later.
- *
- * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
- * unmapped vm_area_struct will remain in use: so lower split_vma
- * places tmp vma above, and higher split_vma places tmp vma below.
+ * Does it split the first one?
*/
-
- /* Does it split the first one? */
if (vms->start > vms->vma->vm_start) {
/*
--
2.43.0
* Re: [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs
2024-07-10 19:22 ` [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs Liam R. Howlett
@ 2024-07-10 19:27 ` Dave Hansen
2024-07-10 21:02 ` LEROY Christophe
1 sibling, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2024-07-10 19:27 UTC (permalink / raw)
To: Liam R. Howlett, linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
LEROY Christophe, linuxppc-dev, Dmitry Safonov, Michael Ellerman
On 7/10/24 12:22, Liam R. Howlett wrote:
> The arch_unmap call was previously moved above the rbtree modifications
> in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap()
> corruption"). The move was motivated by an issue with calling
> arch_unmap() after the rbtree was modified.
>
> Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c
> ("x86/mpx: remove MPX from arch/x86"), so the motivation for calling
> arch_unmap() prior to modifying the vma tree no longer exists
> (regardless of rbtree or maple tree implementations).
>
> Furthermore, the powerpc implementation is also no longer needed as per
> [1] and [2]. So the arch_unmap() function can be completely removed.
Thanks for doing this cleanup, Liam!
Acked-by: Dave Hansen <dave.hansen@intel.com>
* Re: [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs
2024-07-10 19:22 ` [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs Liam R. Howlett
2024-07-10 19:27 ` Dave Hansen
@ 2024-07-10 21:02 ` LEROY Christophe
2024-07-10 23:26 ` Liam R. Howlett
1 sibling, 1 reply; 38+ messages in thread
From: LEROY Christophe @ 2024-07-10 21:02 UTC (permalink / raw)
To: Liam R. Howlett, linux-mm@kvack.org, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar@oracle.com, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel@vger.kernel.org,
Kees Cook, Dave Hansen, linuxppc-dev@lists.ozlabs.org,
Dmitry Safonov, Michael Ellerman
Le 10/07/2024 à 21:22, Liam R. Howlett a écrit :
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> The arch_unmap call was previously moved above the rbtree modifications
> in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap()
> corruption"). The move was motivated by an issue with calling
> arch_unmap() after the rbtree was modified.
>
> Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c
> ("x86/mpx: remove MPX from arch/x86"), so the motivation for calling
> arch_unmap() prior to modifying the vma tree no longer exists
> (regardless of rbtree or maple tree implementations).
>
> Furthermore, the powerpc implementation is also no longer needed as per
> [1] and [2]. So the arch_unmap() function can be completely removed.
I'm not sure I understand. Is it replaced by something else?
We wanted to get rid of arch_unmap(), but it was supposed to be replaced
by some core function, because the functionality itself is still
required; indeed, all the discussion around [2] demonstrated that not
only powerpc but at least arm and probably others needed to properly
clean up references to VDSO mappings on unmapping.
So, as mentioned by Michael, you can't just drop that without replacing it
with something else. We need the VDSO signal handling to properly fall back
on the stack-based trampoline when the VDSO trampoline gets mapped out.
Or did I miss something?
Christophe
>
> Link: https://lore.kernel.org/lkml/20210611180242.711399-1-dima@arista.com/
> Link: https://github.com/linuxppc/issues/issues/241
* Re: [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs
2024-07-10 21:02 ` LEROY Christophe
@ 2024-07-10 23:26 ` Liam R. Howlett
2024-07-11 8:28 ` LEROY Christophe
0 siblings, 1 reply; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-10 23:26 UTC (permalink / raw)
To: LEROY Christophe
Cc: linux-mm@kvack.org, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox,
sidhartha.kumar@oracle.com, Paul E . McKenney, Bert Karwatzki,
Jiri Olsa, linux-kernel@vger.kernel.org, Kees Cook, Dave Hansen,
linuxppc-dev@lists.ozlabs.org, Dmitry Safonov, Michael Ellerman
* LEROY Christophe <christophe.leroy2@cs-soprasteria.com> [240710 17:02]:
>
>
> Le 10/07/2024 à 21:22, Liam R. Howlett a écrit :
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > The arch_unmap call was previously moved above the rbtree modifications
> > in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap()
> > corruption"). The move was motivated by an issue with calling
> > arch_unmap() after the rbtree was modified.
> >
> > Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c
> > ("x86/mpx: remove MPX from arch/x86"), so the motivation for calling
> > arch_unmap() prior to modifying the vma tree no longer exists
> > (regardless of rbtree or maple tree implementations).
> >
> > Furthermore, the powerpc implementation is also no longer needed as per
> > [1] and [2]. So the arch_unmap() function can be completely removed.
>
> I'm not sure to understand. Is it replaced by something else ?
> We wanted to get rid of arch_unmap() but it was supposed to be replaced
> by some core function because the functionnality itself is still
> required and indeed all the discussion around [2] demonstrated that not
> only powerpc but at least arm and probably others needed to properly
> clean-up reference to VDSO mappings on unmapping.
>
> So as mentioned by Michael you can't just drop that without replacing it
> by something else. We need the VDSO signal handling to properly fallback
> on stack-based trampoline when the VDSO trampoline gets mapped out.
I'll address this after the part I missed..
>
> Or did I miss something ?
>
I think I missed something in regards to what you need in ppc.
From what I understand, other platforms still map and use the vdso
(context.vdso is set), but arch_unmap() does nothing. It is only the
powerpc version that clears the vdso pointer if it is unmapped.
git grep -w arch_unmap shows:
arch/powerpc/include/asm/mmu_context.h
arch/x86/include/asm/mmu_context.h
include/asm-generic/mm_hooks.h
mm/mmap.c
The generic and x86 versions are empty.
From the patch set you referenced, we see changes related to the files
modified, but I don't think any of them did anything with arch_unmap().
arm: a0d2fcd62ac2 ("vdso/ARM: Make union vdso_data_store available for all architectures")
arm64: d0fba04847ae ("arm64: vdso: Use generic union vdso_data_store")
mips: d697a9997a0d ("MIPS: vdso: Use generic union vdso_data_store")
s390: cb3444cfdb48 ("s390/vdso: Use generic union vdso_data_store")
riscv: eba755314fa7 ("riscv: vdso: Use generic union vdso_data_store")
ia64 is dead
nds32 is dead
hexagon has a bunch of vdso work in the logs as well.
There is also a6c19dfe3994 ("arm64,ia64,ppc,s390,sh,tile,um,x86,mm: remove default gate area")
I do not see sparc changing away from what the patches were doing, but
again, the arch_unmap() seems to do nothing there as well.
So, what I was looking to do is to avoid a call to an arch-specific
function that does nothing but set the vdso pointer to NULL for
powerpc.
The thread referenced in the git bug [1] seems to indicate this is for
CRIU unmapping/restoring a task, but CRIU now just moves the vdso
mapping (or just works on ppc at this point?). Since [2] hasn't landed,
isn't this still broken for CRIU on powerpc as it is?
So, are we keeping the arch_unmap() function around, which has errors
that were never fixed, for a single application that utilizes a newer
method of moving the vdso anyways?
On the note of CRIU, it seems it cannot handle tasks which don't have
the vdso mapped anymore [3], so setting it to NULL is probably a bad
plan even for that one application?
So, I think this just leaves the fallback when the VDSO is unmapped..
Well, it seems what people have been doing is to unmap the vdso to stop
these functions from working [4]. At least this is what some users are
doing. The ability to replace this vma with a guard vma leads me to
believe that other archs don't fall back at all - please correct me if
I'm wrong!
I also cannot find any reference to other archs clearing the
context.vdso (aside from failures in __setup_additional_pages).
But maybe I don't fully understand how this works?
Thanks,
Liam
[1] https://lore.kernel.org/lkml/87d0lht1c0.fsf@concordia.ellerman.id.au/
[2] https://lore.kernel.org/lkml/9c2b2826-4083-fc9c-5a4d-c101858dd560@linux.vnet.ibm.com/
[3] https://github.com/checkpoint-restore/criu/issues/488
[4] https://github.com/insanitybit/void-ship
Thanks,
Liam
* Re: [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs
2024-07-10 23:26 ` Liam R. Howlett
@ 2024-07-11 8:28 ` LEROY Christophe
2024-07-11 15:59 ` Liam R. Howlett
0 siblings, 1 reply; 38+ messages in thread
From: LEROY Christophe @ 2024-07-11 8:28 UTC (permalink / raw)
To: Liam R. Howlett, linux-mm@kvack.org, Andrew Morton,
Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar@oracle.com, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel@vger.kernel.org,
Kees Cook, Dave Hansen, linuxppc-dev@lists.ozlabs.org,
Dmitry Safonov, Michael Ellerman
Le 11/07/2024 à 01:26, Liam R. Howlett a écrit :
> * LEROY Christophe <christophe.leroy2@cs-soprasteria.com> [240710 17:02]:
>>
>>
>> Le 10/07/2024 à 21:22, Liam R. Howlett a écrit :
>>> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>>>
>>> The arch_unmap call was previously moved above the rbtree modifications
>>> in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap()
>>> corruption"). The move was motivated by an issue with calling
>>> arch_unmap() after the rbtree was modified.
>>>
>>> Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c
>>> ("x86/mpx: remove MPX from arch/x86"), so the motivation for calling
>>> arch_unmap() prior to modifying the vma tree no longer exists
>>> (regardless of rbtree or maple tree implementations).
>>>
>>> Furthermore, the powerpc implementation is also no longer needed as per
>>> [1] and [2]. So the arch_unmap() function can be completely removed.
>>
>> I'm not sure to understand. Is it replaced by something else ?
>> We wanted to get rid of arch_unmap() but it was supposed to be replaced
>> by some core function because the functionnality itself is still
>> required and indeed all the discussion around [2] demonstrated that not
>> only powerpc but at least arm and probably others needed to properly
>> clean-up reference to VDSO mappings on unmapping.
>>
>> So as mentioned by Michael you can't just drop that without replacing it
>> by something else. We need the VDSO signal handling to properly fallback
>> on stack-based trampoline when the VDSO trampoline gets mapped out.
>
> I'll address this after the part I missed..
After? What do you mean? It needs to be addressed _before_ removing
arch_unmap().
>
>>
>> Or did I miss something ?
>>
>
> I think I missed something in regards to what you need in ppc.
It is not only powerpc. Powerpc is the only one doing it at the moment
but investigation has demonstrated that other architectures are affected.
>
> From what I understand, other platforms still map and use the vdso
> (context.vdso is set), but arch_unmap() does nothing. It is only the
> powerpc version that clears the vdso pointer if it is unmapped.
Yes, on powerpc it works. On other platforms like arm it segfaults, so it
should be fixed
(https://lore.kernel.org/lkml/87imd5h5kb.fsf@mpe.ellerman.id.au/)
It could be fixed by properly implementing arch_unmap() on every arch, or
by carrying on with Dmitry's series.
>
> git grep -w arch_unmap shows:
> arch/powerpc/include/asm/mmu_context.h
> arch/x86/include/asm/mmu_context.h
> include/asm-generic/mm_hooks.h
> mm/mmap.c
>
> The generic and x86 versions are empty.
>
> From the patch set you referenced, we see changes related to the files
> modified, but I don't think any of them did anything with arch_unmap().
In the v3 series from Dmitry, [PATCH v3 16/23] mm: Add vdso_base in
mm_struct
(https://lore.kernel.org/all/20210611180242.711399-17-dima@arista.com/)
it is done via special_mapping_close()
>
> arm: a0d2fcd62ac2 ("vdso/ARM: Make union vdso_data_store available for all architectures")
> arm64: d0fba04847ae ("arm64: vdso: Use generic union vdso_data_store")
> mips: d697a9997a0d ("MIPS: vdso: Use generic union vdso_data_store")
> s390: cb3444cfdb48 ("s390/vdso: Use generic union vdso_data_store")
> riscv: eba755314fa7 ("riscv: vdso: Use generic union vdso_data_store")
>
> ia64 is dead
> nds32 is dead
> hexagon has a bunch of vdso work in the logs as well.
>
> There is also a6c19dfe3994 ("arm64,ia64,ppc,s390,sh,tile,um,x86,mm: remove default gate area")
>
> I do not see sparc changing away from what the patches were doing, but
> again, the arch_unmap() seems to do nothing there as well.
>
> So, what I was looking to do is to avoid a call to arch specific
> functions that does nothing but set the vdso pointer to NULL for
> powerpc.
That's what Dmitry's series does: it removes arch_unmap() and replaces
it with core handling. The advantage is that it addresses this for all
affected architectures, improving the current situation.
>
> The thread referenced in the git bug [1] seems to indicate this is for
> CRIU unmapping/restoring a task, but CRIU now just moves the vdso
> mapping (or just works on ppc at this point?). Since [2] hasn't landed,
> isn't this still broken for CRIU on powerpc as it is?
>
> So, are we keeping the arch_unmap() function around, which has errors
> that were never fixed, for a single application that utilizes a newer
> method of moving the vdso anyways?
Again, we want to remove arch_unmap() but we want the core-mm to handle
it instead.
>
> On the note of CRIU, it seems it cannot handle tasks which don't have
> the vdso mapped anymore [3], so setting it to NULL is probably a bad
> plan even for that one application?
But as mentioned by Dmitry, it is not only CRIU. There have also been
issues with Valgrind.
>
>
> So, I think this just leaves the fallback when the VDSO is unmapped..
> Well, it seems what people have been doing is unmap the vdso to stop
> these functions from working [4]. At least this is what some users are
> doing. The ability to replace this vma with a guard vma leads me to
> believe that other archs don't fall back at all - please correct me if
> I'm wrong!
>
> I also cannot find any reference to other archs clearing the
> context.vdso (aside from failures in __setup_additional_pages).
>
> But maybe I don't fully understand how this works?
I think you fully understand that it doesn't work as it is except on
powerpc. Again the goal should be to make it work on all architectures.
Thanks
Christophe
>
> Thanks,
> Liam
>
>
> [1] https://lore.kernel.org/lkml/87d0lht1c0.fsf@concordia.ellerman.id.au/
> [2] https://lore.kernel.org/lkml/9c2b2826-4083-fc9c-5a4d-c101858dd560@linux.vnet.ibm.com/
> [3] https://github.com/checkpoint-restore/criu/issues/488
> [4] https://github.com/insanitybit/void-ship
>
> Thanks,
> Liam
>
>
* Re: [PATCH v4 09/21] mm/mmap: Expand mmap_region() munmap call
2024-07-10 19:22 ` [PATCH v4 09/21] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
@ 2024-07-11 14:16 ` Lorenzo Stoakes
0 siblings, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2024-07-11 14:16 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Wed, Jul 10, 2024 at 03:22:38PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Open code the do_vmi_align_munmap() call so that it can be broken up
> later in the series.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/mmap.c | 26 ++++++++++++++++++++++----
> 1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 49b3ab406353..a1544a68558e 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2916,6 +2916,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> struct vm_area_struct *next, *prev, *merge;
> pgoff_t pglen = len >> PAGE_SHIFT;
> unsigned long charged = 0;
> + struct vma_munmap_struct vms;
> + struct ma_state mas_detach;
> + struct maple_tree mt_detach;
> unsigned long end = addr + len;
> unsigned long merge_start = addr, merge_end = end;
> bool writable_file_mapping = false;
> @@ -2948,10 +2951,27 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> /* Find the first overlapping VMA */
> vma = vma_find(&vmi, end);
> if (vma) {
> - /* Unmap any existing mapping in the area */
> - if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
> + mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> + mt_on_stack(mt_detach);
> + mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> + init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> + /* Prepare to unmap any existing mapping in the area */
> + if (vms_gather_munmap_vmas(&vms, &mas_detach))
> + return -ENOMEM;
> +
> + /* Remove any existing mappings from the vma tree */
> + if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> return -ENOMEM;
> +
> + /* Unmap any existing mapping in the area */
> + vms_complete_munmap_vmas(&vms, &mas_detach);
> + next = vms.next;
> + prev = vms.prev;
> + vma_prev(&vmi);
> vma = NULL;
> + } else {
> + next = vma_next(&vmi);
> + prev = vma_prev(&vmi);
> }
>
> /*
> @@ -2964,8 +2984,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> vm_flags |= VM_ACCOUNT;
> }
>
> - next = vma_next(&vmi);
> - prev = vma_prev(&vmi);
> if (vm_flags & VM_SPECIAL) {
> if (prev)
> vma_iter_next_range(&vmi);
> --
> 2.43.0
>
LGTM
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
* Re: [PATCH v4 10/21] mm/mmap: Support vma == NULL in init_vma_munmap()
2024-07-10 19:22 ` [PATCH v4 10/21] mm/mmap: Support vma == NULL in init_vma_munmap() Liam R. Howlett
@ 2024-07-11 14:28 ` Lorenzo Stoakes
2024-07-11 16:04 ` Liam R. Howlett
0 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2024-07-11 14:28 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Wed, Jul 10, 2024 at 03:22:39PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Adding support for a NULL vma means the init_vma_munmap() can be
> initialized for a less error-prone process when calling
> vms_complete_munmap_vmas() later on.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/mmap.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a1544a68558e..e2e6b3202c25 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -516,10 +516,12 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> bool unlock)
> {
> vms->vmi = vmi;
> - vms->vma = vma;
> - vms->mm = vma->vm_mm;
> - vms->start = start;
> - vms->end = end;
> + if (vma) {
> + vms->vma = vma;
> + vms->mm = vma->vm_mm;
> + vms->start = start;
> + vms->end = end;
> + }
Why not store start/end even if !vma? And shouldn't we have an else clause
to make sure these are initialised in this case too?
I mean we could also have vms->vma = vma outside of this clause so it
looks something like:
vms->vma = vma;
vms->mm = vma ? vma->vm_mm : NULL;
vms->start = start;
vms->end = end;
> vms->unlock = unlock;
> vms->uf = uf;
> vms->vma_count = 0;
> @@ -2950,11 +2952,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>
> /* Find the first overlapping VMA */
> vma = vma_find(&vmi, end);
> + init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> if (vma) {
> mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> mt_on_stack(mt_detach);
> mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> - init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> /* Prepare to unmap any existing mapping in the area */
> if (vms_gather_munmap_vmas(&vms, &mas_detach))
> return -ENOMEM;
> --
> 2.43.0
>
I really like this approach in general though!
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
2024-07-10 19:22 ` [PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
@ 2024-07-11 15:25 ` Lorenzo Stoakes
2024-07-11 16:07 ` Liam R. Howlett
2024-07-16 12:46 ` kernel test robot
1 sibling, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2024-07-11 15:25 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Wed, Jul 10, 2024 at 03:22:43PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Instead of zeroing the vma tree and then overwriting the area, let the
> area be overwritten and then clean up the gathered vmas using
> vms_complete_munmap_vmas().
>
> If a driver is mapping over an existing vma, then clear the ptes before
> the call_mmap() invocation. This is done using the vms_clear_ptes()
> helper.
>
> Temporarily keep track of the number of pages that will be removed and
> reduce the charged amount.
>
> This also drops the validate_mm() call in the vma_expand() function.
> It is necessary to drop the validate as it would fail since the mm
> map_count would be incorrect during a vma expansion, prior to the
> cleanup from vms_complete_munmap_vmas().
>
> Clean up the error handling of the vms_gather_munmap_vmas() by calling
> the verification within the function.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/internal.h | 1 +
> mm/mmap.c | 80 +++++++++++++++++++++++++++------------------------
> 2 files changed, 44 insertions(+), 37 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 11e90c6e5a3e..dd4eede1be0f 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1503,6 +1503,7 @@ struct vma_munmap_struct {
> unsigned long stack_vm;
> unsigned long data_vm;
> bool unlock; /* Unlock after the munmap */
> + bool clear_ptes; /* If there are outstanding PTE to be cleared */
> };
>
> void __meminit __init_single_page(struct page *page, unsigned long pfn,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 870c2d04ad6b..58cf42e22bfe 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -401,17 +401,21 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> }
>
> static unsigned long count_vma_pages_range(struct mm_struct *mm,
> - unsigned long addr, unsigned long end)
> + unsigned long addr, unsigned long end,
> + unsigned long *nr_accounted)
> {
> VMA_ITERATOR(vmi, mm, addr);
> struct vm_area_struct *vma;
> unsigned long nr_pages = 0;
>
> + *nr_accounted = 0;
> for_each_vma_range(vmi, vma, end) {
> unsigned long vm_start = max(addr, vma->vm_start);
> unsigned long vm_end = min(end, vma->vm_end);
>
> nr_pages += PHYS_PFN(vm_end - vm_start);
> + if (vma->vm_flags & VM_ACCOUNT)
> + *nr_accounted += PHYS_PFN(vm_end - vm_start);
> }
>
> return nr_pages;
> @@ -524,6 +528,7 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> vms->unmap_start = FIRST_USER_ADDRESS;
> vms->unmap_end = USER_PGTABLES_CEILING;
> + vms->clear_ptes = false; /* No PTEs to clear yet */
> }
>
> /*
> @@ -732,7 +737,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> vma_iter_store(vmi, vma);
>
> vma_complete(&vp, vmi, vma->vm_mm);
> - validate_mm(vma->vm_mm);
> return 0;
>
> nomem:
> @@ -2606,11 +2610,14 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> }
>
>
> -static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> +static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
> struct ma_state *mas_detach, bool mm_wr_locked)
> {
> struct mmu_gather tlb;
>
> + if (!vms->clear_ptes) /* Nothing to do */
> + return;
> +
> /*
> * We can free page tables without write-locking mmap_lock because VMAs
> * were isolated before we downgraded mmap_lock.
> @@ -2624,6 +2631,7 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> /* start and end may be different if there is no prev or next vma. */
> free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
> tlb_finish_mmu(&tlb);
> + vms->clear_ptes = false;
> }
>
> /*
> @@ -2647,7 +2655,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> if (vms->unlock)
> mmap_write_downgrade(mm);
>
> - vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
> + vms_clear_ptes(vms, mas_detach, !vms->unlock);
> /* Update high watermark before we lower total_vm */
> update_hiwater_vm(mm);
> /* Stat accounting */
> @@ -2799,6 +2807,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> while (vma_iter_addr(vms->vmi) > vms->start)
> vma_iter_prev_range(vms->vmi);
>
> + /* There are now PTEs that need to be cleared */
> + vms->clear_ptes = true;
> +
> return 0;
>
> userfaultfd_error:
> @@ -2807,6 +2818,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> abort_munmap_vmas(mas_detach);
> start_split_failed:
> map_count_exceeded:
> + validate_mm(vms->mm);
I'm guessing here we know it's safe to validate?
> return error;
> }
>
> @@ -2851,8 +2863,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>
> clear_tree_failed:
> abort_munmap_vmas(&mas_detach);
> -gather_failed:
> validate_mm(mm);
Additionally, I imagine a failure in the gathering means the tree can't be
validated?
> +gather_failed:
> return error;
> }
>
> @@ -2940,24 +2952,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> unsigned long merge_start = addr, merge_end = end;
> bool writable_file_mapping = false;
> pgoff_t vm_pgoff;
> - int error;
> + int error = -ENOMEM;
> VMA_ITERATOR(vmi, mm, addr);
> + unsigned long nr_pages, nr_accounted;
>
> - /* Check against address space limit. */
> - if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
> - unsigned long nr_pages;
> -
> - /*
> - * MAP_FIXED may remove pages of mappings that intersects with
> - * requested mapping. Account for the pages it would unmap.
> - */
> - nr_pages = count_vma_pages_range(mm, addr, end);
> -
> - if (!may_expand_vm(mm, vm_flags,
> - (len >> PAGE_SHIFT) - nr_pages))
> - return -ENOMEM;
> - }
> + nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
>
> + /*
> + * Check against address space limit.
> + * MAP_FIXED may remove pages of mappings that intersects with requested
> + * mapping. Account for the pages it would unmap.
> + */
> + if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
> + return -ENOMEM;
>
> if (unlikely(!can_modify_mm(mm, addr, end)))
> return -EPERM;
> @@ -2974,18 +2981,13 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> /* Prepare to unmap any existing mapping in the area */
> if (vms_gather_munmap_vmas(&vms, &mas_detach))
> - goto gather_failed;
> -
> - /* Remove any existing mappings from the vma tree */
> - if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> - goto clear_tree_failed;
> + return -ENOMEM;
>
> - /* Unmap any existing mapping in the area */
> - vms_complete_munmap_vmas(&vms, &mas_detach);
> next = vms.next;
> prev = vms.prev;
> vma = NULL;
> } else {
> + /* Minimal setup of vms */
Nit, but is this still valid now that we use the init function unconditionally?
> next = vma_next(&vmi);
> prev = vma_prev(&vmi);
> if (prev)
> @@ -2997,8 +2999,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> */
> if (accountable_mapping(file, vm_flags)) {
> charged = len >> PAGE_SHIFT;
> + charged -= nr_accounted;
> if (security_vm_enough_memory_mm(mm, charged))
> - return -ENOMEM;
> + goto abort_munmap;
> + vms.nr_accounted = 0;
> vm_flags |= VM_ACCOUNT;
> }
>
> @@ -3047,10 +3051,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> * not unmapped, but the maps are removed from the list.
> */
> vma = vm_area_alloc(mm);
> - if (!vma) {
> - error = -ENOMEM;
> + if (!vma)
> goto unacct_error;
> - }
>
> vma_iter_config(&vmi, addr, end);
> vma_set_range(vma, addr, end, pgoff);
> @@ -3059,6 +3061,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>
> if (file) {
> vma->vm_file = get_file(file);
> + /* call_mmap() may map PTE, so ensure there are no existing PTEs */
> + vms_clear_ptes(&vms, &mas_detach, true);
> error = call_mmap(file, vma);
> if (error)
> goto unmap_and_free_vma;
> @@ -3149,6 +3153,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> expanded:
> perf_event_mmap(vma);
>
> + /* Unmap any existing mapping in the area */
> + if (vms.nr_pages)
> + vms_complete_munmap_vmas(&vms, &mas_detach);
> +
> vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> if (vm_flags & VM_LOCKED) {
> if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> @@ -3196,14 +3204,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> unacct_error:
> if (charged)
> vm_unacct_memory(charged);
> - validate_mm(mm);
> - return error;
>
> -clear_tree_failed:
> - abort_munmap_vmas(&mas_detach);
> -gather_failed:
> +abort_munmap:
> + if (vms.nr_pages)
> + abort_munmap_vmas(&mas_detach);
> validate_mm(mm);
> - return -ENOMEM;
> + return error;
> }
>
> static int __vm_munmap(unsigned long start, size_t len, bool unlock)
> --
> 2.43.0
>
Other than nits/queries, LGTM:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v4 20/21] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-10 19:22 ` [PATCH v4 20/21] mm/mmap: Move may_expand_vm() check in mmap_region() Liam R. Howlett
@ 2024-07-11 15:38 ` Lorenzo Stoakes
0 siblings, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2024-07-11 15:38 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Wed, Jul 10, 2024 at 03:22:49PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> call, so use it instead of looping over the vmas twice.
This commit message doesn't explain the 'Move may_expand_vm()' part of this
change, which is the patch's subject. It should get a little blurb about that.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/mmap.c | 36 ++++--------------------------------
> 1 file changed, 4 insertions(+), 32 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 18c269bf8703..7a440e7da55a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -401,27 +401,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
> }
>
> -static unsigned long count_vma_pages_range(struct mm_struct *mm,
> - unsigned long addr, unsigned long end,
> - unsigned long *nr_accounted)
> -{
> - VMA_ITERATOR(vmi, mm, addr);
> - struct vm_area_struct *vma;
> - unsigned long nr_pages = 0;
> -
> - *nr_accounted = 0;
> - for_each_vma_range(vmi, vma, end) {
> - unsigned long vm_start = max(addr, vma->vm_start);
> - unsigned long vm_end = min(end, vma->vm_end);
> -
> - nr_pages += PHYS_PFN(vm_end - vm_start);
> - if (vma->vm_flags & VM_ACCOUNT)
> - *nr_accounted += PHYS_PFN(vm_end - vm_start);
> - }
> -
> - return nr_pages;
> -}
> -
> static void __vma_link_file(struct vm_area_struct *vma,
> struct address_space *mapping)
> {
> @@ -2949,17 +2928,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> pgoff_t vm_pgoff;
> int error = -ENOMEM;
> VMA_ITERATOR(vmi, mm, addr);
> - unsigned long nr_pages, nr_accounted;
> -
> - nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> -
> - /*
> - * Check against address space limit.
> - * MAP_FIXED may remove pages of mappings that intersects with requested
> - * mapping. Account for the pages it would unmap.
> - */
> - if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> - return -ENOMEM;
>
>
> /* Find the first overlapping VMA */
> @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> vma_iter_next_range(&vmi);
> }
>
> + /* Check against address space limit. */
> + if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> + goto abort_munmap;
> +
> /*
> * Private writable mapping: check memory availability
> */
> --
> 2.43.0
>
This is fine as discussed previously, though we obviously need to think
about the arch_unmap() bit :)
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v4 21/21] mm/mmap: Drop incorrect comment from vms_gather_munmap_vmas()
2024-07-10 19:22 ` [PATCH v4 21/21] mm/mmap: Drop incorrect comment from vms_gather_munmap_vmas() Liam R. Howlett
@ 2024-07-11 15:39 ` Lorenzo Stoakes
0 siblings, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2024-07-11 15:39 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Wed, Jul 10, 2024 at 03:22:50PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> The comment has been outdated since 6b73cff239e52 ("mm: change munmap
> splitting order and move_vma()"). The move_vma() was altered to fix the
> fragile state of the accounting since then.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/mmap.c | 7 +------
> 1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 7a440e7da55a..ca3872e5fbd8 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2675,13 +2675,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
>
> /*
> * If we need to split any vma, do it now to save pain later.
> - *
> - * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
> - * unmapped vm_area_struct will remain in use: so lower split_vma
> - * places tmp vma above, and higher split_vma places tmp vma below.
> + * Does it split the first one?
> */
> -
> - /* Does it split the first one? */
> if (vms->start > vms->vma->vm_start) {
>
> /*
> --
> 2.43.0
>
Looks good to me,
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs
2024-07-11 8:28 ` LEROY Christophe
@ 2024-07-11 15:59 ` Liam R. Howlett
0 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-11 15:59 UTC (permalink / raw)
To: LEROY Christophe
Cc: linux-mm@kvack.org, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox,
sidhartha.kumar@oracle.com, Paul E . McKenney, Bert Karwatzki,
Jiri Olsa, linux-kernel@vger.kernel.org, Kees Cook, Dave Hansen,
linuxppc-dev@lists.ozlabs.org, Dmitry Safonov, Michael Ellerman
* LEROY Christophe <christophe.leroy2@cs-soprasteria.com> [240711 04:28]:
>
>
> Le 11/07/2024 à 01:26, Liam R. Howlett a écrit :
> > * LEROY Christophe <christophe.leroy2@cs-soprasteria.com> [240710 17:02]:
> >>
> >>
> >> Le 10/07/2024 à 21:22, Liam R. Howlett a écrit :
> >>> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >>>
> >>> The arch_unmap call was previously moved above the rbtree modifications
> >>> in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap()
> >>> corruption"). The move was motivated by an issue with calling
> >>> arch_unmap() after the rbtree was modified.
> >>>
> >>> Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c
> >>> ("x86/mpx: remove MPX from arch/x86"), so the motivation for calling
> >>> arch_unmap() prior to modifying the vma tree no longer exists
> >>> (regardless of rbtree or maple tree implementations).
> >>>
> >>> Furthermore, the powerpc implementation is also no longer needed as per
> >>> [1] and [2]. So the arch_unmap() function can be completely removed.
> >>
> >> I'm not sure I understand. Is it replaced by something else?
> >> We wanted to get rid of arch_unmap(), but it was supposed to be replaced
> >> by some core function because the functionality itself is still
> >> required, and indeed all the discussion around [2] demonstrated that not
> >> only powerpc but at least arm and probably others needed to properly
> >> clean up references to VDSO mappings on unmapping.
> >>
> >> So, as mentioned by Michael, you can't just drop that without replacing it
> >> with something else. We need the VDSO signal handling to properly fall back
> >> to the stack-based trampoline when the VDSO trampoline gets unmapped.
> >
> > I'll address this after the part I missed..
>
> After? What do you mean? It needs to be addressed _before_ removing
> arch_unmap().
After the later comments in this email, sorry that wasn't clear.
>
> >
> >>
> >> Or did I miss something?
> >>
> >
> > I think I missed something in regards to what you need in ppc.
>
> It is not only powerpc. Powerpc is the only one doing it at the moment
> but investigation has demonstrated that other architectures are affected.
>
> >
> > From what I understand, other platforms still map and use the vdso
> > (context.vdso is set), but unmap_arch() does nothing. It is only the
> > powerpc version that clears the vdso pointer if it is unmapped.
>
> Yes on powerpc it works. On other platforms like arm it segfaults so it
> should be fixed
> (https://lore.kernel.org/lkml/87imd5h5kb.fsf@mpe.ellerman.id.au/)
>
> Could be fixed by properly implementing arch_unmap() on every arch, or
> carry-on with Dmitry's series.
Okay, I understand what you are saying now. I'm not going to tackle
that change within this series, so in v5 I'll just relocate the arch_unmap()
call back to where it was, after the removal of the vmas.
> I think you fully understand that it doesn't work as it is except on
> powerpc. Again the goal should be to make it work on all architectures.
Got it, thanks for clarifying.
Regards,
Liam
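For reference, the powerpc arch_unmap() under discussion is a small hook that
drops the cached VDSO pointer when the unmapped range covers it, so the signal
code can fall back to the stack-based trampoline. A rough sketch of its shape
(simplified; details may differ from the tree this series is based on):

static inline void arch_unmap(struct mm_struct *mm,
			      unsigned long start, unsigned long end)
{
	unsigned long vdso_base = (unsigned long)mm->context.vdso;

	/* Forget the VDSO if the unmapped range covers its base */
	if (start <= vdso_base && vdso_base < end)
		mm->context.vdso = NULL;
}

Other architectures map a VDSO as well, but their arch_unmap() is a no-op,
which is why removing the hook alone leaves their signal-return fallback
unaddressed.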
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v4 10/21] mm/mmap: Support vma == NULL in init_vma_munmap()
2024-07-11 14:28 ` Lorenzo Stoakes
@ 2024-07-11 16:04 ` Liam R. Howlett
0 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-11 16:04 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240711 10:28]:
> On Wed, Jul 10, 2024 at 03:22:39PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Adding support for a NULL vma means init_vma_munmap() can always be
> > used to initialize the vma_munmap_struct, which makes the later call to
> > vms_complete_munmap_vmas() less error-prone.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/mmap.c | 12 +++++++-----
> > 1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index a1544a68558e..e2e6b3202c25 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -516,10 +516,12 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> > bool unlock)
> > {
> > vms->vmi = vmi;
> > - vms->vma = vma;
> > - vms->mm = vma->vm_mm;
> > - vms->start = start;
> > - vms->end = end;
> > + if (vma) {
> > + vms->vma = vma;
> > + vms->mm = vma->vm_mm;
> > + vms->start = start;
> > + vms->end = end;
> > + }
>
> Why not store start/end even if !vma? And shouldn't we have an else clause
> to make sure these are initialised in this case too?
>
> We could also pull vms->vma = vma out of this clause so it looks
> something like:
>
> vms->vma = vma;
> vms->mm = vma ? vma->vm_mm : NULL;
> vms->start = start;
> vms->end = end;
I'd rather not set the start/end as it implies there is a start/end
of an unmap operation that won't happen. I'll just make it an else and
set them to 0.
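A minimal sketch of that shape (hypothetical, not the actual v5 hunk), with
the else branch zeroing the fields:

	if (vma) {
		vms->vma = vma;
		vms->mm = vma->vm_mm;
		vms->start = start;
		vms->end = end;
	} else {
		/* No existing mappings to munmap */
		vms->vma = NULL;
		vms->mm = NULL;
		vms->start = vms->end = 0;
	}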
>
> > vms->unlock = unlock;
> > vms->uf = uf;
> > vms->vma_count = 0;
> > @@ -2950,11 +2952,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >
> > /* Find the first overlapping VMA */
> > vma = vma_find(&vmi, end);
> > + init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > if (vma) {
> > mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > mt_on_stack(mt_detach);
> > mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> > - init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > /* Prepare to unmap any existing mapping in the area */
> > if (vms_gather_munmap_vmas(&vms, &mas_detach))
> > return -ENOMEM;
> > --
> > 2.43.0
> >
>
> I really like this approach in general though!
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
2024-07-11 15:25 ` Lorenzo Stoakes
@ 2024-07-11 16:07 ` Liam R. Howlett
0 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-11 16:07 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240711 11:25]:
> On Wed, Jul 10, 2024 at 03:22:43PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Instead of zeroing the vma tree and then overwriting the area, let the
> > area be overwritten and then clean up the gathered vmas using
> > vms_complete_munmap_vmas().
> >
> > If a driver is mapping over an existing vma, then clear the ptes before
> > the call_mmap() invocation. This is done using the vms_clear_ptes()
> > helper.
> >
> > Temporarily keep track of the number of pages that will be removed and
> > reduce the charged amount.
> >
> > This also drops the validate_mm() call in the vma_expand() function.
> > It is necessary to drop the validate as it would fail since the mm
> > map_count would be incorrect during a vma expansion, prior to the
> > cleanup from vms_complete_munmap_vmas().
> >
> > Clean up the error handling of the vms_gather_munmap_vmas() by calling
> > the verification within the function.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/internal.h | 1 +
> > mm/mmap.c | 80 +++++++++++++++++++++++++++------------------------
> > 2 files changed, 44 insertions(+), 37 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 11e90c6e5a3e..dd4eede1be0f 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1503,6 +1503,7 @@ struct vma_munmap_struct {
> > unsigned long stack_vm;
> > unsigned long data_vm;
> > bool unlock; /* Unlock after the munmap */
> > + bool clear_ptes; /* If there are outstanding PTE to be cleared */
> > };
> >
> > void __meminit __init_single_page(struct page *page, unsigned long pfn,
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 870c2d04ad6b..58cf42e22bfe 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -401,17 +401,21 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> > }
> >
> > static unsigned long count_vma_pages_range(struct mm_struct *mm,
> > - unsigned long addr, unsigned long end)
> > + unsigned long addr, unsigned long end,
> > + unsigned long *nr_accounted)
> > {
> > VMA_ITERATOR(vmi, mm, addr);
> > struct vm_area_struct *vma;
> > unsigned long nr_pages = 0;
> >
> > + *nr_accounted = 0;
> > for_each_vma_range(vmi, vma, end) {
> > unsigned long vm_start = max(addr, vma->vm_start);
> > unsigned long vm_end = min(end, vma->vm_end);
> >
> > nr_pages += PHYS_PFN(vm_end - vm_start);
> > + if (vma->vm_flags & VM_ACCOUNT)
> > + *nr_accounted += PHYS_PFN(vm_end - vm_start);
> > }
> >
> > return nr_pages;
> > @@ -524,6 +528,7 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> > vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> > vms->unmap_start = FIRST_USER_ADDRESS;
> > vms->unmap_end = USER_PGTABLES_CEILING;
> > + vms->clear_ptes = false; /* No PTEs to clear yet */
> > }
> >
> > /*
> > @@ -732,7 +737,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > vma_iter_store(vmi, vma);
> >
> > vma_complete(&vp, vmi, vma->vm_mm);
> > - validate_mm(vma->vm_mm);
> > return 0;
> >
> > nomem:
> > @@ -2606,11 +2610,14 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> > }
> >
> >
> > -static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> > +static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
> > struct ma_state *mas_detach, bool mm_wr_locked)
> > {
> > struct mmu_gather tlb;
> >
> > + if (!vms->clear_ptes) /* Nothing to do */
> > + return;
> > +
> > /*
> > * We can free page tables without write-locking mmap_lock because VMAs
> > * were isolated before we downgraded mmap_lock.
> > @@ -2624,6 +2631,7 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> > /* start and end may be different if there is no prev or next vma. */
> > free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
> > tlb_finish_mmu(&tlb);
> > + vms->clear_ptes = false;
> > }
> >
> > /*
> > @@ -2647,7 +2655,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> > if (vms->unlock)
> > mmap_write_downgrade(mm);
> >
> > - vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
> > + vms_clear_ptes(vms, mas_detach, !vms->unlock);
> > /* Update high watermark before we lower total_vm */
> > update_hiwater_vm(mm);
> > /* Stat accounting */
> > @@ -2799,6 +2807,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > while (vma_iter_addr(vms->vmi) > vms->start)
> > vma_iter_prev_range(vms->vmi);
> >
> > + /* There are now PTEs that need to be cleared */
> > + vms->clear_ptes = true;
> > +
> > return 0;
> >
> > userfaultfd_error:
> > @@ -2807,6 +2818,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > abort_munmap_vmas(mas_detach);
> > start_split_failed:
> > map_count_exceeded:
> > + validate_mm(vms->mm);
>
> I'm guessing here we know it's safe to validate?
verification in the gather state is always safe - we haven't changed the
tree or a vma yet.
>
> > return error;
> > }
> >
> > @@ -2851,8 +2863,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >
> > clear_tree_failed:
> > abort_munmap_vmas(&mas_detach);
> > -gather_failed:
> > validate_mm(mm);
>
> Additionally, I imagine a failure in the gathering means the tree can't be
> validated?
It is safe, but if it's here then it doesn't need to be above.
>
> > +gather_failed:
> > return error;
> > }
> >
> > @@ -2940,24 +2952,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > unsigned long merge_start = addr, merge_end = end;
> > bool writable_file_mapping = false;
> > pgoff_t vm_pgoff;
> > - int error;
> > + int error = -ENOMEM;
> > VMA_ITERATOR(vmi, mm, addr);
> > + unsigned long nr_pages, nr_accounted;
> >
> > - /* Check against address space limit. */
> > - if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
> > - unsigned long nr_pages;
> > -
> > - /*
> > - * MAP_FIXED may remove pages of mappings that intersects with
> > - * requested mapping. Account for the pages it would unmap.
> > - */
> > - nr_pages = count_vma_pages_range(mm, addr, end);
> > -
> > - if (!may_expand_vm(mm, vm_flags,
> > - (len >> PAGE_SHIFT) - nr_pages))
> > - return -ENOMEM;
> > - }
> > + nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> >
> > + /*
> > + * Check against address space limit.
> > + * MAP_FIXED may remove pages of mappings that intersects with requested
> > + * mapping. Account for the pages it would unmap.
> > + */
> > + if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
> > + return -ENOMEM;
> >
> > if (unlikely(!can_modify_mm(mm, addr, end)))
> > return -EPERM;
> > @@ -2974,18 +2981,13 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> > /* Prepare to unmap any existing mapping in the area */
> > if (vms_gather_munmap_vmas(&vms, &mas_detach))
> > - goto gather_failed;
> > -
> > - /* Remove any existing mappings from the vma tree */
> > - if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> > - goto clear_tree_failed;
> > + return -ENOMEM;
> >
> > - /* Unmap any existing mapping in the area */
> > - vms_complete_munmap_vmas(&vms, &mas_detach);
> > next = vms.next;
> > prev = vms.prev;
> > vma = NULL;
> > } else {
> > + /* Minimal setup of vms */
>
> Nit, but is this still valid now that we use the init function unconditionally?
Yes, that needs to be dropped, thanks.
>
> > next = vma_next(&vmi);
> > prev = vma_prev(&vmi);
> > if (prev)
> > @@ -2997,8 +2999,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > */
> > if (accountable_mapping(file, vm_flags)) {
> > charged = len >> PAGE_SHIFT;
> > + charged -= nr_accounted;
> > if (security_vm_enough_memory_mm(mm, charged))
> > - return -ENOMEM;
> > + goto abort_munmap;
> > + vms.nr_accounted = 0;
> > vm_flags |= VM_ACCOUNT;
> > }
> >
> > @@ -3047,10 +3051,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > * not unmapped, but the maps are removed from the list.
> > */
> > vma = vm_area_alloc(mm);
> > - if (!vma) {
> > - error = -ENOMEM;
> > + if (!vma)
> > goto unacct_error;
> > - }
> >
> > vma_iter_config(&vmi, addr, end);
> > vma_set_range(vma, addr, end, pgoff);
> > @@ -3059,6 +3061,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >
> > if (file) {
> > vma->vm_file = get_file(file);
> > + /* call_mmap() may map PTE, so ensure there are no existing PTEs */
> > + vms_clear_ptes(&vms, &mas_detach, true);
> > error = call_mmap(file, vma);
> > if (error)
> > goto unmap_and_free_vma;
> > @@ -3149,6 +3153,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > expanded:
> > perf_event_mmap(vma);
> >
> > + /* Unmap any existing mapping in the area */
> > + if (vms.nr_pages)
> > + vms_complete_munmap_vmas(&vms, &mas_detach);
> > +
> > vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> > if (vm_flags & VM_LOCKED) {
> > if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> > @@ -3196,14 +3204,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > unacct_error:
> > if (charged)
> > vm_unacct_memory(charged);
> > - validate_mm(mm);
> > - return error;
> >
> > -clear_tree_failed:
> > - abort_munmap_vmas(&mas_detach);
> > -gather_failed:
> > +abort_munmap:
> > + if (vms.nr_pages)
> > + abort_munmap_vmas(&mas_detach);
> > validate_mm(mm);
> > - return -ENOMEM;
> > + return error;
> > }
> >
> > static int __vm_munmap(unsigned long start, size_t len, bool unlock)
> > --
> > 2.43.0
> >
>
> Other than nits/queries, LGTM:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
2024-07-10 19:22 ` [PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
2024-07-11 15:25 ` Lorenzo Stoakes
@ 2024-07-16 12:46 ` kernel test robot
2024-07-17 17:42 ` Liam R. Howlett
1 sibling, 1 reply; 38+ messages in thread
From: kernel test robot @ 2024-07-16 12:46 UTC (permalink / raw)
To: Liam R. Howlett
Cc: oe-lkp, lkp, linux-mm, ltp, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook, Liam R. Howlett, oliver.sang
Hello,
kernel test robot noticed "ltp.hugemmap06.fail" on:
commit: d793398401db9fb81084bd4fe2f782342201df18 ("[PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()")
url: https://github.com/intel-lab-lkp/linux/commits/Liam-R-Howlett/mm-mmap-Correctly-position-vma_iterator-in-__split_vma/20240711-075019
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20240710192250.4114783-15-Liam.Howlett@oracle.com/
patch subject: [PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
in testcase: ltp
version: ltp-x86_64-14c1f76-1_20240706
with following parameters:
test: hugetlb/hugemmap06
compiler: gcc-13
test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (Ivy Bridge) with 16G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202407162022.5a730c37-oliver.sang@intel.com
Running tests.......
<<<test_start>>>
tag=hugemmap06 stime=1721029963
cmdline="hugemmap06"
contacts=""
analysis=exit
<<<test_output>>>
tst_hugepage.c:84: TINFO: 255 hugepage(s) reserved
tst_test.c:1803: TINFO: LTP version: 20240524-71-g361f6ad13
tst_test.c:1647: TINFO: Timeout per run is 0h 00m 30s
hugemmap06.c:114: TPASS: No regression found
hugemmap06.c:114: TPASS: No regression found
hugemmap06.c:114: TPASS: No regression found
hugemmap06.c:114: TPASS: No regression found
hugemmap06.c:100: TFAIL: mmap failed: ENOMEM (12)
HINT: You _MAY_ be missing kernel fixes:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f522c3ac00a4
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9119a41e9091
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7b24d8616be3
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1406ec9ba6c6
Summary:
passed 4
failed 1
broken 0
skipped 0
warnings 0
incrementing stop
<<<execution_status>>>
initiation_status="ok"
duration=10 termination_type=exited termination_id=1 corefile=no
cutime=2 cstime=629
<<<test_end>>>
INFO: ltp-pan reported some tests FAIL
LTP Version: 20240524-71-g361f6ad13
###############################################################
Done executing testcases.
LTP Version: 20240524-71-g361f6ad13
###############################################################
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240716/202407162022.5a730c37-oliver.sang@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v4 18/21] mm/mmap: Move can_modify_mm() check down the stack
2024-07-10 19:22 ` [PATCH v4 18/21] mm/mmap: Move can_modify_mm() check down the stack Liam R. Howlett
@ 2024-07-17 5:03 ` Jeff Xu
2024-07-17 14:07 ` Liam R. Howlett
0 siblings, 1 reply; 38+ messages in thread
From: Jeff Xu @ 2024-07-17 5:03 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
Hi
On Wed, Jul 10, 2024 at 12:23 PM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Without an arch_unmap() call anymore,
Is there another patch that removes arch_unmap()?
Can you please post the link for the patch?
Thanks
-Jeff
> the check for mseal'ed vmas can be
> moved lower as well. This has the benefit of only actually checking if
> things are msealed when there is anything to check. That is, we know
> there is at least one vma that is in the way and needs to be checked.
>
> Only call the can_modify_mm() in do_vmi_align_munmap() and the MAP_FIXED
> case of mmap_region().
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Cc: Jeff Xu <jeffxu@chromium.org>
> ---
> mm/mmap.c | 24 ++++++++----------------
> 1 file changed, 8 insertions(+), 16 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index df565f51971d..c343366b3ad2 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2849,6 +2849,10 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> struct vma_munmap_struct vms;
> int error;
>
> + /* Prevent unmapping a sealed VMA. */
> + if (unlikely(!can_modify_mm(mm, start, end)))
> + return -EPERM;
> +
> init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
> error = vms_gather_munmap_vmas(&vms, &mas_detach);
> if (error)
> @@ -2899,13 +2903,6 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
> if (end == start)
> return -EINVAL;
>
> - /*
> - * Prevent unmapping a sealed VMA.
> - * can_modify_mm assumes we have acquired the lock on MM.
> - */
> - if (unlikely(!can_modify_mm(mm, start, end)))
> - return -EPERM;
> -
> /* Find the first overlapping VMA */
> vma = vma_find(vmi, end);
> if (!vma) {
> @@ -2963,13 +2960,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> return -ENOMEM;
>
> - if (unlikely(!can_modify_mm(mm, addr, end)))
> - return -EPERM;
>
> /* Find the first overlapping VMA */
> vma = vma_find(&vmi, end);
> init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> if (vma) {
> + /* Prevent unmapping a sealed VMA. */
> + if (unlikely(!can_modify_mm(mm, addr, end)))
> + return -EPERM;
> +
> mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> mt_on_stack(mt_detach);
> mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> @@ -3341,13 +3340,6 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> {
> struct mm_struct *mm = vma->vm_mm;
>
> - /*
> - * Prevent unmapping a sealed VMA.
> - * can_modify_mm assumes we have acquired the lock on MM.
> - */
> - if (unlikely(!can_modify_mm(mm, start, end)))
> - return -EPERM;
> -
> return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
> }
>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v4 18/21] mm/mmap: Move can_modify_mm() check down the stack
2024-07-17 5:03 ` Jeff Xu
@ 2024-07-17 14:07 ` Liam R. Howlett
0 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-17 14:07 UTC (permalink / raw)
To: Jeff Xu
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Jeff Xu <jeffxu@chromium.org> [240717 01:03]:
> Hi
>
> On Wed, Jul 10, 2024 at 12:23 PM Liam R. Howlett
> <Liam.Howlett@oracle.com> wrote:
> >
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Without an arch_unmap() call anymore,
> Is there another patch that removes arch_unmap() ?
> Can you please post the link for the patch ?
>
Thanks for looking at this patch.
The patch to remove arch_unmap() cannot be used as powerpc needs a
replacement. I will be moving the arch_unmap() call to later in the unmap
process, as it was before mpx moved it (mpx has since been dropped from the
kernel). I will add you to the Cc for the whole series next time.
Thanks,
Liam
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
2024-07-16 12:46 ` kernel test robot
@ 2024-07-17 17:42 ` Liam R. Howlett
0 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2024-07-17 17:42 UTC (permalink / raw)
To: kernel test robot
Cc: oe-lkp, lkp, linux-mm, ltp, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* kernel test robot <oliver.sang@intel.com> [240716 08:47]:
>
>
> Hello,
>
> kernel test robot noticed "ltp.hugemmap06.fail" on:
Hello Robot!
Thank you for finding this; it will certainly help me improve the next
revision of my series!
>
> commit: d793398401db9fb81084bd4fe2f782342201df18 ("[PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()")
> url: https://github.com/intel-lab-lkp/linux/commits/Liam-R-Howlett/mm-mmap-Correctly-position-vma_iterator-in-__split_vma/20240711-075019
> base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/all/20240710192250.4114783-15-Liam.Howlett@oracle.com/
> patch subject: [PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
>
> in testcase: ltp
> version: ltp-x86_64-14c1f76-1_20240706
> with following parameters:
>
> test: hugetlb/hugemmap06
>
>
>
This is because the MAP_FIXED huge page mapping is now set up before
hugetlb_vm_op_close() is called, and it is that call which removes the
reserved huge pages. I will address this in v5.
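The sequence behind the ENOMEM, as described above (an illustrative reading of
the report, not code from the series):

/*
 * mmap(MAP_FIXED) of a huge page over an existing hugetlb mapping:
 *   vms_gather_munmap_vmas()     - old hugetlb vma detached, vm_ops->close()
 *                                  not yet called
 *   call_mmap()                  - new mapping tries to reserve huge pages
 *                                  while the old vma still holds its
 *                                  reservation -> -ENOMEM
 *   vms_complete_munmap_vmas()   - hugetlb_vm_op_close() releases the old
 *                                  reservation, but only after the failure
 */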
^ permalink raw reply [flat|nested] 38+ messages in thread
Thread overview: 38+ messages
2024-07-10 19:22 [PATCH v4 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 01/21] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 02/21] mm/mmap: Introduce abort_munmap_vmas() Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 03/21] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 04/21] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 05/21] mm/mmap: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 06/21] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 07/21] mm/mmap: Extract validate_mm() from vma_complete() Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 08/21] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 09/21] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
2024-07-11 14:16 ` Lorenzo Stoakes
2024-07-10 19:22 ` [PATCH v4 10/21] mm/mmap: Support vma == NULL in init_vma_munmap() Liam R. Howlett
2024-07-11 14:28 ` Lorenzo Stoakes
2024-07-11 16:04 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 11/21] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 12/21] mm/mmap: Track start and end of munmap in vma_munmap_struct Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 13/21] mm/mmap: Clean up unmap_region() argument list Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
2024-07-11 15:25 ` Lorenzo Stoakes
2024-07-11 16:07 ` Liam R. Howlett
2024-07-16 12:46 ` kernel test robot
2024-07-17 17:42 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 15/21] mm/mmap: Use PHYS_PFN " Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 16/21] mm/mmap: Use vms accounted pages " Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs Liam R. Howlett
2024-07-10 19:27 ` Dave Hansen
2024-07-10 21:02 ` LEROY Christophe
2024-07-10 23:26 ` Liam R. Howlett
2024-07-11 8:28 ` LEROY Christophe
2024-07-11 15:59 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 18/21] mm/mmap: Move can_modify_mm() check down the stack Liam R. Howlett
2024-07-17 5:03 ` Jeff Xu
2024-07-17 14:07 ` Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 19/21] ipc/shm, mm: Drop do_vma_munmap() Liam R. Howlett
2024-07-10 19:22 ` [PATCH v4 20/21] mm/mmap: Move may_expand_vm() check in mmap_region() Liam R. Howlett
2024-07-11 15:38 ` Lorenzo Stoakes
2024-07-10 19:22 ` [PATCH v4 21/21] mm/mmap: Drop incorrect comment from vms_gather_munmap_vmas() Liam R. Howlett
2024-07-11 15:39 ` Lorenzo Stoakes