* [PATCH v3 0/4] move all VMA allocation, freeing and duplication logic to mm
@ 2025-04-28 15:28 Lorenzo Stoakes
  2025-04-28 15:28 ` [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality Lorenzo Stoakes
                   ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 15:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
      David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner,
      Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel

Currently VMA allocation, freeing and duplication exist in kernel/fork.c,
which is a violation of separation of concerns and leaves these functions
exposed to the rest of the kernel when they are in fact internal
implementation details.

Resolve this by moving this logic to mm and making it internal to vma.c and
vma.h. This also allows us, in future, to provide userland testing around
this functionality.

We additionally abstract dup_mmap() to mm, being careful to ensure that
kernel/fork.c accesses it via the mm internal header so it is not exposed
elsewhere in the kernel.

As part of this change, we also abstract the initial stack allocation
performed in __bprm_mm_init() out of fs code into mm via
create_init_stack_vma(), as this code uses vm_area_alloc() and
vm_area_free().

In order to do so sensibly, we introduce a new mm/vma_exec.c file, which
contains the code that is shared by mm and exec. This file is added to both
the memory mapping and exec sections in MAINTAINERS so both sets of
maintainers can maintain oversight.

As part of this change, we also move relocate_vma_down() to mm/vma_exec.c so
all shared mm/exec functionality is kept in one place.

We add code shared between nommu and mmu-enabled configurations in order to
share VMA allocation, freeing and duplication code correctly, while also
keeping these functions available in userland VMA testing. This is achieved
by adding a mm/vma_init.c file which is also compiled by the userland tests.

v3:
* Establish mm/vma_exec.c for shared exec/mm VMA logic, as per Kees.
* Add this file to both the exec and mm MAINTAINERS sections so correct
  oversight is provided.
* Add a patch to move relocate_vma_down() to the new mm/vma_exec.c file.
* Move the create_init_stack_vma() function to mm/vma_exec.c also.
* Take the opportunity to also move insert_vm_struct() to mm/vma.c since this
  is no longer needed outside of mm.
* Fix up the VMA userland tests to account for the new additions, and extend
  the userland test build (as well as the kernel build) to account for
  mm/vma_exec.c.
* Remove __bprm_mm_init() and open-code it, as we are simply calling a
  function, as per Kees.

v2:
* Moved the vma init, alloc, free and dup functions to the newly created
  vma_init.c file, as per Suren and Liam.
* Added a MAINTAINERS entry for vma_init.c, and added it to the Makefile.
* Updated the mmap_init() comment.
* Propagated tags (thanks everyone!)
* Added a detach_free_vma() helper and correctly detached vmas in the
  userland VMA test code.
* Updated the userland test code to also compile the vma_init.c file.
* Corrected the create_init_stack_vma() comment, as per Suren.
* Updated the commit message, as per Suren.
https://lore.kernel.org/all/cover.1745592303.git.lorenzo.stoakes@oracle.com/

v1:
https://lore.kernel.org/all/cover.1745528282.git.lorenzo.stoakes@oracle.com/

*** BLURB HERE ***

Lorenzo Stoakes (4):
  mm: establish mm/vma_exec.c for shared exec/mm VMA functionality
  mm: abstract initial stack setup to mm subsystem
  mm: move dup_mmap() to mm
  mm: perform VMA allocation, freeing, duplication in mm

 MAINTAINERS                      |   3 +
 fs/exec.c                        |  69 +------
 include/linux/mm.h               |   1 -
 kernel/fork.c                    | 277 +--------------------------
 mm/Makefile                      |   4 +-
 mm/internal.h                    |   2 +
 mm/mmap.c                        | 309 ++++++++++++++++++-------------
 mm/nommu.c                       |  12 +-
 mm/vma.c                         |  43 +++++
 mm/vma.h                         |  16 ++
 mm/vma_exec.c                    | 161 ++++++++++++++++
 mm/vma_init.c                    | 101 ++++++++++
 tools/testing/vma/Makefile       |   2 +-
 tools/testing/vma/vma.c          |  27 ++-
 tools/testing/vma/vma_internal.h | 215 ++++++++++++++++++---
 15 files changed, 737 insertions(+), 505 deletions(-)
 create mode 100644 mm/vma_exec.c
 create mode 100644 mm/vma_init.c

--
2.49.0

^ permalink raw reply	[flat|nested] 36+ messages in thread
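To make the shape of the exec-side change concrete: after the "mm: abstract
initial stack setup to mm subsystem" patch, fs/exec.c no longer builds the
temporary stack VMA itself but simply calls into mm. A minimal sketch of the
resulting call site in bprm_mm_init() follows; the exact signature of
create_init_stack_vma() shown here is an assumption inferred from the cover
letter, not a quote of the patch:

	struct vm_area_struct *vma = NULL;
	int err;

	/* Hypothetical signature: mm allocates, sizes and inserts the
	 * temporary stack VMA, reporting the initial stack top back so
	 * exec can record it in bprm->p. */
	err = create_init_stack_vma(bprm->mm, &vma, &bprm->p);
	if (err)
		return err;
	bprm->vma = vma;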
* [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality 2025-04-28 15:28 [PATCH v3 0/4] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes @ 2025-04-28 15:28 ` Lorenzo Stoakes 2025-04-28 19:12 ` Liam R. Howlett ` (4 more replies) 2025-04-28 15:28 ` [PATCH v3 2/4] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes ` (3 subsequent siblings) 4 siblings, 5 replies; 36+ messages in thread From: Lorenzo Stoakes @ 2025-04-28 15:28 UTC (permalink / raw) To: Andrew Morton Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel There is functionality that overlaps the exec and memory mapping subsystems. While it properly belongs in mm, it is important that exec maintainers maintain oversight of this functionality correctly. We can establish both goals by adding a new mm/vma_exec.c file which contains these 'glue' functions, and have fs/exec.c import them. As a part of this change, to ensure that proper oversight is achieved, add the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. scripts/get_maintainer.pl can correctly handle files in multiple entries and this neatly handles the cross-over. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> --- MAINTAINERS | 2 + fs/exec.c | 3 ++ include/linux/mm.h | 1 - mm/Makefile | 2 +- mm/mmap.c | 83 ---------------------------- mm/vma.h | 5 ++ mm/vma_exec.c | 92 ++++++++++++++++++++++++++++++++ tools/testing/vma/Makefile | 2 +- tools/testing/vma/vma.c | 1 + tools/testing/vma/vma_internal.h | 40 ++++++++++++++ 10 files changed, 145 insertions(+), 86 deletions(-) create mode 100644 mm/vma_exec.c diff --git a/MAINTAINERS b/MAINTAINERS index f5ee0390cdee..1ee1c22e6e36 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8830,6 +8830,7 @@ F: include/linux/elf.h F: include/uapi/linux/auxvec.h F: include/uapi/linux/binfmts.h F: include/uapi/linux/elf.h +F: mm/vma_exec.c F: tools/testing/selftests/exec/ N: asm/elf.h N: binfmt @@ -15654,6 +15655,7 @@ F: mm/mremap.c F: mm/mseal.c F: mm/vma.c F: mm/vma.h +F: mm/vma_exec.c F: mm/vma_internal.h F: tools/testing/selftests/mm/merge.c F: tools/testing/vma/ diff --git a/fs/exec.c b/fs/exec.c index 8e4ea5f1e64c..477bc3f2e966 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -78,6 +78,9 @@ #include <trace/events/sched.h> +/* For vma exec functions. 
*/ +#include "../mm/internal.h" + static int bprm_creds_from_file(struct linux_binprm *bprm); int suid_dumpable = 0; diff --git a/include/linux/mm.h b/include/linux/mm.h index 21dd110b6655..4fc361df9ad7 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3223,7 +3223,6 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node); extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin); extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); extern void exit_mmap(struct mm_struct *); -int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, bool write); diff --git a/mm/Makefile b/mm/Makefile index 9d7e5b5bb694..15a901bb431a 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -37,7 +37,7 @@ mmu-y := nommu.o mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \ mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \ msync.o page_vma_mapped.o pagewalk.o \ - pgtable-generic.o rmap.o vmalloc.o vma.o + pgtable-generic.o rmap.o vmalloc.o vma.o vma_exec.o ifdef CONFIG_CROSS_MEMORY_ATTACH diff --git a/mm/mmap.c b/mm/mmap.c index bd210aaf7ebd..1794bf6f4dc0 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1717,89 +1717,6 @@ static int __meminit init_reserve_notifier(void) } subsys_initcall(init_reserve_notifier); -/* - * Relocate a VMA downwards by shift bytes. There cannot be any VMAs between - * this VMA and its relocated range, which will now reside at [vma->vm_start - - * shift, vma->vm_end - shift). - * - * This function is almost certainly NOT what you want for anything other than - * early executable temporary stack relocation. - */ -int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) -{ - /* - * The process proceeds as follows: - * - * 1) Use shift to calculate the new vma endpoints. - * 2) Extend vma to cover both the old and new ranges. This ensures the - * arguments passed to subsequent functions are consistent. - * 3) Move vma's page tables to the new range. - * 4) Free up any cleared pgd range. - * 5) Shrink the vma to cover only the new range. - */ - - struct mm_struct *mm = vma->vm_mm; - unsigned long old_start = vma->vm_start; - unsigned long old_end = vma->vm_end; - unsigned long length = old_end - old_start; - unsigned long new_start = old_start - shift; - unsigned long new_end = old_end - shift; - VMA_ITERATOR(vmi, mm, new_start); - VMG_STATE(vmg, mm, &vmi, new_start, old_end, 0, vma->vm_pgoff); - struct vm_area_struct *next; - struct mmu_gather tlb; - PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length); - - BUG_ON(new_start > new_end); - - /* - * ensure there are no vmas between where we want to go - * and where we are - */ - if (vma != vma_next(&vmi)) - return -EFAULT; - - vma_iter_prev_range(&vmi); - /* - * cover the whole range: [new_start, old_end) - */ - vmg.middle = vma; - if (vma_expand(&vmg)) - return -ENOMEM; - - /* - * move the page tables downwards, on failure we rely on - * process cleanup to remove whatever mess we made. - */ - pmc.for_stack = true; - if (length != move_page_tables(&pmc)) - return -ENOMEM; - - tlb_gather_mmu(&tlb, mm); - next = vma_next(&vmi); - if (new_end > old_start) { - /* - * when the old and new regions overlap clear from new_end. - */ - free_pgd_range(&tlb, new_end, old_end, new_end, - next ? 
next->vm_start : USER_PGTABLES_CEILING); - } else { - /* - * otherwise, clean from old_start; this is done to not touch - * the address space in [new_end, old_start) some architectures - * have constraints on va-space that make this illegal (IA64) - - * for the others its just a little faster. - */ - free_pgd_range(&tlb, old_start, old_end, new_end, - next ? next->vm_start : USER_PGTABLES_CEILING); - } - tlb_finish_mmu(&tlb); - - vma_prev(&vmi); - /* Shrink the vma to just the new range */ - return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); -} - #ifdef CONFIG_MMU /* * Obtain a read lock on mm->mmap_lock, if the specified address is below the diff --git a/mm/vma.h b/mm/vma.h index 149926e8a6d1..1ce3e18f01b7 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -548,4 +548,9 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address); int __vm_munmap(unsigned long start, size_t len, bool unlock); +/* vma_exec.h */ +#ifdef CONFIG_MMU +int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); +#endif + #endif /* __MM_VMA_H */ diff --git a/mm/vma_exec.c b/mm/vma_exec.c new file mode 100644 index 000000000000..6736ae37f748 --- /dev/null +++ b/mm/vma_exec.c @@ -0,0 +1,92 @@ +// SPDX-License-Identifier: GPL-2.0-only + +/* + * Functions explicitly implemented for exec functionality which however are + * explicitly VMA-only logic. + */ + +#include "vma_internal.h" +#include "vma.h" + +/* + * Relocate a VMA downwards by shift bytes. There cannot be any VMAs between + * this VMA and its relocated range, which will now reside at [vma->vm_start - + * shift, vma->vm_end - shift). + * + * This function is almost certainly NOT what you want for anything other than + * early executable temporary stack relocation. + */ +int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) +{ + /* + * The process proceeds as follows: + * + * 1) Use shift to calculate the new vma endpoints. + * 2) Extend vma to cover both the old and new ranges. This ensures the + * arguments passed to subsequent functions are consistent. + * 3) Move vma's page tables to the new range. + * 4) Free up any cleared pgd range. + * 5) Shrink the vma to cover only the new range. + */ + + struct mm_struct *mm = vma->vm_mm; + unsigned long old_start = vma->vm_start; + unsigned long old_end = vma->vm_end; + unsigned long length = old_end - old_start; + unsigned long new_start = old_start - shift; + unsigned long new_end = old_end - shift; + VMA_ITERATOR(vmi, mm, new_start); + VMG_STATE(vmg, mm, &vmi, new_start, old_end, 0, vma->vm_pgoff); + struct vm_area_struct *next; + struct mmu_gather tlb; + PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length); + + BUG_ON(new_start > new_end); + + /* + * ensure there are no vmas between where we want to go + * and where we are + */ + if (vma != vma_next(&vmi)) + return -EFAULT; + + vma_iter_prev_range(&vmi); + /* + * cover the whole range: [new_start, old_end) + */ + vmg.middle = vma; + if (vma_expand(&vmg)) + return -ENOMEM; + + /* + * move the page tables downwards, on failure we rely on + * process cleanup to remove whatever mess we made. + */ + pmc.for_stack = true; + if (length != move_page_tables(&pmc)) + return -ENOMEM; + + tlb_gather_mmu(&tlb, mm); + next = vma_next(&vmi); + if (new_end > old_start) { + /* + * when the old and new regions overlap clear from new_end. + */ + free_pgd_range(&tlb, new_end, old_end, new_end, + next ? 
next->vm_start : USER_PGTABLES_CEILING); + } else { + /* + * otherwise, clean from old_start; this is done to not touch + * the address space in [new_end, old_start) some architectures + * have constraints on va-space that make this illegal (IA64) - + * for the others its just a little faster. + */ + free_pgd_range(&tlb, old_start, old_end, new_end, + next ? next->vm_start : USER_PGTABLES_CEILING); + } + tlb_finish_mmu(&tlb); + + vma_prev(&vmi); + /* Shrink the vma to just the new range */ + return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); +} diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile index 860fd2311dcc..624040fcf193 100644 --- a/tools/testing/vma/Makefile +++ b/tools/testing/vma/Makefile @@ -9,7 +9,7 @@ include ../shared/shared.mk OFILES = $(SHARED_OFILES) vma.o maple-shim.o TARGETS = vma -vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma.h +vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_exec.c ../../../mm/vma.h vma: $(OFILES) $(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS) diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 7cfd6e31db10..5832ae5d797d 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -28,6 +28,7 @@ unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT; * Directly import the VMA implementation here. Our vma_internal.h wrapper * provides userland-equivalent functionality for everything vma.c uses. */ +#include "../../../mm/vma_exec.c" #include "../../../mm/vma.c" const struct vm_operations_struct vma_dummy_vm_ops; diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 572ab2cea763..0df19ca0000a 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -421,6 +421,28 @@ struct vm_unmapped_area_info { unsigned long start_gap; }; +struct pagetable_move_control { + struct vm_area_struct *old; /* Source VMA. */ + struct vm_area_struct *new; /* Destination VMA. */ + unsigned long old_addr; /* Address from which the move begins. */ + unsigned long old_end; /* Exclusive address at which old range ends. */ + unsigned long new_addr; /* Address to move page tables to. */ + unsigned long len_in; /* Bytes to remap specified by user. */ + + bool need_rmap_locks; /* Do rmap locks need to be taken? */ + bool for_stack; /* Is this an early temp stack being moved? */ +}; + +#define PAGETABLE_MOVE(name, old_, new_, old_addr_, new_addr_, len_) \ + struct pagetable_move_control name = { \ + .old = old_, \ + .new = new_, \ + .old_addr = old_addr_, \ + .old_end = (old_addr_) + (len_), \ + .new_addr = new_addr_, \ + .len_in = len_, \ + } + static inline void vma_iter_invalidate(struct vma_iterator *vmi) { mas_pause(&vmi->mas); @@ -1240,4 +1262,22 @@ static inline int mapping_map_writable(struct address_space *mapping) return 0; } +static inline unsigned long move_page_tables(struct pagetable_move_control *pmc) +{ + (void)pmc; + + return 0; +} + +static inline void free_pgd_range(struct mmu_gather *tlb, + unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling) +{ + (void)tlb; + (void)addr; + (void)end; + (void)floor; + (void)ceiling; +} + #endif /* __MM_VMA_INTERNAL_H */ -- 2.49.0 ^ permalink raw reply related [flat|nested] 36+ messages in thread
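Two quick ways to see the effect of this patch locally. First, as the commit
message notes, the dual MAINTAINERS entries mean get_maintainer.pl reports
both sets of maintainers for the new file (illustrative invocation; the exact
output depends on the current MAINTAINERS contents):

	$ ./scripts/get_maintainer.pl -f mm/vma_exec.c

should list people from both the MEMORY MAPPING and the EXEC & BINFMT API,
ELF sections. Second, the tools/testing/vma changes keep the userland VMA
tests building with vma_exec.c pulled in via #include; assuming the usual
in-tree workflow for these tests, something like:

	$ make -C tools/testing/vma
	$ ./tools/testing/vma/vma

should build and run the userland test binary with the relocated code
compiled in (move_page_tables() and free_pgd_range() are stubbed out in
vma_internal.h, so relocate_vma_down() is compiled rather than exercised).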
* Re: [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality 2025-04-28 15:28 ` [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality Lorenzo Stoakes @ 2025-04-28 19:12 ` Liam R. Howlett 2025-04-28 20:14 ` Suren Baghdasaryan 2025-04-29 6:59 ` Vlastimil Babka ` (3 subsequent siblings) 4 siblings, 1 reply; 36+ messages in thread From: Liam R. Howlett @ 2025-04-28 19:12 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250428 11:28]: > There is functionality that overlaps the exec and memory mapping > subsystems. While it properly belongs in mm, it is important that exec > maintainers maintain oversight of this functionality correctly. > > We can establish both goals by adding a new mm/vma_exec.c file which > contains these 'glue' functions, and have fs/exec.c import them. > > As a part of this change, to ensure that proper oversight is achieved, add > the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. > > scripts/get_maintainer.pl can correctly handle files in multiple entries > and this neatly handles the cross-over. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> > --- > MAINTAINERS | 2 + > fs/exec.c | 3 ++ > include/linux/mm.h | 1 - > mm/Makefile | 2 +- > mm/mmap.c | 83 ---------------------------- > mm/vma.h | 5 ++ > mm/vma_exec.c | 92 ++++++++++++++++++++++++++++++++ > tools/testing/vma/Makefile | 2 +- > tools/testing/vma/vma.c | 1 + > tools/testing/vma/vma_internal.h | 40 ++++++++++++++ > 10 files changed, 145 insertions(+), 86 deletions(-) > create mode 100644 mm/vma_exec.c > > diff --git a/MAINTAINERS b/MAINTAINERS > index f5ee0390cdee..1ee1c22e6e36 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -8830,6 +8830,7 @@ F: include/linux/elf.h > F: include/uapi/linux/auxvec.h > F: include/uapi/linux/binfmts.h > F: include/uapi/linux/elf.h > +F: mm/vma_exec.c > F: tools/testing/selftests/exec/ > N: asm/elf.h > N: binfmt > @@ -15654,6 +15655,7 @@ F: mm/mremap.c > F: mm/mseal.c > F: mm/vma.c > F: mm/vma.h > +F: mm/vma_exec.c > F: mm/vma_internal.h > F: tools/testing/selftests/mm/merge.c > F: tools/testing/vma/ > diff --git a/fs/exec.c b/fs/exec.c > index 8e4ea5f1e64c..477bc3f2e966 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -78,6 +78,9 @@ > > #include <trace/events/sched.h> > > +/* For vma exec functions. 
*/ > +#include "../mm/internal.h" > + > static int bprm_creds_from_file(struct linux_binprm *bprm); > > int suid_dumpable = 0; > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 21dd110b6655..4fc361df9ad7 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -3223,7 +3223,6 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node); > extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin); > extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); > extern void exit_mmap(struct mm_struct *); > -int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); > bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma, > unsigned long addr, bool write); > > diff --git a/mm/Makefile b/mm/Makefile > index 9d7e5b5bb694..15a901bb431a 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -37,7 +37,7 @@ mmu-y := nommu.o > mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \ > mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \ > msync.o page_vma_mapped.o pagewalk.o \ > - pgtable-generic.o rmap.o vmalloc.o vma.o > + pgtable-generic.o rmap.o vmalloc.o vma.o vma_exec.o > > > ifdef CONFIG_CROSS_MEMORY_ATTACH > diff --git a/mm/mmap.c b/mm/mmap.c > index bd210aaf7ebd..1794bf6f4dc0 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1717,89 +1717,6 @@ static int __meminit init_reserve_notifier(void) > } > subsys_initcall(init_reserve_notifier); > > -/* > - * Relocate a VMA downwards by shift bytes. There cannot be any VMAs between > - * this VMA and its relocated range, which will now reside at [vma->vm_start - > - * shift, vma->vm_end - shift). > - * > - * This function is almost certainly NOT what you want for anything other than > - * early executable temporary stack relocation. > - */ > -int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) > -{ > - /* > - * The process proceeds as follows: > - * > - * 1) Use shift to calculate the new vma endpoints. > - * 2) Extend vma to cover both the old and new ranges. This ensures the > - * arguments passed to subsequent functions are consistent. > - * 3) Move vma's page tables to the new range. > - * 4) Free up any cleared pgd range. > - * 5) Shrink the vma to cover only the new range. > - */ > - > - struct mm_struct *mm = vma->vm_mm; > - unsigned long old_start = vma->vm_start; > - unsigned long old_end = vma->vm_end; > - unsigned long length = old_end - old_start; > - unsigned long new_start = old_start - shift; > - unsigned long new_end = old_end - shift; > - VMA_ITERATOR(vmi, mm, new_start); > - VMG_STATE(vmg, mm, &vmi, new_start, old_end, 0, vma->vm_pgoff); > - struct vm_area_struct *next; > - struct mmu_gather tlb; > - PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length); > - > - BUG_ON(new_start > new_end); > - > - /* > - * ensure there are no vmas between where we want to go > - * and where we are > - */ > - if (vma != vma_next(&vmi)) > - return -EFAULT; > - > - vma_iter_prev_range(&vmi); > - /* > - * cover the whole range: [new_start, old_end) > - */ > - vmg.middle = vma; > - if (vma_expand(&vmg)) > - return -ENOMEM; > - > - /* > - * move the page tables downwards, on failure we rely on > - * process cleanup to remove whatever mess we made. > - */ > - pmc.for_stack = true; > - if (length != move_page_tables(&pmc)) > - return -ENOMEM; > - > - tlb_gather_mmu(&tlb, mm); > - next = vma_next(&vmi); > - if (new_end > old_start) { > - /* > - * when the old and new regions overlap clear from new_end. 
> - */ > - free_pgd_range(&tlb, new_end, old_end, new_end, > - next ? next->vm_start : USER_PGTABLES_CEILING); > - } else { > - /* > - * otherwise, clean from old_start; this is done to not touch > - * the address space in [new_end, old_start) some architectures > - * have constraints on va-space that make this illegal (IA64) - > - * for the others its just a little faster. > - */ > - free_pgd_range(&tlb, old_start, old_end, new_end, > - next ? next->vm_start : USER_PGTABLES_CEILING); > - } > - tlb_finish_mmu(&tlb); > - > - vma_prev(&vmi); > - /* Shrink the vma to just the new range */ > - return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); > -} > - > #ifdef CONFIG_MMU > /* > * Obtain a read lock on mm->mmap_lock, if the specified address is below the > diff --git a/mm/vma.h b/mm/vma.h > index 149926e8a6d1..1ce3e18f01b7 100644 > --- a/mm/vma.h > +++ b/mm/vma.h > @@ -548,4 +548,9 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address); > > int __vm_munmap(unsigned long start, size_t len, bool unlock); > > +/* vma_exec.h */ > +#ifdef CONFIG_MMU > +int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); > +#endif > + > #endif /* __MM_VMA_H */ > diff --git a/mm/vma_exec.c b/mm/vma_exec.c > new file mode 100644 > index 000000000000..6736ae37f748 > --- /dev/null > +++ b/mm/vma_exec.c > @@ -0,0 +1,92 @@ > +// SPDX-License-Identifier: GPL-2.0-only > + > +/* > + * Functions explicitly implemented for exec functionality which however are > + * explicitly VMA-only logic. > + */ > + > +#include "vma_internal.h" > +#include "vma.h" > + > +/* > + * Relocate a VMA downwards by shift bytes. There cannot be any VMAs between > + * this VMA and its relocated range, which will now reside at [vma->vm_start - > + * shift, vma->vm_end - shift). > + * > + * This function is almost certainly NOT what you want for anything other than > + * early executable temporary stack relocation. > + */ > +int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) > +{ > + /* > + * The process proceeds as follows: > + * > + * 1) Use shift to calculate the new vma endpoints. > + * 2) Extend vma to cover both the old and new ranges. This ensures the > + * arguments passed to subsequent functions are consistent. > + * 3) Move vma's page tables to the new range. > + * 4) Free up any cleared pgd range. > + * 5) Shrink the vma to cover only the new range. > + */ > + > + struct mm_struct *mm = vma->vm_mm; > + unsigned long old_start = vma->vm_start; > + unsigned long old_end = vma->vm_end; > + unsigned long length = old_end - old_start; > + unsigned long new_start = old_start - shift; > + unsigned long new_end = old_end - shift; > + VMA_ITERATOR(vmi, mm, new_start); > + VMG_STATE(vmg, mm, &vmi, new_start, old_end, 0, vma->vm_pgoff); > + struct vm_area_struct *next; > + struct mmu_gather tlb; > + PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length); > + > + BUG_ON(new_start > new_end); > + > + /* > + * ensure there are no vmas between where we want to go > + * and where we are > + */ > + if (vma != vma_next(&vmi)) > + return -EFAULT; > + > + vma_iter_prev_range(&vmi); > + /* > + * cover the whole range: [new_start, old_end) > + */ > + vmg.middle = vma; > + if (vma_expand(&vmg)) > + return -ENOMEM; > + > + /* > + * move the page tables downwards, on failure we rely on > + * process cleanup to remove whatever mess we made. 
> + */ > + pmc.for_stack = true; > + if (length != move_page_tables(&pmc)) > + return -ENOMEM; > + > + tlb_gather_mmu(&tlb, mm); > + next = vma_next(&vmi); > + if (new_end > old_start) { > + /* > + * when the old and new regions overlap clear from new_end. > + */ > + free_pgd_range(&tlb, new_end, old_end, new_end, > + next ? next->vm_start : USER_PGTABLES_CEILING); > + } else { > + /* > + * otherwise, clean from old_start; this is done to not touch > + * the address space in [new_end, old_start) some architectures > + * have constraints on va-space that make this illegal (IA64) - > + * for the others its just a little faster. > + */ > + free_pgd_range(&tlb, old_start, old_end, new_end, > + next ? next->vm_start : USER_PGTABLES_CEILING); > + } > + tlb_finish_mmu(&tlb); > + > + vma_prev(&vmi); > + /* Shrink the vma to just the new range */ > + return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); > +} > diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile > index 860fd2311dcc..624040fcf193 100644 > --- a/tools/testing/vma/Makefile > +++ b/tools/testing/vma/Makefile > @@ -9,7 +9,7 @@ include ../shared/shared.mk > OFILES = $(SHARED_OFILES) vma.o maple-shim.o > TARGETS = vma > > -vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma.h > +vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_exec.c ../../../mm/vma.h > > vma: $(OFILES) > $(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS) > diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c > index 7cfd6e31db10..5832ae5d797d 100644 > --- a/tools/testing/vma/vma.c > +++ b/tools/testing/vma/vma.c > @@ -28,6 +28,7 @@ unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT; > * Directly import the VMA implementation here. Our vma_internal.h wrapper > * provides userland-equivalent functionality for everything vma.c uses. > */ > +#include "../../../mm/vma_exec.c" > #include "../../../mm/vma.c" > > const struct vm_operations_struct vma_dummy_vm_ops; > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h > index 572ab2cea763..0df19ca0000a 100644 > --- a/tools/testing/vma/vma_internal.h > +++ b/tools/testing/vma/vma_internal.h > @@ -421,6 +421,28 @@ struct vm_unmapped_area_info { > unsigned long start_gap; > }; > > +struct pagetable_move_control { > + struct vm_area_struct *old; /* Source VMA. */ > + struct vm_area_struct *new; /* Destination VMA. */ > + unsigned long old_addr; /* Address from which the move begins. */ > + unsigned long old_end; /* Exclusive address at which old range ends. */ > + unsigned long new_addr; /* Address to move page tables to. */ > + unsigned long len_in; /* Bytes to remap specified by user. */ > + > + bool need_rmap_locks; /* Do rmap locks need to be taken? */ > + bool for_stack; /* Is this an early temp stack being moved? 
*/ > +}; > + > +#define PAGETABLE_MOVE(name, old_, new_, old_addr_, new_addr_, len_) \ > + struct pagetable_move_control name = { \ > + .old = old_, \ > + .new = new_, \ > + .old_addr = old_addr_, \ > + .old_end = (old_addr_) + (len_), \ > + .new_addr = new_addr_, \ > + .len_in = len_, \ > + } > + > static inline void vma_iter_invalidate(struct vma_iterator *vmi) > { > mas_pause(&vmi->mas); > @@ -1240,4 +1262,22 @@ static inline int mapping_map_writable(struct address_space *mapping) > return 0; > } > > +static inline unsigned long move_page_tables(struct pagetable_move_control *pmc) > +{ > + (void)pmc; > + > + return 0; > +} > + > +static inline void free_pgd_range(struct mmu_gather *tlb, > + unsigned long addr, unsigned long end, > + unsigned long floor, unsigned long ceiling) > +{ > + (void)tlb; > + (void)addr; > + (void)end; > + (void)floor; > + (void)ceiling; > +} > + > #endif /* __MM_VMA_INTERNAL_H */ > -- > 2.49.0 > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality 2025-04-28 19:12 ` Liam R. Howlett @ 2025-04-28 20:14 ` Suren Baghdasaryan 2025-04-28 20:26 ` Lorenzo Stoakes 0 siblings, 1 reply; 36+ messages in thread From: Suren Baghdasaryan @ 2025-04-28 20:14 UTC (permalink / raw) To: Liam R. Howlett, Lorenzo Stoakes, Andrew Morton, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Mon, Apr 28, 2025 at 12:20 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250428 11:28]: > > There is functionality that overlaps the exec and memory mapping > > subsystems. While it properly belongs in mm, it is important that exec > > maintainers maintain oversight of this functionality correctly. > > > > We can establish both goals by adding a new mm/vma_exec.c file which > > contains these 'glue' functions, and have fs/exec.c import them. > > > > As a part of this change, to ensure that proper oversight is achieved, add > > the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. > > > > scripts/get_maintainer.pl can correctly handle files in multiple entries > > and this neatly handles the cross-over. > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> > > > --- > > MAINTAINERS | 2 + > > fs/exec.c | 3 ++ > > include/linux/mm.h | 1 - > > mm/Makefile | 2 +- > > mm/mmap.c | 83 ---------------------------- > > mm/vma.h | 5 ++ > > mm/vma_exec.c | 92 ++++++++++++++++++++++++++++++++ > > tools/testing/vma/Makefile | 2 +- > > tools/testing/vma/vma.c | 1 + > > tools/testing/vma/vma_internal.h | 40 ++++++++++++++ > > 10 files changed, 145 insertions(+), 86 deletions(-) > > create mode 100644 mm/vma_exec.c > > > > diff --git a/MAINTAINERS b/MAINTAINERS > > index f5ee0390cdee..1ee1c22e6e36 100644 > > --- a/MAINTAINERS > > +++ b/MAINTAINERS > > @@ -8830,6 +8830,7 @@ F: include/linux/elf.h > > F: include/uapi/linux/auxvec.h > > F: include/uapi/linux/binfmts.h > > F: include/uapi/linux/elf.h > > +F: mm/vma_exec.c > > F: tools/testing/selftests/exec/ > > N: asm/elf.h > > N: binfmt > > @@ -15654,6 +15655,7 @@ F: mm/mremap.c > > F: mm/mseal.c > > F: mm/vma.c > > F: mm/vma.h > > +F: mm/vma_exec.c > > F: mm/vma_internal.h > > F: tools/testing/selftests/mm/merge.c > > F: tools/testing/vma/ > > diff --git a/fs/exec.c b/fs/exec.c > > index 8e4ea5f1e64c..477bc3f2e966 100644 > > --- a/fs/exec.c > > +++ b/fs/exec.c > > @@ -78,6 +78,9 @@ > > > > #include <trace/events/sched.h> > > > > +/* For vma exec functions. 
*/ > > +#include "../mm/internal.h" > > + > > static int bprm_creds_from_file(struct linux_binprm *bprm); > > > > int suid_dumpable = 0; > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 21dd110b6655..4fc361df9ad7 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -3223,7 +3223,6 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node); > > extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin); > > extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); > > extern void exit_mmap(struct mm_struct *); > > -int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); > > bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma, > > unsigned long addr, bool write); > > > > diff --git a/mm/Makefile b/mm/Makefile > > index 9d7e5b5bb694..15a901bb431a 100644 > > --- a/mm/Makefile > > +++ b/mm/Makefile > > @@ -37,7 +37,7 @@ mmu-y := nommu.o > > mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \ > > mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \ > > msync.o page_vma_mapped.o pagewalk.o \ > > - pgtable-generic.o rmap.o vmalloc.o vma.o > > + pgtable-generic.o rmap.o vmalloc.o vma.o vma_exec.o > > > > > > ifdef CONFIG_CROSS_MEMORY_ATTACH > > diff --git a/mm/mmap.c b/mm/mmap.c > > index bd210aaf7ebd..1794bf6f4dc0 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -1717,89 +1717,6 @@ static int __meminit init_reserve_notifier(void) > > } > > subsys_initcall(init_reserve_notifier); > > > > -/* > > - * Relocate a VMA downwards by shift bytes. There cannot be any VMAs between > > - * this VMA and its relocated range, which will now reside at [vma->vm_start - > > - * shift, vma->vm_end - shift). > > - * > > - * This function is almost certainly NOT what you want for anything other than > > - * early executable temporary stack relocation. > > - */ > > -int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) > > -{ > > - /* > > - * The process proceeds as follows: > > - * > > - * 1) Use shift to calculate the new vma endpoints. > > - * 2) Extend vma to cover both the old and new ranges. This ensures the > > - * arguments passed to subsequent functions are consistent. > > - * 3) Move vma's page tables to the new range. > > - * 4) Free up any cleared pgd range. > > - * 5) Shrink the vma to cover only the new range. > > - */ > > - > > - struct mm_struct *mm = vma->vm_mm; > > - unsigned long old_start = vma->vm_start; > > - unsigned long old_end = vma->vm_end; > > - unsigned long length = old_end - old_start; > > - unsigned long new_start = old_start - shift; > > - unsigned long new_end = old_end - shift; > > - VMA_ITERATOR(vmi, mm, new_start); > > - VMG_STATE(vmg, mm, &vmi, new_start, old_end, 0, vma->vm_pgoff); > > - struct vm_area_struct *next; > > - struct mmu_gather tlb; > > - PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length); > > - > > - BUG_ON(new_start > new_end); > > - > > - /* > > - * ensure there are no vmas between where we want to go > > - * and where we are > > - */ > > - if (vma != vma_next(&vmi)) > > - return -EFAULT; > > - > > - vma_iter_prev_range(&vmi); > > - /* > > - * cover the whole range: [new_start, old_end) > > - */ > > - vmg.middle = vma; > > - if (vma_expand(&vmg)) > > - return -ENOMEM; > > - > > - /* > > - * move the page tables downwards, on failure we rely on > > - * process cleanup to remove whatever mess we made. 
> > - */ > > - pmc.for_stack = true; > > - if (length != move_page_tables(&pmc)) > > - return -ENOMEM; > > - > > - tlb_gather_mmu(&tlb, mm); > > - next = vma_next(&vmi); > > - if (new_end > old_start) { > > - /* > > - * when the old and new regions overlap clear from new_end. > > - */ > > - free_pgd_range(&tlb, new_end, old_end, new_end, > > - next ? next->vm_start : USER_PGTABLES_CEILING); > > - } else { > > - /* > > - * otherwise, clean from old_start; this is done to not touch > > - * the address space in [new_end, old_start) some architectures > > - * have constraints on va-space that make this illegal (IA64) - > > - * for the others its just a little faster. > > - */ > > - free_pgd_range(&tlb, old_start, old_end, new_end, > > - next ? next->vm_start : USER_PGTABLES_CEILING); > > - } > > - tlb_finish_mmu(&tlb); > > - > > - vma_prev(&vmi); > > - /* Shrink the vma to just the new range */ > > - return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); > > -} > > - > > #ifdef CONFIG_MMU > > /* > > * Obtain a read lock on mm->mmap_lock, if the specified address is below the > > diff --git a/mm/vma.h b/mm/vma.h > > index 149926e8a6d1..1ce3e18f01b7 100644 > > --- a/mm/vma.h > > +++ b/mm/vma.h > > @@ -548,4 +548,9 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address); > > > > int __vm_munmap(unsigned long start, size_t len, bool unlock); > > > > +/* vma_exec.h */ nit: Did you mean vma_exec.c ? > > +#ifdef CONFIG_MMU > > +int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); > > +#endif > > + > > #endif /* __MM_VMA_H */ > > diff --git a/mm/vma_exec.c b/mm/vma_exec.c > > new file mode 100644 > > index 000000000000..6736ae37f748 > > --- /dev/null > > +++ b/mm/vma_exec.c > > @@ -0,0 +1,92 @@ > > +// SPDX-License-Identifier: GPL-2.0-only > > + > > +/* > > + * Functions explicitly implemented for exec functionality which however are > > + * explicitly VMA-only logic. > > + */ > > + > > +#include "vma_internal.h" > > +#include "vma.h" > > + > > +/* > > + * Relocate a VMA downwards by shift bytes. There cannot be any VMAs between > > + * this VMA and its relocated range, which will now reside at [vma->vm_start - > > + * shift, vma->vm_end - shift). > > + * > > + * This function is almost certainly NOT what you want for anything other than > > + * early executable temporary stack relocation. > > + */ > > +int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) > > +{ > > + /* > > + * The process proceeds as follows: > > + * > > + * 1) Use shift to calculate the new vma endpoints. > > + * 2) Extend vma to cover both the old and new ranges. This ensures the > > + * arguments passed to subsequent functions are consistent. > > + * 3) Move vma's page tables to the new range. > > + * 4) Free up any cleared pgd range. > > + * 5) Shrink the vma to cover only the new range. 
> > + */ > > + > > + struct mm_struct *mm = vma->vm_mm; > > + unsigned long old_start = vma->vm_start; > > + unsigned long old_end = vma->vm_end; > > + unsigned long length = old_end - old_start; > > + unsigned long new_start = old_start - shift; > > + unsigned long new_end = old_end - shift; > > + VMA_ITERATOR(vmi, mm, new_start); > > + VMG_STATE(vmg, mm, &vmi, new_start, old_end, 0, vma->vm_pgoff); > > + struct vm_area_struct *next; > > + struct mmu_gather tlb; > > + PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length); > > + > > + BUG_ON(new_start > new_end); > > + > > + /* > > + * ensure there are no vmas between where we want to go > > + * and where we are > > + */ > > + if (vma != vma_next(&vmi)) > > + return -EFAULT; > > + > > + vma_iter_prev_range(&vmi); > > + /* > > + * cover the whole range: [new_start, old_end) > > + */ > > + vmg.middle = vma; > > + if (vma_expand(&vmg)) > > + return -ENOMEM; > > + > > + /* > > + * move the page tables downwards, on failure we rely on > > + * process cleanup to remove whatever mess we made. > > + */ > > + pmc.for_stack = true; > > + if (length != move_page_tables(&pmc)) > > + return -ENOMEM; > > + > > + tlb_gather_mmu(&tlb, mm); > > + next = vma_next(&vmi); > > + if (new_end > old_start) { > > + /* > > + * when the old and new regions overlap clear from new_end. > > + */ > > + free_pgd_range(&tlb, new_end, old_end, new_end, > > + next ? next->vm_start : USER_PGTABLES_CEILING); > > + } else { > > + /* > > + * otherwise, clean from old_start; this is done to not touch > > + * the address space in [new_end, old_start) some architectures > > + * have constraints on va-space that make this illegal (IA64) - > > + * for the others its just a little faster. > > + */ > > + free_pgd_range(&tlb, old_start, old_end, new_end, > > + next ? next->vm_start : USER_PGTABLES_CEILING); > > + } > > + tlb_finish_mmu(&tlb); > > + > > + vma_prev(&vmi); > > + /* Shrink the vma to just the new range */ > > + return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); > > +} > > diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile > > index 860fd2311dcc..624040fcf193 100644 > > --- a/tools/testing/vma/Makefile > > +++ b/tools/testing/vma/Makefile > > @@ -9,7 +9,7 @@ include ../shared/shared.mk > > OFILES = $(SHARED_OFILES) vma.o maple-shim.o > > TARGETS = vma > > > > -vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma.h > > +vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_exec.c ../../../mm/vma.h > > > > vma: $(OFILES) > > $(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS) > > diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c > > index 7cfd6e31db10..5832ae5d797d 100644 > > --- a/tools/testing/vma/vma.c > > +++ b/tools/testing/vma/vma.c > > @@ -28,6 +28,7 @@ unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT; > > * Directly import the VMA implementation here. Our vma_internal.h wrapper > > * provides userland-equivalent functionality for everything vma.c uses. > > */ > > +#include "../../../mm/vma_exec.c" > > #include "../../../mm/vma.c" > > > > const struct vm_operations_struct vma_dummy_vm_ops; > > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h > > index 572ab2cea763..0df19ca0000a 100644 > > --- a/tools/testing/vma/vma_internal.h > > +++ b/tools/testing/vma/vma_internal.h > > @@ -421,6 +421,28 @@ struct vm_unmapped_area_info { > > unsigned long start_gap; > > }; > > > > +struct pagetable_move_control { > > + struct vm_area_struct *old; /* Source VMA. 
*/ > > + struct vm_area_struct *new; /* Destination VMA. */ > > + unsigned long old_addr; /* Address from which the move begins. */ > > + unsigned long old_end; /* Exclusive address at which old range ends. */ > > + unsigned long new_addr; /* Address to move page tables to. */ > > + unsigned long len_in; /* Bytes to remap specified by user. */ > > + > > + bool need_rmap_locks; /* Do rmap locks need to be taken? */ > > + bool for_stack; /* Is this an early temp stack being moved? */ > > +}; > > + > > +#define PAGETABLE_MOVE(name, old_, new_, old_addr_, new_addr_, len_) \ > > + struct pagetable_move_control name = { \ > > + .old = old_, \ > > + .new = new_, \ > > + .old_addr = old_addr_, \ > > + .old_end = (old_addr_) + (len_), \ > > + .new_addr = new_addr_, \ > > + .len_in = len_, \ > > + } > > + > > static inline void vma_iter_invalidate(struct vma_iterator *vmi) > > { > > mas_pause(&vmi->mas); > > @@ -1240,4 +1262,22 @@ static inline int mapping_map_writable(struct address_space *mapping) > > return 0; > > } > > > > +static inline unsigned long move_page_tables(struct pagetable_move_control *pmc) > > +{ > > + (void)pmc; > > + > > + return 0; > > +} > > + > > +static inline void free_pgd_range(struct mmu_gather *tlb, > > + unsigned long addr, unsigned long end, > > + unsigned long floor, unsigned long ceiling) > > +{ > > + (void)tlb; > > + (void)addr; > > + (void)end; > > + (void)floor; > > + (void)ceiling; > > +} > > + > > #endif /* __MM_VMA_INTERNAL_H */ > > -- > > 2.49.0 > > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality 2025-04-28 20:14 ` Suren Baghdasaryan @ 2025-04-28 20:26 ` Lorenzo Stoakes 2025-04-28 23:08 ` Andrew Morton 0 siblings, 1 reply; 36+ messages in thread From: Lorenzo Stoakes @ 2025-04-28 20:26 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Liam R. Howlett, Andrew Morton, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-fsdevel, linux-kernel Andrew - I typo'd /* vma_exec.h */ below in the change to mm/vma.h - would it be possible to correct to vma_exec.c, or would a fixpatch make life easier? Cheers, Lorenzo On Mon, Apr 28, 2025 at 01:14:31PM -0700, Suren Baghdasaryan wrote: > On Mon, Apr 28, 2025 at 12:20 PM Liam R. Howlett > <Liam.Howlett@oracle.com> wrote: > > > > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250428 11:28]: > > > There is functionality that overlaps the exec and memory mapping > > > subsystems. While it properly belongs in mm, it is important that exec > > > maintainers maintain oversight of this functionality correctly. > > > > > > We can establish both goals by adding a new mm/vma_exec.c file which > > > contains these 'glue' functions, and have fs/exec.c import them. > > > > > > As a part of this change, to ensure that proper oversight is achieved, add > > > the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. > > > > > > scripts/get_maintainer.pl can correctly handle files in multiple entries > > > and this neatly handles the cross-over. > > > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> > > Reviewed-by: Suren Baghdasaryan <surenb@google.com> Thanks! > > > > > > --- > > > MAINTAINERS | 2 + > > > fs/exec.c | 3 ++ > > > include/linux/mm.h | 1 - > > > mm/Makefile | 2 +- > > > mm/mmap.c | 83 ---------------------------- > > > mm/vma.h | 5 ++ > > > mm/vma_exec.c | 92 ++++++++++++++++++++++++++++++++ > > > tools/testing/vma/Makefile | 2 +- > > > tools/testing/vma/vma.c | 1 + > > > tools/testing/vma/vma_internal.h | 40 ++++++++++++++ > > > 10 files changed, 145 insertions(+), 86 deletions(-) > > > create mode 100644 mm/vma_exec.c > > > > > > diff --git a/MAINTAINERS b/MAINTAINERS > > > index f5ee0390cdee..1ee1c22e6e36 100644 > > > --- a/MAINTAINERS > > > +++ b/MAINTAINERS > > > @@ -8830,6 +8830,7 @@ F: include/linux/elf.h > > > F: include/uapi/linux/auxvec.h > > > F: include/uapi/linux/binfmts.h > > > F: include/uapi/linux/elf.h > > > +F: mm/vma_exec.c > > > F: tools/testing/selftests/exec/ > > > N: asm/elf.h > > > N: binfmt > > > @@ -15654,6 +15655,7 @@ F: mm/mremap.c > > > F: mm/mseal.c > > > F: mm/vma.c > > > F: mm/vma.h > > > +F: mm/vma_exec.c > > > F: mm/vma_internal.h > > > F: tools/testing/selftests/mm/merge.c > > > F: tools/testing/vma/ > > > diff --git a/fs/exec.c b/fs/exec.c > > > index 8e4ea5f1e64c..477bc3f2e966 100644 > > > --- a/fs/exec.c > > > +++ b/fs/exec.c > > > @@ -78,6 +78,9 @@ > > > > > > #include <trace/events/sched.h> > > > > > > +/* For vma exec functions. 
*/ > > > +#include "../mm/internal.h" > > > + > > > static int bprm_creds_from_file(struct linux_binprm *bprm); > > > > > > int suid_dumpable = 0; > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > index 21dd110b6655..4fc361df9ad7 100644 > > > --- a/include/linux/mm.h > > > +++ b/include/linux/mm.h > > > @@ -3223,7 +3223,6 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node); > > > extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin); > > > extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); > > > extern void exit_mmap(struct mm_struct *); > > > -int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); > > > bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma, > > > unsigned long addr, bool write); > > > > > > diff --git a/mm/Makefile b/mm/Makefile > > > index 9d7e5b5bb694..15a901bb431a 100644 > > > --- a/mm/Makefile > > > +++ b/mm/Makefile > > > @@ -37,7 +37,7 @@ mmu-y := nommu.o > > > mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \ > > > mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \ > > > msync.o page_vma_mapped.o pagewalk.o \ > > > - pgtable-generic.o rmap.o vmalloc.o vma.o > > > + pgtable-generic.o rmap.o vmalloc.o vma.o vma_exec.o > > > > > > > > > ifdef CONFIG_CROSS_MEMORY_ATTACH > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > index bd210aaf7ebd..1794bf6f4dc0 100644 > > > --- a/mm/mmap.c > > > +++ b/mm/mmap.c > > > @@ -1717,89 +1717,6 @@ static int __meminit init_reserve_notifier(void) > > > } > > > subsys_initcall(init_reserve_notifier); > > > > > > -/* > > > - * Relocate a VMA downwards by shift bytes. There cannot be any VMAs between > > > - * this VMA and its relocated range, which will now reside at [vma->vm_start - > > > - * shift, vma->vm_end - shift). > > > - * > > > - * This function is almost certainly NOT what you want for anything other than > > > - * early executable temporary stack relocation. > > > - */ > > > -int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) > > > -{ > > > - /* > > > - * The process proceeds as follows: > > > - * > > > - * 1) Use shift to calculate the new vma endpoints. > > > - * 2) Extend vma to cover both the old and new ranges. This ensures the > > > - * arguments passed to subsequent functions are consistent. > > > - * 3) Move vma's page tables to the new range. > > > - * 4) Free up any cleared pgd range. > > > - * 5) Shrink the vma to cover only the new range. 
> > > - */ > > > - > > > - struct mm_struct *mm = vma->vm_mm; > > > - unsigned long old_start = vma->vm_start; > > > - unsigned long old_end = vma->vm_end; > > > - unsigned long length = old_end - old_start; > > > - unsigned long new_start = old_start - shift; > > > - unsigned long new_end = old_end - shift; > > > - VMA_ITERATOR(vmi, mm, new_start); > > > - VMG_STATE(vmg, mm, &vmi, new_start, old_end, 0, vma->vm_pgoff); > > > - struct vm_area_struct *next; > > > - struct mmu_gather tlb; > > > - PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length); > > > - > > > - BUG_ON(new_start > new_end); > > > - > > > - /* > > > - * ensure there are no vmas between where we want to go > > > - * and where we are > > > - */ > > > - if (vma != vma_next(&vmi)) > > > - return -EFAULT; > > > - > > > - vma_iter_prev_range(&vmi); > > > - /* > > > - * cover the whole range: [new_start, old_end) > > > - */ > > > - vmg.middle = vma; > > > - if (vma_expand(&vmg)) > > > - return -ENOMEM; > > > - > > > - /* > > > - * move the page tables downwards, on failure we rely on > > > - * process cleanup to remove whatever mess we made. > > > - */ > > > - pmc.for_stack = true; > > > - if (length != move_page_tables(&pmc)) > > > - return -ENOMEM; > > > - > > > - tlb_gather_mmu(&tlb, mm); > > > - next = vma_next(&vmi); > > > - if (new_end > old_start) { > > > - /* > > > - * when the old and new regions overlap clear from new_end. > > > - */ > > > - free_pgd_range(&tlb, new_end, old_end, new_end, > > > - next ? next->vm_start : USER_PGTABLES_CEILING); > > > - } else { > > > - /* > > > - * otherwise, clean from old_start; this is done to not touch > > > - * the address space in [new_end, old_start) some architectures > > > - * have constraints on va-space that make this illegal (IA64) - > > > - * for the others its just a little faster. > > > - */ > > > - free_pgd_range(&tlb, old_start, old_end, new_end, > > > - next ? next->vm_start : USER_PGTABLES_CEILING); > > > - } > > > - tlb_finish_mmu(&tlb); > > > - > > > - vma_prev(&vmi); > > > - /* Shrink the vma to just the new range */ > > > - return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); > > > -} > > > - > > > #ifdef CONFIG_MMU > > > /* > > > * Obtain a read lock on mm->mmap_lock, if the specified address is below the > > > diff --git a/mm/vma.h b/mm/vma.h > > > index 149926e8a6d1..1ce3e18f01b7 100644 > > > --- a/mm/vma.h > > > +++ b/mm/vma.h > > > @@ -548,4 +548,9 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address); > > > > > > int __vm_munmap(unsigned long start, size_t len, bool unlock); > > > > > > +/* vma_exec.h */ > > nit: Did you mean vma_exec.c ? Oops yeah, I did the same for vma_init.[ch] too lol, so at least consistent... > > > > +#ifdef CONFIG_MMU > > > +int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); > > > +#endif > > > + > > > #endif /* __MM_VMA_H */ > > > diff --git a/mm/vma_exec.c b/mm/vma_exec.c > > > new file mode 100644 > > > index 000000000000..6736ae37f748 > > > --- /dev/null > > > +++ b/mm/vma_exec.c > > > @@ -0,0 +1,92 @@ > > > +// SPDX-License-Identifier: GPL-2.0-only > > > + > > > +/* > > > + * Functions explicitly implemented for exec functionality which however are > > > + * explicitly VMA-only logic. > > > + */ > > > + > > > +#include "vma_internal.h" > > > +#include "vma.h" > > > + > > > +/* > > > + * Relocate a VMA downwards by shift bytes. 
There cannot be any VMAs between > > > + * this VMA and its relocated range, which will now reside at [vma->vm_start - > > > + * shift, vma->vm_end - shift). > > > + * > > > + * This function is almost certainly NOT what you want for anything other than > > > + * early executable temporary stack relocation. > > > + */ > > > +int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) > > > +{ > > > + /* > > > + * The process proceeds as follows: > > > + * > > > + * 1) Use shift to calculate the new vma endpoints. > > > + * 2) Extend vma to cover both the old and new ranges. This ensures the > > > + * arguments passed to subsequent functions are consistent. > > > + * 3) Move vma's page tables to the new range. > > > + * 4) Free up any cleared pgd range. > > > + * 5) Shrink the vma to cover only the new range. > > > + */ > > > + > > > + struct mm_struct *mm = vma->vm_mm; > > > + unsigned long old_start = vma->vm_start; > > > + unsigned long old_end = vma->vm_end; > > > + unsigned long length = old_end - old_start; > > > + unsigned long new_start = old_start - shift; > > > + unsigned long new_end = old_end - shift; > > > + VMA_ITERATOR(vmi, mm, new_start); > > > + VMG_STATE(vmg, mm, &vmi, new_start, old_end, 0, vma->vm_pgoff); > > > + struct vm_area_struct *next; > > > + struct mmu_gather tlb; > > > + PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length); > > > + > > > + BUG_ON(new_start > new_end); > > > + > > > + /* > > > + * ensure there are no vmas between where we want to go > > > + * and where we are > > > + */ > > > + if (vma != vma_next(&vmi)) > > > + return -EFAULT; > > > + > > > + vma_iter_prev_range(&vmi); > > > + /* > > > + * cover the whole range: [new_start, old_end) > > > + */ > > > + vmg.middle = vma; > > > + if (vma_expand(&vmg)) > > > + return -ENOMEM; > > > + > > > + /* > > > + * move the page tables downwards, on failure we rely on > > > + * process cleanup to remove whatever mess we made. > > > + */ > > > + pmc.for_stack = true; > > > + if (length != move_page_tables(&pmc)) > > > + return -ENOMEM; > > > + > > > + tlb_gather_mmu(&tlb, mm); > > > + next = vma_next(&vmi); > > > + if (new_end > old_start) { > > > + /* > > > + * when the old and new regions overlap clear from new_end. > > > + */ > > > + free_pgd_range(&tlb, new_end, old_end, new_end, > > > + next ? next->vm_start : USER_PGTABLES_CEILING); > > > + } else { > > > + /* > > > + * otherwise, clean from old_start; this is done to not touch > > > + * the address space in [new_end, old_start) some architectures > > > + * have constraints on va-space that make this illegal (IA64) - > > > + * for the others its just a little faster. > > > + */ > > > + free_pgd_range(&tlb, old_start, old_end, new_end, > > > + next ? 
next->vm_start : USER_PGTABLES_CEILING); > > > + } > > > + tlb_finish_mmu(&tlb); > > > + > > > + vma_prev(&vmi); > > > + /* Shrink the vma to just the new range */ > > > + return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); > > > +} > > > diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile > > > index 860fd2311dcc..624040fcf193 100644 > > > --- a/tools/testing/vma/Makefile > > > +++ b/tools/testing/vma/Makefile > > > @@ -9,7 +9,7 @@ include ../shared/shared.mk > > > OFILES = $(SHARED_OFILES) vma.o maple-shim.o > > > TARGETS = vma > > > > > > -vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma.h > > > +vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_exec.c ../../../mm/vma.h > > > > > > vma: $(OFILES) > > > $(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS) > > > diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c > > > index 7cfd6e31db10..5832ae5d797d 100644 > > > --- a/tools/testing/vma/vma.c > > > +++ b/tools/testing/vma/vma.c > > > @@ -28,6 +28,7 @@ unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT; > > > * Directly import the VMA implementation here. Our vma_internal.h wrapper > > > * provides userland-equivalent functionality for everything vma.c uses. > > > */ > > > +#include "../../../mm/vma_exec.c" > > > #include "../../../mm/vma.c" > > > > > > const struct vm_operations_struct vma_dummy_vm_ops; > > > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h > > > index 572ab2cea763..0df19ca0000a 100644 > > > --- a/tools/testing/vma/vma_internal.h > > > +++ b/tools/testing/vma/vma_internal.h > > > @@ -421,6 +421,28 @@ struct vm_unmapped_area_info { > > > unsigned long start_gap; > > > }; > > > > > > +struct pagetable_move_control { > > > + struct vm_area_struct *old; /* Source VMA. */ > > > + struct vm_area_struct *new; /* Destination VMA. */ > > > + unsigned long old_addr; /* Address from which the move begins. */ > > > + unsigned long old_end; /* Exclusive address at which old range ends. */ > > > + unsigned long new_addr; /* Address to move page tables to. */ > > > + unsigned long len_in; /* Bytes to remap specified by user. */ > > > + > > > + bool need_rmap_locks; /* Do rmap locks need to be taken? */ > > > + bool for_stack; /* Is this an early temp stack being moved? */ > > > +}; > > > + > > > +#define PAGETABLE_MOVE(name, old_, new_, old_addr_, new_addr_, len_) \ > > > + struct pagetable_move_control name = { \ > > > + .old = old_, \ > > > + .new = new_, \ > > > + .old_addr = old_addr_, \ > > > + .old_end = (old_addr_) + (len_), \ > > > + .new_addr = new_addr_, \ > > > + .len_in = len_, \ > > > + } > > > + > > > static inline void vma_iter_invalidate(struct vma_iterator *vmi) > > > { > > > mas_pause(&vmi->mas); > > > @@ -1240,4 +1262,22 @@ static inline int mapping_map_writable(struct address_space *mapping) > > > return 0; > > > } > > > > > > +static inline unsigned long move_page_tables(struct pagetable_move_control *pmc) > > > +{ > > > + (void)pmc; > > > + > > > + return 0; > > > +} > > > + > > > +static inline void free_pgd_range(struct mmu_gather *tlb, > > > + unsigned long addr, unsigned long end, > > > + unsigned long floor, unsigned long ceiling) > > > +{ > > > + (void)tlb; > > > + (void)addr; > > > + (void)end; > > > + (void)floor; > > > + (void)ceiling; > > > +} > > > + > > > #endif /* __MM_VMA_INTERNAL_H */ > > > -- > > > 2.49.0 > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality 2025-04-28 20:26 ` Lorenzo Stoakes @ 2025-04-28 23:08 ` Andrew Morton 0 siblings, 0 replies; 36+ messages in thread From: Andrew Morton @ 2025-04-28 23:08 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Suren Baghdasaryan, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-fsdevel, linux-kernel On Mon, 28 Apr 2025 21:26:29 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > Andrew - I typo'd /* vma_exec.h */ below in the change to mm/vma.h - would it be > possible to correct to vma_exec.c, or would a fixpatch make life easier? > I did this: --- a/mm/vma.h~mm-establish-mm-vma_execc-for-shared-exec-mm-vma-functionality-fix +++ a/mm/vma.h @@ -548,7 +548,7 @@ int expand_downwards(struct vm_area_stru int __vm_munmap(unsigned long start, size_t len, bool unlock); -/* vma_exec.h */ +/* vma_exec.c */ #ifdef CONFIG_MMU int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); #endif _ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality 2025-04-28 15:28 ` [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality Lorenzo Stoakes 2025-04-28 19:12 ` Liam R. Howlett @ 2025-04-29 6:59 ` Vlastimil Babka 2025-04-29 16:53 ` Kees Cook ` (2 subsequent siblings) 4 siblings, 0 replies; 36+ messages in thread From: Vlastimil Babka @ 2025-04-29 6:59 UTC (permalink / raw) To: Lorenzo Stoakes, Andrew Morton Cc: Liam R . Howlett, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On 4/28/25 17:28, Lorenzo Stoakes wrote: > There is functionality that overlaps the exec and memory mapping > subsystems. While it properly belongs in mm, it is important that exec > maintainers maintain oversight of this functionality correctly. > > We can establish both goals by adding a new mm/vma_exec.c file which > contains these 'glue' functions, and have fs/exec.c import them. > > As a part of this change, to ensure that proper oversight is achieved, add > the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. > > scripts/get_maintainer.pl can correctly handle files in multiple entries > and this neatly handles the cross-over. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality 2025-04-28 15:28 ` [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality Lorenzo Stoakes 2025-04-28 19:12 ` Liam R. Howlett 2025-04-29 6:59 ` Vlastimil Babka @ 2025-04-29 16:53 ` Kees Cook 2025-04-29 17:22 ` David Hildenbrand 2025-04-29 17:48 ` Pedro Falcato 4 siblings, 0 replies; 36+ messages in thread From: Kees Cook @ 2025-04-29 16:53 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Mon, Apr 28, 2025 at 04:28:14PM +0100, Lorenzo Stoakes wrote: > There is functionality that overlaps the exec and memory mapping > subsystems. While it properly belongs in mm, it is important that exec > maintainers maintain oversight of this functionality correctly. > > We can establish both goals by adding a new mm/vma_exec.c file which > contains these 'glue' functions, and have fs/exec.c import them. > > As a part of this change, to ensure that proper oversight is achieved, add > the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. > > scripts/get_maintainer.pl can correctly handle files in multiple entries > and this neatly handles the cross-over. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> (I realize I didn't actually send tags...) Reviewed-by: Kees Cook <kees@kernel.org> -- Kees Cook ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality 2025-04-28 15:28 ` [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality Lorenzo Stoakes ` (2 preceding siblings ...) 2025-04-29 16:53 ` Kees Cook @ 2025-04-29 17:22 ` David Hildenbrand 2025-04-29 17:48 ` Pedro Falcato 4 siblings, 0 replies; 36+ messages in thread From: David Hildenbrand @ 2025-04-29 17:22 UTC (permalink / raw) To: Lorenzo Stoakes, Andrew Morton Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On 28.04.25 17:28, Lorenzo Stoakes wrote: > There is functionality that overlaps the exec and memory mapping > subsystems. While it properly belongs in mm, it is important that exec > maintainers maintain oversight of this functionality correctly. > > We can establish both goals by adding a new mm/vma_exec.c file which > contains these 'glue' functions, and have fs/exec.c import them. > > As a part of this change, to ensure that proper oversight is achieved, add > the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. > > scripts/get_maintainer.pl can correctly handle files in multiple entries > and this neatly handles the cross-over. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > --- Reviewed-by: David Hildenbrand <david@redhat.com> -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality 2025-04-28 15:28 ` [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality Lorenzo Stoakes ` (3 preceding siblings ...) 2025-04-29 17:22 ` David Hildenbrand @ 2025-04-29 17:48 ` Pedro Falcato 4 siblings, 0 replies; 36+ messages in thread From: Pedro Falcato @ 2025-04-29 17:48 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Mon, Apr 28, 2025 at 04:28:14PM +0100, Lorenzo Stoakes wrote: > There is functionality that overlaps the exec and memory mapping > subsystems. While it properly belongs in mm, it is important that exec > maintainers maintain oversight of this functionality correctly. > > We can establish both goals by adding a new mm/vma_exec.c file which > contains these 'glue' functions, and have fs/exec.c import them. > > As a part of this change, to ensure that proper oversight is achieved, add > the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. > > scripts/get_maintainer.pl can correctly handle files in multiple entries > and this neatly handles the cross-over. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> -- Pedro ^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v3 2/4] mm: abstract initial stack setup to mm subsystem 2025-04-28 15:28 [PATCH v3 0/4] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes 2025-04-28 15:28 ` [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality Lorenzo Stoakes @ 2025-04-28 15:28 ` Lorenzo Stoakes 2025-04-28 19:12 ` Liam R. Howlett ` (3 more replies) 2025-04-28 15:28 ` [PATCH v3 3/4] mm: move dup_mmap() to mm Lorenzo Stoakes ` (2 subsequent siblings) 4 siblings, 4 replies; 36+ messages in thread From: Lorenzo Stoakes @ 2025-04-28 15:28 UTC (permalink / raw) To: Andrew Morton Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel There are peculiarities within the kernel where what is very clearly mm code is performed elsewhere arbitrarily. This violates separation of concerns and makes it harder to refactor code to make changes to how fundamental initialisation and operation of mm logic is performed. One such case is the creation of the VMA containing the initial stack upon execve()'ing a new process. This is currently performed in __bprm_mm_init() in fs/exec.c. Abstract this operation to create_init_stack_vma(). This allows us to limit use of vma allocation and free code to fork and mm only. We previously did the same for the step at which we relocate the initial stack VMA downwards via relocate_vma_down(), now we move the initial VMA establishment too. Take the opportunity to also move insert_vm_struct() to mm/vma.c as it's no longer needed anywhere outside of mm. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> --- fs/exec.c | 66 +++--------------------------- mm/mmap.c | 42 ------------------- mm/vma.c | 43 ++++++++++++++++++++ mm/vma.h | 4 ++ mm/vma_exec.c | 69 ++++++++++++++++++++++++++++++++ tools/testing/vma/vma_internal.h | 32 +++++++++++++++ 6 files changed, 153 insertions(+), 103 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 477bc3f2e966..f9bbcf0016a4 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -245,60 +245,6 @@ static void flush_arg_page(struct linux_binprm *bprm, unsigned long pos, flush_cache_page(bprm->vma, pos, page_to_pfn(page)); } -static int __bprm_mm_init(struct linux_binprm *bprm) -{ - int err; - struct vm_area_struct *vma = NULL; - struct mm_struct *mm = bprm->mm; - - bprm->vma = vma = vm_area_alloc(mm); - if (!vma) - return -ENOMEM; - vma_set_anonymous(vma); - - if (mmap_write_lock_killable(mm)) { - err = -EINTR; - goto err_free; - } - - /* - * Need to be called with mmap write lock - * held, to avoid race with ksmd. - */ - err = ksm_execve(mm); - if (err) - goto err_ksm; - - /* - * Place the stack at the largest stack address the architecture - * supports. Later, we'll move this to an appropriate place. We don't - * use STACK_TOP because that can depend on attributes which aren't - * configured yet. 
- */ - BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP); - vma->vm_end = STACK_TOP_MAX; - vma->vm_start = vma->vm_end - PAGE_SIZE; - vm_flags_init(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP); - vma->vm_page_prot = vm_get_page_prot(vma->vm_flags); - - err = insert_vm_struct(mm, vma); - if (err) - goto err; - - mm->stack_vm = mm->total_vm = 1; - mmap_write_unlock(mm); - bprm->p = vma->vm_end - sizeof(void *); - return 0; -err: - ksm_exit(mm); -err_ksm: - mmap_write_unlock(mm); -err_free: - bprm->vma = NULL; - vm_area_free(vma); - return err; -} - static bool valid_arg_len(struct linux_binprm *bprm, long len) { return len <= MAX_ARG_STRLEN; @@ -351,12 +297,6 @@ static void flush_arg_page(struct linux_binprm *bprm, unsigned long pos, { } -static int __bprm_mm_init(struct linux_binprm *bprm) -{ - bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *); - return 0; -} - static bool valid_arg_len(struct linux_binprm *bprm, long len) { return len <= bprm->p; @@ -385,9 +325,13 @@ static int bprm_mm_init(struct linux_binprm *bprm) bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK]; task_unlock(current->group_leader); - err = __bprm_mm_init(bprm); +#ifndef CONFIG_MMU + bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *); +#else + err = create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p); if (err) goto err; +#endif return 0; diff --git a/mm/mmap.c b/mm/mmap.c index 1794bf6f4dc0..9e09eac0021c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1321,48 +1321,6 @@ void exit_mmap(struct mm_struct *mm) vm_unacct_memory(nr_accounted); } -/* Insert vm structure into process list sorted by address - * and into the inode's i_mmap tree. If vm_file is non-NULL - * then i_mmap_rwsem is taken here. - */ -int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) -{ - unsigned long charged = vma_pages(vma); - - - if (find_vma_intersection(mm, vma->vm_start, vma->vm_end)) - return -ENOMEM; - - if ((vma->vm_flags & VM_ACCOUNT) && - security_vm_enough_memory_mm(mm, charged)) - return -ENOMEM; - - /* - * The vm_pgoff of a purely anonymous vma should be irrelevant - * until its first write fault, when page's anon_vma and index - * are set. But now set the vm_pgoff it will almost certainly - * end up with (unless mremap moves it elsewhere before that - * first wfault), so /proc/pid/maps tells a consistent story. - * - * By setting it to reflect the virtual start address of the - * vma, merges and splits can happen in a seamless way, just - * using the existing file pgoff checks and manipulations. - * Similarly in do_mmap and in do_brk_flags. - */ - if (vma_is_anonymous(vma)) { - BUG_ON(vma->anon_vma); - vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; - } - - if (vma_link(mm, vma)) { - if (vma->vm_flags & VM_ACCOUNT) - vm_unacct_memory(charged); - return -ENOMEM; - } - - return 0; -} - /* * Return true if the calling process may expand its vm space by the passed * number of pages diff --git a/mm/vma.c b/mm/vma.c index 8a6c5e835759..1f2634b29568 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -3052,3 +3052,46 @@ int __vm_munmap(unsigned long start, size_t len, bool unlock) userfaultfd_unmap_complete(mm, &uf); return ret; } + + +/* Insert vm structure into process list sorted by address + * and into the inode's i_mmap tree. If vm_file is non-NULL + * then i_mmap_rwsem is taken here. 
+ */ +int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) +{ + unsigned long charged = vma_pages(vma); + + + if (find_vma_intersection(mm, vma->vm_start, vma->vm_end)) + return -ENOMEM; + + if ((vma->vm_flags & VM_ACCOUNT) && + security_vm_enough_memory_mm(mm, charged)) + return -ENOMEM; + + /* + * The vm_pgoff of a purely anonymous vma should be irrelevant + * until its first write fault, when page's anon_vma and index + * are set. But now set the vm_pgoff it will almost certainly + * end up with (unless mremap moves it elsewhere before that + * first wfault), so /proc/pid/maps tells a consistent story. + * + * By setting it to reflect the virtual start address of the + * vma, merges and splits can happen in a seamless way, just + * using the existing file pgoff checks and manipulations. + * Similarly in do_mmap and in do_brk_flags. + */ + if (vma_is_anonymous(vma)) { + BUG_ON(vma->anon_vma); + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; + } + + if (vma_link(mm, vma)) { + if (vma->vm_flags & VM_ACCOUNT) + vm_unacct_memory(charged); + return -ENOMEM; + } + + return 0; +} diff --git a/mm/vma.h b/mm/vma.h index 1ce3e18f01b7..94307a2e4ab6 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -548,8 +548,12 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address); int __vm_munmap(unsigned long start, size_t len, bool unlock); +int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma); + /* vma_exec.h */ #ifdef CONFIG_MMU +int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap, + unsigned long *top_mem_p); int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); #endif diff --git a/mm/vma_exec.c b/mm/vma_exec.c index 6736ae37f748..2dffb02ed6a2 100644 --- a/mm/vma_exec.c +++ b/mm/vma_exec.c @@ -90,3 +90,72 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) /* Shrink the vma to just the new range */ return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); } + +/* + * Establish the stack VMA in an execve'd process, located temporarily at the + * maximum stack address provided by the architecture. + * + * We later relocate this downwards in relocate_vma_down(). + * + * This function is almost certainly NOT what you want for anything other than + * early executable initialisation. + * + * On success, returns 0 and sets *vmap to the stack VMA and *top_mem_p to the + * maximum addressable location in the stack (that is capable of storing a + * system word of data). + */ +int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap, + unsigned long *top_mem_p) +{ + int err; + struct vm_area_struct *vma = vm_area_alloc(mm); + + if (!vma) + return -ENOMEM; + + vma_set_anonymous(vma); + + if (mmap_write_lock_killable(mm)) { + err = -EINTR; + goto err_free; + } + + /* + * Need to be called with mmap write lock + * held, to avoid race with ksmd. + */ + err = ksm_execve(mm); + if (err) + goto err_ksm; + + /* + * Place the stack at the largest stack address the architecture + * supports. Later, we'll move this to an appropriate place. We don't + * use STACK_TOP because that can depend on attributes which aren't + * configured yet. 
+ */ + BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP); + vma->vm_end = STACK_TOP_MAX; + vma->vm_start = vma->vm_end - PAGE_SIZE; + vm_flags_init(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP); + vma->vm_page_prot = vm_get_page_prot(vma->vm_flags); + + err = insert_vm_struct(mm, vma); + if (err) + goto err; + + mm->stack_vm = mm->total_vm = 1; + mmap_write_unlock(mm); + *vmap = vma; + *top_mem_p = vma->vm_end - sizeof(void *); + return 0; + +err: + ksm_exit(mm); +err_ksm: + mmap_write_unlock(mm); +err_free: + *vmap = NULL; + vm_area_free(vma); + return err; +} diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 0df19ca0000a..32e990313158 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -56,6 +56,8 @@ extern unsigned long dac_mmap_min_addr; #define VM_PFNMAP 0x00000400 #define VM_LOCKED 0x00002000 #define VM_IO 0x00004000 +#define VM_SEQ_READ 0x00008000 /* App will access data sequentially */ +#define VM_RAND_READ 0x00010000 /* App will not benefit from clustered reads */ #define VM_DONTEXPAND 0x00040000 #define VM_LOCKONFAULT 0x00080000 #define VM_ACCOUNT 0x00100000 @@ -70,6 +72,20 @@ extern unsigned long dac_mmap_min_addr; #define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC) #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP) +#ifdef CONFIG_STACK_GROWSUP +#define VM_STACK VM_GROWSUP +#define VM_STACK_EARLY VM_GROWSDOWN +#else +#define VM_STACK VM_GROWSDOWN +#define VM_STACK_EARLY 0 +#endif + +#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE) +#define TASK_SIZE_LOW DEFAULT_MAP_WINDOW +#define TASK_SIZE_MAX DEFAULT_MAP_WINDOW +#define STACK_TOP TASK_SIZE_LOW +#define STACK_TOP_MAX TASK_SIZE_MAX + /* This mask represents all the VMA flag bits used by mlock */ #define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT) @@ -82,6 +98,10 @@ extern unsigned long dac_mmap_min_addr; #define VM_STARTGAP_FLAGS (VM_GROWSDOWN | VM_SHADOW_STACK) +#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS +#define VM_STACK_FLAGS (VM_STACK | VM_STACK_DEFAULT_FLAGS | VM_ACCOUNT) +#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY) + #define RLIMIT_STACK 3 /* max stack size */ #define RLIMIT_MEMLOCK 8 /* max locked-in-memory address space */ @@ -1280,4 +1300,16 @@ static inline void free_pgd_range(struct mmu_gather *tlb, (void)ceiling; } +static inline int ksm_execve(struct mm_struct *mm) +{ + (void)mm; + + return 0; +} + +static inline void ksm_exit(struct mm_struct *mm) +{ + (void)mm; +} + #endif /* __MM_VMA_INTERNAL_H */ -- 2.49.0 ^ permalink raw reply related [flat|nested] 36+ messages in thread
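[For a concrete picture of what create_init_stack_vma() hands back, here is a small worked example assuming the x86-64-style constants mirrored in tools/testing/vma/vma_internal.h above (illustrative values only, not an additional test from the series): vm_end = STACK_TOP_MAX = (1UL << 47) - PAGE_SIZE = 0x7ffffffff000, vm_start = vm_end - PAGE_SIZE = 0x7fffffffe000, and *top_mem_p = vm_end - sizeof(void *) = 0x7fffffffeff8 on a 64-bit build. In other words the caller in fs/exec.c receives a single-page anonymous VMA pinned at the top of the address space, flagged VM_STACK_INCOMPLETE_SETUP, with bprm->p pointing at the highest slot able to hold a pointer; relocate_vma_down() later moves it to its final location.]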
* Re: [PATCH v3 2/4] mm: abstract initial stack setup to mm subsystem 2025-04-28 15:28 ` [PATCH v3 2/4] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes @ 2025-04-28 19:12 ` Liam R. Howlett 2025-04-29 7:04 ` Vlastimil Babka ` (2 subsequent siblings) 3 siblings, 0 replies; 36+ messages in thread From: Liam R. Howlett @ 2025-04-28 19:12 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250428 11:28]: > There are peculiarities within the kernel where what is very clearly mm > code is performed elsewhere arbitrarily. > > This violates separation of concerns and makes it harder to refactor code > to make changes to how fundamental initialisation and operation of mm logic > is performed. > > One such case is the creation of the VMA containing the initial stack upon > execve()'ing a new process. This is currently performed in __bprm_mm_init() > in fs/exec.c. > > Abstract this operation to create_init_stack_vma(). This allows us to limit > use of vma allocation and free code to fork and mm only. > > We previously did the same for the step at which we relocate the initial > stack VMA downwards via relocate_vma_down(), now we move the initial VMA > establishment too. > > Take the opportunity to also move insert_vm_struct() to mm/vma.c as it's no > longer needed anywhere outside of mm. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Acked-by: David Hildenbrand <david@redhat.com> > Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> > --- > fs/exec.c | 66 +++--------------------------- > mm/mmap.c | 42 ------------------- > mm/vma.c | 43 ++++++++++++++++++++ > mm/vma.h | 4 ++ > mm/vma_exec.c | 69 ++++++++++++++++++++++++++++++++ > tools/testing/vma/vma_internal.h | 32 +++++++++++++++ > 6 files changed, 153 insertions(+), 103 deletions(-) > > diff --git a/fs/exec.c b/fs/exec.c > index 477bc3f2e966..f9bbcf0016a4 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -245,60 +245,6 @@ static void flush_arg_page(struct linux_binprm *bprm, unsigned long pos, > flush_cache_page(bprm->vma, pos, page_to_pfn(page)); > } > > -static int __bprm_mm_init(struct linux_binprm *bprm) > -{ > - int err; > - struct vm_area_struct *vma = NULL; > - struct mm_struct *mm = bprm->mm; > - > - bprm->vma = vma = vm_area_alloc(mm); > - if (!vma) > - return -ENOMEM; > - vma_set_anonymous(vma); > - > - if (mmap_write_lock_killable(mm)) { > - err = -EINTR; > - goto err_free; > - } > - > - /* > - * Need to be called with mmap write lock > - * held, to avoid race with ksmd. > - */ > - err = ksm_execve(mm); > - if (err) > - goto err_ksm; > - > - /* > - * Place the stack at the largest stack address the architecture > - * supports. Later, we'll move this to an appropriate place. We don't > - * use STACK_TOP because that can depend on attributes which aren't > - * configured yet. 
> - */ > - BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP); > - vma->vm_end = STACK_TOP_MAX; > - vma->vm_start = vma->vm_end - PAGE_SIZE; > - vm_flags_init(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP); > - vma->vm_page_prot = vm_get_page_prot(vma->vm_flags); > - > - err = insert_vm_struct(mm, vma); > - if (err) > - goto err; > - > - mm->stack_vm = mm->total_vm = 1; > - mmap_write_unlock(mm); > - bprm->p = vma->vm_end - sizeof(void *); > - return 0; > -err: > - ksm_exit(mm); > -err_ksm: > - mmap_write_unlock(mm); > -err_free: > - bprm->vma = NULL; > - vm_area_free(vma); > - return err; > -} > - > static bool valid_arg_len(struct linux_binprm *bprm, long len) > { > return len <= MAX_ARG_STRLEN; > @@ -351,12 +297,6 @@ static void flush_arg_page(struct linux_binprm *bprm, unsigned long pos, > { > } > > -static int __bprm_mm_init(struct linux_binprm *bprm) > -{ > - bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *); > - return 0; > -} > - > static bool valid_arg_len(struct linux_binprm *bprm, long len) > { > return len <= bprm->p; > @@ -385,9 +325,13 @@ static int bprm_mm_init(struct linux_binprm *bprm) > bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK]; > task_unlock(current->group_leader); > > - err = __bprm_mm_init(bprm); > +#ifndef CONFIG_MMU > + bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *); > +#else > + err = create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p); > if (err) > goto err; > +#endif > > return 0; > > diff --git a/mm/mmap.c b/mm/mmap.c > index 1794bf6f4dc0..9e09eac0021c 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1321,48 +1321,6 @@ void exit_mmap(struct mm_struct *mm) > vm_unacct_memory(nr_accounted); > } > > -/* Insert vm structure into process list sorted by address > - * and into the inode's i_mmap tree. If vm_file is non-NULL > - * then i_mmap_rwsem is taken here. > - */ > -int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) > -{ > - unsigned long charged = vma_pages(vma); > - > - > - if (find_vma_intersection(mm, vma->vm_start, vma->vm_end)) > - return -ENOMEM; > - > - if ((vma->vm_flags & VM_ACCOUNT) && > - security_vm_enough_memory_mm(mm, charged)) > - return -ENOMEM; > - > - /* > - * The vm_pgoff of a purely anonymous vma should be irrelevant > - * until its first write fault, when page's anon_vma and index > - * are set. But now set the vm_pgoff it will almost certainly > - * end up with (unless mremap moves it elsewhere before that > - * first wfault), so /proc/pid/maps tells a consistent story. > - * > - * By setting it to reflect the virtual start address of the > - * vma, merges and splits can happen in a seamless way, just > - * using the existing file pgoff checks and manipulations. > - * Similarly in do_mmap and in do_brk_flags. > - */ > - if (vma_is_anonymous(vma)) { > - BUG_ON(vma->anon_vma); > - vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; > - } > - > - if (vma_link(mm, vma)) { > - if (vma->vm_flags & VM_ACCOUNT) > - vm_unacct_memory(charged); > - return -ENOMEM; > - } > - > - return 0; > -} > - > /* > * Return true if the calling process may expand its vm space by the passed > * number of pages > diff --git a/mm/vma.c b/mm/vma.c > index 8a6c5e835759..1f2634b29568 100644 > --- a/mm/vma.c > +++ b/mm/vma.c > @@ -3052,3 +3052,46 @@ int __vm_munmap(unsigned long start, size_t len, bool unlock) > userfaultfd_unmap_complete(mm, &uf); > return ret; > } > + > + > +/* Insert vm structure into process list sorted by address > + * and into the inode's i_mmap tree. 
If vm_file is non-NULL > + * then i_mmap_rwsem is taken here. > + */ > +int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) > +{ > + unsigned long charged = vma_pages(vma); > + > + > + if (find_vma_intersection(mm, vma->vm_start, vma->vm_end)) > + return -ENOMEM; > + > + if ((vma->vm_flags & VM_ACCOUNT) && > + security_vm_enough_memory_mm(mm, charged)) > + return -ENOMEM; > + > + /* > + * The vm_pgoff of a purely anonymous vma should be irrelevant > + * until its first write fault, when page's anon_vma and index > + * are set. But now set the vm_pgoff it will almost certainly > + * end up with (unless mremap moves it elsewhere before that > + * first wfault), so /proc/pid/maps tells a consistent story. > + * > + * By setting it to reflect the virtual start address of the > + * vma, merges and splits can happen in a seamless way, just > + * using the existing file pgoff checks and manipulations. > + * Similarly in do_mmap and in do_brk_flags. > + */ > + if (vma_is_anonymous(vma)) { > + BUG_ON(vma->anon_vma); > + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; > + } > + > + if (vma_link(mm, vma)) { > + if (vma->vm_flags & VM_ACCOUNT) > + vm_unacct_memory(charged); > + return -ENOMEM; > + } > + > + return 0; > +} > diff --git a/mm/vma.h b/mm/vma.h > index 1ce3e18f01b7..94307a2e4ab6 100644 > --- a/mm/vma.h > +++ b/mm/vma.h > @@ -548,8 +548,12 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address); > > int __vm_munmap(unsigned long start, size_t len, bool unlock); > > +int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma); > + > /* vma_exec.h */ > #ifdef CONFIG_MMU > +int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap, > + unsigned long *top_mem_p); > int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); > #endif > > diff --git a/mm/vma_exec.c b/mm/vma_exec.c > index 6736ae37f748..2dffb02ed6a2 100644 > --- a/mm/vma_exec.c > +++ b/mm/vma_exec.c > @@ -90,3 +90,72 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) > /* Shrink the vma to just the new range */ > return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); > } > + > +/* > + * Establish the stack VMA in an execve'd process, located temporarily at the > + * maximum stack address provided by the architecture. > + * > + * We later relocate this downwards in relocate_vma_down(). > + * > + * This function is almost certainly NOT what you want for anything other than > + * early executable initialisation. > + * > + * On success, returns 0 and sets *vmap to the stack VMA and *top_mem_p to the > + * maximum addressable location in the stack (that is capable of storing a > + * system word of data). > + */ > +int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap, > + unsigned long *top_mem_p) > +{ > + int err; > + struct vm_area_struct *vma = vm_area_alloc(mm); > + > + if (!vma) > + return -ENOMEM; > + > + vma_set_anonymous(vma); > + > + if (mmap_write_lock_killable(mm)) { > + err = -EINTR; > + goto err_free; > + } > + > + /* > + * Need to be called with mmap write lock > + * held, to avoid race with ksmd. > + */ > + err = ksm_execve(mm); > + if (err) > + goto err_ksm; > + > + /* > + * Place the stack at the largest stack address the architecture > + * supports. Later, we'll move this to an appropriate place. We don't > + * use STACK_TOP because that can depend on attributes which aren't > + * configured yet. 
> + */ > + BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP); > + vma->vm_end = STACK_TOP_MAX; > + vma->vm_start = vma->vm_end - PAGE_SIZE; > + vm_flags_init(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP); > + vma->vm_page_prot = vm_get_page_prot(vma->vm_flags); > + > + err = insert_vm_struct(mm, vma); > + if (err) > + goto err; > + > + mm->stack_vm = mm->total_vm = 1; > + mmap_write_unlock(mm); > + *vmap = vma; > + *top_mem_p = vma->vm_end - sizeof(void *); > + return 0; > + > +err: > + ksm_exit(mm); > +err_ksm: > + mmap_write_unlock(mm); > +err_free: > + *vmap = NULL; > + vm_area_free(vma); > + return err; > +} > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h > index 0df19ca0000a..32e990313158 100644 > --- a/tools/testing/vma/vma_internal.h > +++ b/tools/testing/vma/vma_internal.h > @@ -56,6 +56,8 @@ extern unsigned long dac_mmap_min_addr; > #define VM_PFNMAP 0x00000400 > #define VM_LOCKED 0x00002000 > #define VM_IO 0x00004000 > +#define VM_SEQ_READ 0x00008000 /* App will access data sequentially */ > +#define VM_RAND_READ 0x00010000 /* App will not benefit from clustered reads */ > #define VM_DONTEXPAND 0x00040000 > #define VM_LOCKONFAULT 0x00080000 > #define VM_ACCOUNT 0x00100000 > @@ -70,6 +72,20 @@ extern unsigned long dac_mmap_min_addr; > #define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC) > #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP) > > +#ifdef CONFIG_STACK_GROWSUP > +#define VM_STACK VM_GROWSUP > +#define VM_STACK_EARLY VM_GROWSDOWN > +#else > +#define VM_STACK VM_GROWSDOWN > +#define VM_STACK_EARLY 0 > +#endif > + > +#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE) > +#define TASK_SIZE_LOW DEFAULT_MAP_WINDOW > +#define TASK_SIZE_MAX DEFAULT_MAP_WINDOW > +#define STACK_TOP TASK_SIZE_LOW > +#define STACK_TOP_MAX TASK_SIZE_MAX > + > /* This mask represents all the VMA flag bits used by mlock */ > #define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT) > > @@ -82,6 +98,10 @@ extern unsigned long dac_mmap_min_addr; > > #define VM_STARTGAP_FLAGS (VM_GROWSDOWN | VM_SHADOW_STACK) > > +#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS > +#define VM_STACK_FLAGS (VM_STACK | VM_STACK_DEFAULT_FLAGS | VM_ACCOUNT) > +#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY) > + > #define RLIMIT_STACK 3 /* max stack size */ > #define RLIMIT_MEMLOCK 8 /* max locked-in-memory address space */ > > @@ -1280,4 +1300,16 @@ static inline void free_pgd_range(struct mmu_gather *tlb, > (void)ceiling; > } > > +static inline int ksm_execve(struct mm_struct *mm) > +{ > + (void)mm; > + > + return 0; > +} > + > +static inline void ksm_exit(struct mm_struct *mm) > +{ > + (void)mm; > +} > + > #endif /* __MM_VMA_INTERNAL_H */ > -- > 2.49.0 > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 2/4] mm: abstract initial stack setup to mm subsystem 2025-04-28 15:28 ` [PATCH v3 2/4] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes 2025-04-28 19:12 ` Liam R. Howlett @ 2025-04-29 7:04 ` Vlastimil Babka 2025-04-29 16:54 ` Kees Cook 2025-04-29 17:48 ` Pedro Falcato 3 siblings, 0 replies; 36+ messages in thread From: Vlastimil Babka @ 2025-04-29 7:04 UTC (permalink / raw) To: Lorenzo Stoakes, Andrew Morton Cc: Liam R . Howlett, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On 4/28/25 17:28, Lorenzo Stoakes wrote: > There are peculiarities within the kernel where what is very clearly mm > code is performed elsewhere arbitrarily. > > This violates separation of concerns and makes it harder to refactor code > to make changes to how fundamental initialisation and operation of mm logic > is performed. > > One such case is the creation of the VMA containing the initial stack upon > execve()'ing a new process. This is currently performed in __bprm_mm_init() > in fs/exec.c. > > Abstract this operation to create_init_stack_vma(). This allows us to limit > use of vma allocation and free code to fork and mm only. > > We previously did the same for the step at which we relocate the initial > stack VMA downwards via relocate_vma_down(), now we move the initial VMA > establishment too. > > Take the opportunity to also move insert_vm_struct() to mm/vma.c as it's no > longer needed anywhere outside of mm. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Acked-by: David Hildenbrand <david@redhat.com> > Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 2/4] mm: abstract initial stack setup to mm subsystem 2025-04-28 15:28 ` [PATCH v3 2/4] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes 2025-04-28 19:12 ` Liam R. Howlett 2025-04-29 7:04 ` Vlastimil Babka @ 2025-04-29 16:54 ` Kees Cook 2025-04-29 17:48 ` Pedro Falcato 3 siblings, 0 replies; 36+ messages in thread From: Kees Cook @ 2025-04-29 16:54 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Mon, Apr 28, 2025 at 04:28:15PM +0100, Lorenzo Stoakes wrote: > There are peculiarities within the kernel where what is very clearly mm > code is performed elsewhere arbitrarily. > > This violates separation of concerns and makes it harder to refactor code > to make changes to how fundamental initialisation and operation of mm logic > is performed. > > One such case is the creation of the VMA containing the initial stack upon > execve()'ing a new process. This is currently performed in __bprm_mm_init() > in fs/exec.c. > > Abstract this operation to create_init_stack_vma(). This allows us to limit > use of vma allocation and free code to fork and mm only. > > We previously did the same for the step at which we relocate the initial > stack VMA downwards via relocate_vma_down(), now we move the initial VMA > establishment too. > > Take the opportunity to also move insert_vm_struct() to mm/vma.c as it's no > longer needed anywhere outside of mm. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Kees Cook <kees@kernel.org> -- Kees Cook ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 2/4] mm: abstract initial stack setup to mm subsystem 2025-04-28 15:28 ` [PATCH v3 2/4] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes ` (2 preceding siblings ...) 2025-04-29 16:54 ` Kees Cook @ 2025-04-29 17:48 ` Pedro Falcato 3 siblings, 0 replies; 36+ messages in thread From: Pedro Falcato @ 2025-04-29 17:48 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Mon, Apr 28, 2025 at 04:28:15PM +0100, Lorenzo Stoakes wrote: > There are peculiarities within the kernel where what is very clearly mm > code is performed elsewhere arbitrarily. > > This violates separation of concerns and makes it harder to refactor code > to make changes to how fundamental initialisation and operation of mm logic > is performed. > > One such case is the creation of the VMA containing the initial stack upon > execve()'ing a new process. This is currently performed in __bprm_mm_init() > in fs/exec.c. > > Abstract this operation to create_init_stack_vma(). This allows us to limit > use of vma allocation and free code to fork and mm only. > > We previously did the same for the step at which we relocate the initial > stack VMA downwards via relocate_vma_down(), now we move the initial VMA > establishment too. > > Take the opportunity to also move insert_vm_struct() to mm/vma.c as it's no > longer needed anywhere outside of mm. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Acked-by: David Hildenbrand <david@redhat.com> > Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> -- Pedro ^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v3 3/4] mm: move dup_mmap() to mm 2025-04-28 15:28 [PATCH v3 0/4] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes 2025-04-28 15:28 ` [PATCH v3 1/4] mm: establish mm/vma_exec.c for shared exec/mm VMA functionality Lorenzo Stoakes 2025-04-28 15:28 ` [PATCH v3 2/4] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes @ 2025-04-28 15:28 ` Lorenzo Stoakes 2025-04-28 19:12 ` Liam R. Howlett ` (3 more replies) 2025-04-28 15:28 ` [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes 2025-04-29 7:28 ` [PATCH v3 0/4] move all VMA allocation, freeing and duplication logic to mm Vlastimil Babka 4 siblings, 4 replies; 36+ messages in thread From: Lorenzo Stoakes @ 2025-04-28 15:28 UTC (permalink / raw) To: Andrew Morton Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel This is a key step in our being able to abstract and isolate VMA allocation and destruction logic. This function is the last one where vm_area_free() and vm_area_dup() are directly referenced outside of mmap, so having this in mm allows us to isolate these. We do the same for the nommu version which is substantially simpler. We place the declaration for dup_mmap() in mm/internal.h and have kernel/fork.c import this in order to prevent improper use of this functionality elsewhere in the kernel. While we're here, we remove the useless #ifdef CONFIG_MMU check around mmap_read_lock_maybe_expand() in mmap.c, mmap.c is compiled only if CONFIG_MMU is set. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Pedro Falcato <pfalcato@suse.de> --- kernel/fork.c | 189 ++------------------------------------------------ mm/internal.h | 2 + mm/mmap.c | 181 +++++++++++++++++++++++++++++++++++++++++++++-- mm/nommu.c | 8 +++ 4 files changed, 189 insertions(+), 191 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 168681fc4b25..ac9f9267a473 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -112,6 +112,9 @@ #include <asm/cacheflush.h> #include <asm/tlbflush.h> +/* For dup_mmap(). */ +#include "../mm/internal.h" + #include <trace/events/sched.h> #define CREATE_TRACE_POINTS @@ -589,7 +592,7 @@ void free_task(struct task_struct *tsk) } EXPORT_SYMBOL(free_task); -static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) +void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) { struct file *exe_file; @@ -604,183 +607,6 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) } #ifdef CONFIG_MMU -static __latent_entropy int dup_mmap(struct mm_struct *mm, - struct mm_struct *oldmm) -{ - struct vm_area_struct *mpnt, *tmp; - int retval; - unsigned long charge = 0; - LIST_HEAD(uf); - VMA_ITERATOR(vmi, mm, 0); - - if (mmap_write_lock_killable(oldmm)) - return -EINTR; - flush_cache_dup_mm(oldmm); - uprobe_dup_mmap(oldmm, mm); - /* - * Not linked in yet - no deadlock potential: - */ - mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING); - - /* No ordering required: file already has been exposed. */ - dup_mm_exe_file(mm, oldmm); - - mm->total_vm = oldmm->total_vm; - mm->data_vm = oldmm->data_vm; - mm->exec_vm = oldmm->exec_vm; - mm->stack_vm = oldmm->stack_vm; - - /* Use __mt_dup() to efficiently build an identical maple tree. 
*/ - retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL); - if (unlikely(retval)) - goto out; - - mt_clear_in_rcu(vmi.mas.tree); - for_each_vma(vmi, mpnt) { - struct file *file; - - vma_start_write(mpnt); - if (mpnt->vm_flags & VM_DONTCOPY) { - retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, - mpnt->vm_end, GFP_KERNEL); - if (retval) - goto loop_out; - - vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); - continue; - } - charge = 0; - /* - * Don't duplicate many vmas if we've been oom-killed (for - * example) - */ - if (fatal_signal_pending(current)) { - retval = -EINTR; - goto loop_out; - } - if (mpnt->vm_flags & VM_ACCOUNT) { - unsigned long len = vma_pages(mpnt); - - if (security_vm_enough_memory_mm(oldmm, len)) /* sic */ - goto fail_nomem; - charge = len; - } - tmp = vm_area_dup(mpnt); - if (!tmp) - goto fail_nomem; - - /* track_pfn_copy() will later take care of copying internal state. */ - if (unlikely(tmp->vm_flags & VM_PFNMAP)) - untrack_pfn_clear(tmp); - - retval = vma_dup_policy(mpnt, tmp); - if (retval) - goto fail_nomem_policy; - tmp->vm_mm = mm; - retval = dup_userfaultfd(tmp, &uf); - if (retval) - goto fail_nomem_anon_vma_fork; - if (tmp->vm_flags & VM_WIPEONFORK) { - /* - * VM_WIPEONFORK gets a clean slate in the child. - * Don't prepare anon_vma until fault since we don't - * copy page for current vma. - */ - tmp->anon_vma = NULL; - } else if (anon_vma_fork(tmp, mpnt)) - goto fail_nomem_anon_vma_fork; - vm_flags_clear(tmp, VM_LOCKED_MASK); - /* - * Copy/update hugetlb private vma information. - */ - if (is_vm_hugetlb_page(tmp)) - hugetlb_dup_vma_private(tmp); - - /* - * Link the vma into the MT. After using __mt_dup(), memory - * allocation is not necessary here, so it cannot fail. - */ - vma_iter_bulk_store(&vmi, tmp); - - mm->map_count++; - - if (tmp->vm_ops && tmp->vm_ops->open) - tmp->vm_ops->open(tmp); - - file = tmp->vm_file; - if (file) { - struct address_space *mapping = file->f_mapping; - - get_file(file); - i_mmap_lock_write(mapping); - if (vma_is_shared_maywrite(tmp)) - mapping_allow_writable(mapping); - flush_dcache_mmap_lock(mapping); - /* insert tmp into the share list, just after mpnt */ - vma_interval_tree_insert_after(tmp, mpnt, - &mapping->i_mmap); - flush_dcache_mmap_unlock(mapping); - i_mmap_unlock_write(mapping); - } - - if (!(tmp->vm_flags & VM_WIPEONFORK)) - retval = copy_page_range(tmp, mpnt); - - if (retval) { - mpnt = vma_next(&vmi); - goto loop_out; - } - } - /* a new mm has just been created */ - retval = arch_dup_mmap(oldmm, mm); -loop_out: - vma_iter_free(&vmi); - if (!retval) { - mt_set_in_rcu(vmi.mas.tree); - ksm_fork(mm, oldmm); - khugepaged_fork(mm, oldmm); - } else { - - /* - * The entire maple tree has already been duplicated. If the - * mmap duplication fails, mark the failure point with - * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, - * stop releasing VMAs that have not been duplicated after this - * point. - */ - if (mpnt) { - mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1); - mas_store(&vmi.mas, XA_ZERO_ENTRY); - /* Avoid OOM iterating a broken tree */ - set_bit(MMF_OOM_SKIP, &mm->flags); - } - /* - * The mm_struct is going to exit, but the locks will be dropped - * first. Set the mm_struct as unstable is advisable as it is - * not fully initialised. 
- */ - set_bit(MMF_UNSTABLE, &mm->flags); - } -out: - mmap_write_unlock(mm); - flush_tlb_mm(oldmm); - mmap_write_unlock(oldmm); - if (!retval) - dup_userfaultfd_complete(&uf); - else - dup_userfaultfd_fail(&uf); - return retval; - -fail_nomem_anon_vma_fork: - mpol_put(vma_policy(tmp)); -fail_nomem_policy: - vm_area_free(tmp); -fail_nomem: - retval = -ENOMEM; - vm_unacct_memory(charge); - goto loop_out; -} - static inline int mm_alloc_pgd(struct mm_struct *mm) { mm->pgd = pgd_alloc(mm); @@ -794,13 +620,6 @@ static inline void mm_free_pgd(struct mm_struct *mm) pgd_free(mm, mm->pgd); } #else -static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) -{ - mmap_write_lock(oldmm); - dup_mm_exe_file(mm, oldmm); - mmap_write_unlock(oldmm); - return 0; -} #define mm_alloc_pgd(mm) (0) #define mm_free_pgd(mm) #endif /* CONFIG_MMU */ diff --git a/mm/internal.h b/mm/internal.h index 40464f755092..b3e011976f74 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1631,5 +1631,7 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end, } #endif /* CONFIG_PT_RECLAIM */ +void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm); +int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm); #endif /* __MM_INTERNAL_H */ diff --git a/mm/mmap.c b/mm/mmap.c index 9e09eac0021c..5259df031e15 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1675,7 +1675,6 @@ static int __meminit init_reserve_notifier(void) } subsys_initcall(init_reserve_notifier); -#ifdef CONFIG_MMU /* * Obtain a read lock on mm->mmap_lock, if the specified address is below the * start of the VMA, the intent is to perform a write, and it is a @@ -1719,10 +1718,180 @@ bool mmap_read_lock_maybe_expand(struct mm_struct *mm, mmap_write_downgrade(mm); return true; } -#else -bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long addr, bool write) + +__latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) { - return false; + struct vm_area_struct *mpnt, *tmp; + int retval; + unsigned long charge = 0; + LIST_HEAD(uf); + VMA_ITERATOR(vmi, mm, 0); + + if (mmap_write_lock_killable(oldmm)) + return -EINTR; + flush_cache_dup_mm(oldmm); + uprobe_dup_mmap(oldmm, mm); + /* + * Not linked in yet - no deadlock potential: + */ + mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING); + + /* No ordering required: file already has been exposed. */ + dup_mm_exe_file(mm, oldmm); + + mm->total_vm = oldmm->total_vm; + mm->data_vm = oldmm->data_vm; + mm->exec_vm = oldmm->exec_vm; + mm->stack_vm = oldmm->stack_vm; + + /* Use __mt_dup() to efficiently build an identical maple tree. */ + retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL); + if (unlikely(retval)) + goto out; + + mt_clear_in_rcu(vmi.mas.tree); + for_each_vma(vmi, mpnt) { + struct file *file; + + vma_start_write(mpnt); + if (mpnt->vm_flags & VM_DONTCOPY) { + retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, + mpnt->vm_end, GFP_KERNEL); + if (retval) + goto loop_out; + + vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); + continue; + } + charge = 0; + /* + * Don't duplicate many vmas if we've been oom-killed (for + * example) + */ + if (fatal_signal_pending(current)) { + retval = -EINTR; + goto loop_out; + } + if (mpnt->vm_flags & VM_ACCOUNT) { + unsigned long len = vma_pages(mpnt); + + if (security_vm_enough_memory_mm(oldmm, len)) /* sic */ + goto fail_nomem; + charge = len; + } + + tmp = vm_area_dup(mpnt); + if (!tmp) + goto fail_nomem; + + /* track_pfn_copy() will later take care of copying internal state. 
*/ + if (unlikely(tmp->vm_flags & VM_PFNMAP)) + untrack_pfn_clear(tmp); + + retval = vma_dup_policy(mpnt, tmp); + if (retval) + goto fail_nomem_policy; + tmp->vm_mm = mm; + retval = dup_userfaultfd(tmp, &uf); + if (retval) + goto fail_nomem_anon_vma_fork; + if (tmp->vm_flags & VM_WIPEONFORK) { + /* + * VM_WIPEONFORK gets a clean slate in the child. + * Don't prepare anon_vma until fault since we don't + * copy page for current vma. + */ + tmp->anon_vma = NULL; + } else if (anon_vma_fork(tmp, mpnt)) + goto fail_nomem_anon_vma_fork; + vm_flags_clear(tmp, VM_LOCKED_MASK); + /* + * Copy/update hugetlb private vma information. + */ + if (is_vm_hugetlb_page(tmp)) + hugetlb_dup_vma_private(tmp); + + /* + * Link the vma into the MT. After using __mt_dup(), memory + * allocation is not necessary here, so it cannot fail. + */ + vma_iter_bulk_store(&vmi, tmp); + + mm->map_count++; + + if (tmp->vm_ops && tmp->vm_ops->open) + tmp->vm_ops->open(tmp); + + file = tmp->vm_file; + if (file) { + struct address_space *mapping = file->f_mapping; + + get_file(file); + i_mmap_lock_write(mapping); + if (vma_is_shared_maywrite(tmp)) + mapping_allow_writable(mapping); + flush_dcache_mmap_lock(mapping); + /* insert tmp into the share list, just after mpnt */ + vma_interval_tree_insert_after(tmp, mpnt, + &mapping->i_mmap); + flush_dcache_mmap_unlock(mapping); + i_mmap_unlock_write(mapping); + } + + if (!(tmp->vm_flags & VM_WIPEONFORK)) + retval = copy_page_range(tmp, mpnt); + + if (retval) { + mpnt = vma_next(&vmi); + goto loop_out; + } + } + /* a new mm has just been created */ + retval = arch_dup_mmap(oldmm, mm); +loop_out: + vma_iter_free(&vmi); + if (!retval) { + mt_set_in_rcu(vmi.mas.tree); + ksm_fork(mm, oldmm); + khugepaged_fork(mm, oldmm); + } else { + + /* + * The entire maple tree has already been duplicated. If the + * mmap duplication fails, mark the failure point with + * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, + * stop releasing VMAs that have not been duplicated after this + * point. + */ + if (mpnt) { + mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1); + mas_store(&vmi.mas, XA_ZERO_ENTRY); + /* Avoid OOM iterating a broken tree */ + set_bit(MMF_OOM_SKIP, &mm->flags); + } + /* + * The mm_struct is going to exit, but the locks will be dropped + * first. Set the mm_struct as unstable is advisable as it is + * not fully initialised. + */ + set_bit(MMF_UNSTABLE, &mm->flags); + } +out: + mmap_write_unlock(mm); + flush_tlb_mm(oldmm); + mmap_write_unlock(oldmm); + if (!retval) + dup_userfaultfd_complete(&uf); + else + dup_userfaultfd_fail(&uf); + return retval; + +fail_nomem_anon_vma_fork: + mpol_put(vma_policy(tmp)); +fail_nomem_policy: + vm_area_free(tmp); +fail_nomem: + retval = -ENOMEM; + vm_unacct_memory(charge); + goto loop_out; } -#endif diff --git a/mm/nommu.c b/mm/nommu.c index 2b4d304c6445..a142fc258d39 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1874,3 +1874,11 @@ static int __meminit init_admin_reserve(void) return 0; } subsys_initcall(init_admin_reserve); + +int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) +{ + mmap_write_lock(oldmm); + dup_mm_exe_file(mm, oldmm); + mmap_write_unlock(oldmm); + return 0; +} -- 2.49.0 ^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH v3 3/4] mm: move dup_mmap() to mm 2025-04-28 15:28 ` [PATCH v3 3/4] mm: move dup_mmap() to mm Lorenzo Stoakes @ 2025-04-28 19:12 ` Liam R. Howlett 2025-04-28 23:31 ` Suren Baghdasaryan 2025-04-29 7:12 ` Vlastimil Babka ` (2 subsequent siblings) 3 siblings, 1 reply; 36+ messages in thread From: Liam R. Howlett @ 2025-04-28 19:12 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250428 11:28]: > This is a key step in our being able to abstract and isolate VMA allocation > and destruction logic. > > This function is the last one where vm_area_free() and vm_area_dup() are > directly referenced outside of mmap, so having this in mm allows us to > isolate these. > > We do the same for the nommu version which is substantially simpler. > > We place the declaration for dup_mmap() in mm/internal.h and have > kernel/fork.c import this in order to prevent improper use of this > functionality elsewhere in the kernel. > > While we're here, we remove the useless #ifdef CONFIG_MMU check around > mmap_read_lock_maybe_expand() in mmap.c, mmap.c is compiled only if > CONFIG_MMU is set. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Suggested-by: Pedro Falcato <pfalcato@suse.de> > Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> > --- > kernel/fork.c | 189 ++------------------------------------------------ > mm/internal.h | 2 + > mm/mmap.c | 181 +++++++++++++++++++++++++++++++++++++++++++++-- > mm/nommu.c | 8 +++ > 4 files changed, 189 insertions(+), 191 deletions(-) > > diff --git a/kernel/fork.c b/kernel/fork.c > index 168681fc4b25..ac9f9267a473 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -112,6 +112,9 @@ > #include <asm/cacheflush.h> > #include <asm/tlbflush.h> > > +/* For dup_mmap(). */ > +#include "../mm/internal.h" > + > #include <trace/events/sched.h> > > #define CREATE_TRACE_POINTS > @@ -589,7 +592,7 @@ void free_task(struct task_struct *tsk) > } > EXPORT_SYMBOL(free_task); > > -static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) > +void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) > { > struct file *exe_file; > > @@ -604,183 +607,6 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) > } > > #ifdef CONFIG_MMU > -static __latent_entropy int dup_mmap(struct mm_struct *mm, > - struct mm_struct *oldmm) > -{ > - struct vm_area_struct *mpnt, *tmp; > - int retval; > - unsigned long charge = 0; > - LIST_HEAD(uf); > - VMA_ITERATOR(vmi, mm, 0); > - > - if (mmap_write_lock_killable(oldmm)) > - return -EINTR; > - flush_cache_dup_mm(oldmm); > - uprobe_dup_mmap(oldmm, mm); > - /* > - * Not linked in yet - no deadlock potential: > - */ > - mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING); > - > - /* No ordering required: file already has been exposed. */ > - dup_mm_exe_file(mm, oldmm); > - > - mm->total_vm = oldmm->total_vm; > - mm->data_vm = oldmm->data_vm; > - mm->exec_vm = oldmm->exec_vm; > - mm->stack_vm = oldmm->stack_vm; > - > - /* Use __mt_dup() to efficiently build an identical maple tree. 
*/ > - retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL); > - if (unlikely(retval)) > - goto out; > - > - mt_clear_in_rcu(vmi.mas.tree); > - for_each_vma(vmi, mpnt) { > - struct file *file; > - > - vma_start_write(mpnt); > - if (mpnt->vm_flags & VM_DONTCOPY) { > - retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > - mpnt->vm_end, GFP_KERNEL); > - if (retval) > - goto loop_out; > - > - vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); > - continue; > - } > - charge = 0; > - /* > - * Don't duplicate many vmas if we've been oom-killed (for > - * example) > - */ > - if (fatal_signal_pending(current)) { > - retval = -EINTR; > - goto loop_out; > - } > - if (mpnt->vm_flags & VM_ACCOUNT) { > - unsigned long len = vma_pages(mpnt); > - > - if (security_vm_enough_memory_mm(oldmm, len)) /* sic */ > - goto fail_nomem; > - charge = len; > - } > - tmp = vm_area_dup(mpnt); > - if (!tmp) > - goto fail_nomem; > - > - /* track_pfn_copy() will later take care of copying internal state. */ > - if (unlikely(tmp->vm_flags & VM_PFNMAP)) > - untrack_pfn_clear(tmp); > - > - retval = vma_dup_policy(mpnt, tmp); > - if (retval) > - goto fail_nomem_policy; > - tmp->vm_mm = mm; > - retval = dup_userfaultfd(tmp, &uf); > - if (retval) > - goto fail_nomem_anon_vma_fork; > - if (tmp->vm_flags & VM_WIPEONFORK) { > - /* > - * VM_WIPEONFORK gets a clean slate in the child. > - * Don't prepare anon_vma until fault since we don't > - * copy page for current vma. > - */ > - tmp->anon_vma = NULL; > - } else if (anon_vma_fork(tmp, mpnt)) > - goto fail_nomem_anon_vma_fork; > - vm_flags_clear(tmp, VM_LOCKED_MASK); > - /* > - * Copy/update hugetlb private vma information. > - */ > - if (is_vm_hugetlb_page(tmp)) > - hugetlb_dup_vma_private(tmp); > - > - /* > - * Link the vma into the MT. After using __mt_dup(), memory > - * allocation is not necessary here, so it cannot fail. > - */ > - vma_iter_bulk_store(&vmi, tmp); > - > - mm->map_count++; > - > - if (tmp->vm_ops && tmp->vm_ops->open) > - tmp->vm_ops->open(tmp); > - > - file = tmp->vm_file; > - if (file) { > - struct address_space *mapping = file->f_mapping; > - > - get_file(file); > - i_mmap_lock_write(mapping); > - if (vma_is_shared_maywrite(tmp)) > - mapping_allow_writable(mapping); > - flush_dcache_mmap_lock(mapping); > - /* insert tmp into the share list, just after mpnt */ > - vma_interval_tree_insert_after(tmp, mpnt, > - &mapping->i_mmap); > - flush_dcache_mmap_unlock(mapping); > - i_mmap_unlock_write(mapping); > - } > - > - if (!(tmp->vm_flags & VM_WIPEONFORK)) > - retval = copy_page_range(tmp, mpnt); > - > - if (retval) { > - mpnt = vma_next(&vmi); > - goto loop_out; > - } > - } > - /* a new mm has just been created */ > - retval = arch_dup_mmap(oldmm, mm); > -loop_out: > - vma_iter_free(&vmi); > - if (!retval) { > - mt_set_in_rcu(vmi.mas.tree); > - ksm_fork(mm, oldmm); > - khugepaged_fork(mm, oldmm); > - } else { > - > - /* > - * The entire maple tree has already been duplicated. If the > - * mmap duplication fails, mark the failure point with > - * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, > - * stop releasing VMAs that have not been duplicated after this > - * point. > - */ > - if (mpnt) { > - mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1); > - mas_store(&vmi.mas, XA_ZERO_ENTRY); > - /* Avoid OOM iterating a broken tree */ > - set_bit(MMF_OOM_SKIP, &mm->flags); > - } > - /* > - * The mm_struct is going to exit, but the locks will be dropped > - * first. 
Set the mm_struct as unstable is advisable as it is > - * not fully initialised. > - */ > - set_bit(MMF_UNSTABLE, &mm->flags); > - } > -out: > - mmap_write_unlock(mm); > - flush_tlb_mm(oldmm); > - mmap_write_unlock(oldmm); > - if (!retval) > - dup_userfaultfd_complete(&uf); > - else > - dup_userfaultfd_fail(&uf); > - return retval; > - > -fail_nomem_anon_vma_fork: > - mpol_put(vma_policy(tmp)); > -fail_nomem_policy: > - vm_area_free(tmp); > -fail_nomem: > - retval = -ENOMEM; > - vm_unacct_memory(charge); > - goto loop_out; > -} > - > static inline int mm_alloc_pgd(struct mm_struct *mm) > { > mm->pgd = pgd_alloc(mm); > @@ -794,13 +620,6 @@ static inline void mm_free_pgd(struct mm_struct *mm) > pgd_free(mm, mm->pgd); > } > #else > -static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) > -{ > - mmap_write_lock(oldmm); > - dup_mm_exe_file(mm, oldmm); > - mmap_write_unlock(oldmm); > - return 0; > -} > #define mm_alloc_pgd(mm) (0) > #define mm_free_pgd(mm) > #endif /* CONFIG_MMU */ > diff --git a/mm/internal.h b/mm/internal.h > index 40464f755092..b3e011976f74 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -1631,5 +1631,7 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end, > } > #endif /* CONFIG_PT_RECLAIM */ > > +void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm); > +int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm); > > #endif /* __MM_INTERNAL_H */ > diff --git a/mm/mmap.c b/mm/mmap.c > index 9e09eac0021c..5259df031e15 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1675,7 +1675,6 @@ static int __meminit init_reserve_notifier(void) > } > subsys_initcall(init_reserve_notifier); > > -#ifdef CONFIG_MMU > /* > * Obtain a read lock on mm->mmap_lock, if the specified address is below the > * start of the VMA, the intent is to perform a write, and it is a > @@ -1719,10 +1718,180 @@ bool mmap_read_lock_maybe_expand(struct mm_struct *mm, > mmap_write_downgrade(mm); > return true; > } > -#else > -bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma, > - unsigned long addr, bool write) > + > +__latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) > { > - return false; > + struct vm_area_struct *mpnt, *tmp; > + int retval; > + unsigned long charge = 0; > + LIST_HEAD(uf); > + VMA_ITERATOR(vmi, mm, 0); > + > + if (mmap_write_lock_killable(oldmm)) > + return -EINTR; > + flush_cache_dup_mm(oldmm); > + uprobe_dup_mmap(oldmm, mm); > + /* > + * Not linked in yet - no deadlock potential: > + */ > + mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING); > + > + /* No ordering required: file already has been exposed. */ > + dup_mm_exe_file(mm, oldmm); > + > + mm->total_vm = oldmm->total_vm; > + mm->data_vm = oldmm->data_vm; > + mm->exec_vm = oldmm->exec_vm; > + mm->stack_vm = oldmm->stack_vm; > + > + /* Use __mt_dup() to efficiently build an identical maple tree. 
*/ > + retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL); > + if (unlikely(retval)) > + goto out; > + > + mt_clear_in_rcu(vmi.mas.tree); > + for_each_vma(vmi, mpnt) { > + struct file *file; > + > + vma_start_write(mpnt); > + if (mpnt->vm_flags & VM_DONTCOPY) { > + retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > + mpnt->vm_end, GFP_KERNEL); > + if (retval) > + goto loop_out; > + > + vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); > + continue; > + } > + charge = 0; > + /* > + * Don't duplicate many vmas if we've been oom-killed (for > + * example) > + */ > + if (fatal_signal_pending(current)) { > + retval = -EINTR; > + goto loop_out; > + } > + if (mpnt->vm_flags & VM_ACCOUNT) { > + unsigned long len = vma_pages(mpnt); > + > + if (security_vm_enough_memory_mm(oldmm, len)) /* sic */ > + goto fail_nomem; > + charge = len; > + } > + > + tmp = vm_area_dup(mpnt); > + if (!tmp) > + goto fail_nomem; > + > + /* track_pfn_copy() will later take care of copying internal state. */ > + if (unlikely(tmp->vm_flags & VM_PFNMAP)) > + untrack_pfn_clear(tmp); > + > + retval = vma_dup_policy(mpnt, tmp); > + if (retval) > + goto fail_nomem_policy; > + tmp->vm_mm = mm; > + retval = dup_userfaultfd(tmp, &uf); > + if (retval) > + goto fail_nomem_anon_vma_fork; > + if (tmp->vm_flags & VM_WIPEONFORK) { > + /* > + * VM_WIPEONFORK gets a clean slate in the child. > + * Don't prepare anon_vma until fault since we don't > + * copy page for current vma. > + */ > + tmp->anon_vma = NULL; > + } else if (anon_vma_fork(tmp, mpnt)) > + goto fail_nomem_anon_vma_fork; > + vm_flags_clear(tmp, VM_LOCKED_MASK); > + /* > + * Copy/update hugetlb private vma information. > + */ > + if (is_vm_hugetlb_page(tmp)) > + hugetlb_dup_vma_private(tmp); > + > + /* > + * Link the vma into the MT. After using __mt_dup(), memory > + * allocation is not necessary here, so it cannot fail. > + */ > + vma_iter_bulk_store(&vmi, tmp); > + > + mm->map_count++; > + > + if (tmp->vm_ops && tmp->vm_ops->open) > + tmp->vm_ops->open(tmp); > + > + file = tmp->vm_file; > + if (file) { > + struct address_space *mapping = file->f_mapping; > + > + get_file(file); > + i_mmap_lock_write(mapping); > + if (vma_is_shared_maywrite(tmp)) > + mapping_allow_writable(mapping); > + flush_dcache_mmap_lock(mapping); > + /* insert tmp into the share list, just after mpnt */ > + vma_interval_tree_insert_after(tmp, mpnt, > + &mapping->i_mmap); > + flush_dcache_mmap_unlock(mapping); > + i_mmap_unlock_write(mapping); > + } > + > + if (!(tmp->vm_flags & VM_WIPEONFORK)) > + retval = copy_page_range(tmp, mpnt); > + > + if (retval) { > + mpnt = vma_next(&vmi); > + goto loop_out; > + } > + } > + /* a new mm has just been created */ > + retval = arch_dup_mmap(oldmm, mm); > +loop_out: > + vma_iter_free(&vmi); > + if (!retval) { > + mt_set_in_rcu(vmi.mas.tree); > + ksm_fork(mm, oldmm); > + khugepaged_fork(mm, oldmm); > + } else { > + > + /* > + * The entire maple tree has already been duplicated. If the > + * mmap duplication fails, mark the failure point with > + * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, > + * stop releasing VMAs that have not been duplicated after this > + * point. > + */ > + if (mpnt) { > + mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1); > + mas_store(&vmi.mas, XA_ZERO_ENTRY); > + /* Avoid OOM iterating a broken tree */ > + set_bit(MMF_OOM_SKIP, &mm->flags); > + } > + /* > + * The mm_struct is going to exit, but the locks will be dropped > + * first. 
Set the mm_struct as unstable is advisable as it is > + * not fully initialised. > + */ > + set_bit(MMF_UNSTABLE, &mm->flags); > + } > +out: > + mmap_write_unlock(mm); > + flush_tlb_mm(oldmm); > + mmap_write_unlock(oldmm); > + if (!retval) > + dup_userfaultfd_complete(&uf); > + else > + dup_userfaultfd_fail(&uf); > + return retval; > + > +fail_nomem_anon_vma_fork: > + mpol_put(vma_policy(tmp)); > +fail_nomem_policy: > + vm_area_free(tmp); > +fail_nomem: > + retval = -ENOMEM; > + vm_unacct_memory(charge); > + goto loop_out; > } > -#endif > diff --git a/mm/nommu.c b/mm/nommu.c > index 2b4d304c6445..a142fc258d39 100644 > --- a/mm/nommu.c > +++ b/mm/nommu.c > @@ -1874,3 +1874,11 @@ static int __meminit init_admin_reserve(void) > return 0; > } > subsys_initcall(init_admin_reserve); > + > +int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) > +{ > + mmap_write_lock(oldmm); > + dup_mm_exe_file(mm, oldmm); > + mmap_write_unlock(oldmm); > + return 0; > +} > -- > 2.49.0 > ^ permalink raw reply [flat|nested] 36+ messages in thread
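A note on how the relocated dup_mmap() is consumed: with the declaration confined to mm/internal.h, kernel/fork.c reaches it only through the "../mm/internal.h" include added by this patch, and any other caller elsewhere in the kernel would fail to build. The fork-side call site itself is not part of this diff, so the snippet below is only an illustrative sketch of that caller (the function name is hypothetical), not the exact fork.c code.

	#include "../mm/internal.h"	/* declares dup_mmap() and dup_mm_exe_file() */

	/* Hypothetical caller, modelled on dup_mm() in kernel/fork.c. */
	static int example_dup_mm(struct mm_struct *mm, struct mm_struct *oldmm)
	{
		int err;

		/* Returns 0 on success or a negative errno such as -EINTR or -ENOMEM. */
		err = dup_mmap(mm, oldmm);
		if (err)
			return err;

		return 0;
	}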
* Re: [PATCH v3 3/4] mm: move dup_mmap() to mm 2025-04-28 19:12 ` Liam R. Howlett @ 2025-04-28 23:31 ` Suren Baghdasaryan 0 siblings, 0 replies; 36+ messages in thread From: Suren Baghdasaryan @ 2025-04-28 23:31 UTC (permalink / raw) To: Liam R. Howlett, Lorenzo Stoakes, Andrew Morton, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Mon, Apr 28, 2025 at 12:13 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250428 11:28]: > > This is a key step in our being able to abstract and isolate VMA allocation > > and destruction logic. > > > > This function is the last one where vm_area_free() and vm_area_dup() are > > directly referenced outside of mmap, so having this in mm allows us to > > isolate these. > > > > We do the same for the nommu version which is substantially simpler. > > > > We place the declaration for dup_mmap() in mm/internal.h and have > > kernel/fork.c import this in order to prevent improper use of this > > functionality elsewhere in the kernel. > > > > While we're here, we remove the useless #ifdef CONFIG_MMU check around > > mmap_read_lock_maybe_expand() in mmap.c, mmap.c is compiled only if > > CONFIG_MMU is set. > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > Suggested-by: Pedro Falcato <pfalcato@suse.de> > > Reviewed-by: Pedro Falcato <pfalcato@suse.de> > > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> > > > --- > > kernel/fork.c | 189 ++------------------------------------------------ > > mm/internal.h | 2 + > > mm/mmap.c | 181 +++++++++++++++++++++++++++++++++++++++++++++-- > > mm/nommu.c | 8 +++ > > 4 files changed, 189 insertions(+), 191 deletions(-) > > > > diff --git a/kernel/fork.c b/kernel/fork.c > > index 168681fc4b25..ac9f9267a473 100644 > > --- a/kernel/fork.c > > +++ b/kernel/fork.c > > @@ -112,6 +112,9 @@ > > #include <asm/cacheflush.h> > > #include <asm/tlbflush.h> > > > > +/* For dup_mmap(). */ > > +#include "../mm/internal.h" > > + > > #include <trace/events/sched.h> > > > > #define CREATE_TRACE_POINTS > > @@ -589,7 +592,7 @@ void free_task(struct task_struct *tsk) > > } > > EXPORT_SYMBOL(free_task); > > > > -static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) > > +void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) > > { > > struct file *exe_file; > > > > @@ -604,183 +607,6 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) > > } > > > > #ifdef CONFIG_MMU > > -static __latent_entropy int dup_mmap(struct mm_struct *mm, > > - struct mm_struct *oldmm) > > -{ > > - struct vm_area_struct *mpnt, *tmp; > > - int retval; > > - unsigned long charge = 0; > > - LIST_HEAD(uf); > > - VMA_ITERATOR(vmi, mm, 0); > > - > > - if (mmap_write_lock_killable(oldmm)) > > - return -EINTR; > > - flush_cache_dup_mm(oldmm); > > - uprobe_dup_mmap(oldmm, mm); > > - /* > > - * Not linked in yet - no deadlock potential: > > - */ > > - mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING); > > - > > - /* No ordering required: file already has been exposed. */ > > - dup_mm_exe_file(mm, oldmm); > > - > > - mm->total_vm = oldmm->total_vm; > > - mm->data_vm = oldmm->data_vm; > > - mm->exec_vm = oldmm->exec_vm; > > - mm->stack_vm = oldmm->stack_vm; > > - > > - /* Use __mt_dup() to efficiently build an identical maple tree. 
*/ > > - retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL); > > - if (unlikely(retval)) > > - goto out; > > - > > - mt_clear_in_rcu(vmi.mas.tree); > > - for_each_vma(vmi, mpnt) { > > - struct file *file; > > - > > - vma_start_write(mpnt); > > - if (mpnt->vm_flags & VM_DONTCOPY) { > > - retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > - mpnt->vm_end, GFP_KERNEL); > > - if (retval) > > - goto loop_out; > > - > > - vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); > > - continue; > > - } > > - charge = 0; > > - /* > > - * Don't duplicate many vmas if we've been oom-killed (for > > - * example) > > - */ > > - if (fatal_signal_pending(current)) { > > - retval = -EINTR; > > - goto loop_out; > > - } > > - if (mpnt->vm_flags & VM_ACCOUNT) { > > - unsigned long len = vma_pages(mpnt); > > - > > - if (security_vm_enough_memory_mm(oldmm, len)) /* sic */ > > - goto fail_nomem; > > - charge = len; > > - } > > - tmp = vm_area_dup(mpnt); > > - if (!tmp) > > - goto fail_nomem; > > - > > - /* track_pfn_copy() will later take care of copying internal state. */ > > - if (unlikely(tmp->vm_flags & VM_PFNMAP)) > > - untrack_pfn_clear(tmp); > > - > > - retval = vma_dup_policy(mpnt, tmp); > > - if (retval) > > - goto fail_nomem_policy; > > - tmp->vm_mm = mm; > > - retval = dup_userfaultfd(tmp, &uf); > > - if (retval) > > - goto fail_nomem_anon_vma_fork; > > - if (tmp->vm_flags & VM_WIPEONFORK) { > > - /* > > - * VM_WIPEONFORK gets a clean slate in the child. > > - * Don't prepare anon_vma until fault since we don't > > - * copy page for current vma. > > - */ > > - tmp->anon_vma = NULL; > > - } else if (anon_vma_fork(tmp, mpnt)) > > - goto fail_nomem_anon_vma_fork; > > - vm_flags_clear(tmp, VM_LOCKED_MASK); > > - /* > > - * Copy/update hugetlb private vma information. > > - */ > > - if (is_vm_hugetlb_page(tmp)) > > - hugetlb_dup_vma_private(tmp); > > - > > - /* > > - * Link the vma into the MT. After using __mt_dup(), memory > > - * allocation is not necessary here, so it cannot fail. > > - */ > > - vma_iter_bulk_store(&vmi, tmp); > > - > > - mm->map_count++; > > - > > - if (tmp->vm_ops && tmp->vm_ops->open) > > - tmp->vm_ops->open(tmp); > > - > > - file = tmp->vm_file; > > - if (file) { > > - struct address_space *mapping = file->f_mapping; > > - > > - get_file(file); > > - i_mmap_lock_write(mapping); > > - if (vma_is_shared_maywrite(tmp)) > > - mapping_allow_writable(mapping); > > - flush_dcache_mmap_lock(mapping); > > - /* insert tmp into the share list, just after mpnt */ > > - vma_interval_tree_insert_after(tmp, mpnt, > > - &mapping->i_mmap); > > - flush_dcache_mmap_unlock(mapping); > > - i_mmap_unlock_write(mapping); > > - } > > - > > - if (!(tmp->vm_flags & VM_WIPEONFORK)) > > - retval = copy_page_range(tmp, mpnt); > > - > > - if (retval) { > > - mpnt = vma_next(&vmi); > > - goto loop_out; > > - } > > - } > > - /* a new mm has just been created */ > > - retval = arch_dup_mmap(oldmm, mm); > > -loop_out: > > - vma_iter_free(&vmi); > > - if (!retval) { > > - mt_set_in_rcu(vmi.mas.tree); > > - ksm_fork(mm, oldmm); > > - khugepaged_fork(mm, oldmm); > > - } else { > > - > > - /* > > - * The entire maple tree has already been duplicated. If the > > - * mmap duplication fails, mark the failure point with > > - * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, > > - * stop releasing VMAs that have not been duplicated after this > > - * point. 
> > - */ > > - if (mpnt) { > > - mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1); > > - mas_store(&vmi.mas, XA_ZERO_ENTRY); > > - /* Avoid OOM iterating a broken tree */ > > - set_bit(MMF_OOM_SKIP, &mm->flags); > > - } > > - /* > > - * The mm_struct is going to exit, but the locks will be dropped > > - * first. Set the mm_struct as unstable is advisable as it is > > - * not fully initialised. > > - */ > > - set_bit(MMF_UNSTABLE, &mm->flags); > > - } > > -out: > > - mmap_write_unlock(mm); > > - flush_tlb_mm(oldmm); > > - mmap_write_unlock(oldmm); > > - if (!retval) > > - dup_userfaultfd_complete(&uf); > > - else > > - dup_userfaultfd_fail(&uf); > > - return retval; > > - > > -fail_nomem_anon_vma_fork: > > - mpol_put(vma_policy(tmp)); > > -fail_nomem_policy: > > - vm_area_free(tmp); > > -fail_nomem: > > - retval = -ENOMEM; > > - vm_unacct_memory(charge); > > - goto loop_out; > > -} > > - > > static inline int mm_alloc_pgd(struct mm_struct *mm) > > { > > mm->pgd = pgd_alloc(mm); > > @@ -794,13 +620,6 @@ static inline void mm_free_pgd(struct mm_struct *mm) > > pgd_free(mm, mm->pgd); > > } > > #else > > -static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) > > -{ > > - mmap_write_lock(oldmm); > > - dup_mm_exe_file(mm, oldmm); > > - mmap_write_unlock(oldmm); > > - return 0; > > -} > > #define mm_alloc_pgd(mm) (0) > > #define mm_free_pgd(mm) > > #endif /* CONFIG_MMU */ > > diff --git a/mm/internal.h b/mm/internal.h > > index 40464f755092..b3e011976f74 100644 > > --- a/mm/internal.h > > +++ b/mm/internal.h > > @@ -1631,5 +1631,7 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end, > > } > > #endif /* CONFIG_PT_RECLAIM */ > > > > +void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm); > > +int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm); > > > > #endif /* __MM_INTERNAL_H */ > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 9e09eac0021c..5259df031e15 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -1675,7 +1675,6 @@ static int __meminit init_reserve_notifier(void) > > } > > subsys_initcall(init_reserve_notifier); > > > > -#ifdef CONFIG_MMU > > /* > > * Obtain a read lock on mm->mmap_lock, if the specified address is below the > > * start of the VMA, the intent is to perform a write, and it is a > > @@ -1719,10 +1718,180 @@ bool mmap_read_lock_maybe_expand(struct mm_struct *mm, > > mmap_write_downgrade(mm); > > return true; > > } > > -#else > > -bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma, > > - unsigned long addr, bool write) > > + > > +__latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) > > { > > - return false; > > + struct vm_area_struct *mpnt, *tmp; > > + int retval; > > + unsigned long charge = 0; > > + LIST_HEAD(uf); > > + VMA_ITERATOR(vmi, mm, 0); > > + > > + if (mmap_write_lock_killable(oldmm)) > > + return -EINTR; > > + flush_cache_dup_mm(oldmm); > > + uprobe_dup_mmap(oldmm, mm); > > + /* > > + * Not linked in yet - no deadlock potential: > > + */ > > + mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING); > > + > > + /* No ordering required: file already has been exposed. */ > > + dup_mm_exe_file(mm, oldmm); > > + > > + mm->total_vm = oldmm->total_vm; > > + mm->data_vm = oldmm->data_vm; > > + mm->exec_vm = oldmm->exec_vm; > > + mm->stack_vm = oldmm->stack_vm; > > + > > + /* Use __mt_dup() to efficiently build an identical maple tree. 
*/ > > + retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL); > > + if (unlikely(retval)) > > + goto out; > > + > > + mt_clear_in_rcu(vmi.mas.tree); > > + for_each_vma(vmi, mpnt) { > > + struct file *file; > > + > > + vma_start_write(mpnt); > > + if (mpnt->vm_flags & VM_DONTCOPY) { > > + retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > + mpnt->vm_end, GFP_KERNEL); > > + if (retval) > > + goto loop_out; > > + > > + vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); > > + continue; > > + } > > + charge = 0; > > + /* > > + * Don't duplicate many vmas if we've been oom-killed (for > > + * example) > > + */ > > + if (fatal_signal_pending(current)) { > > + retval = -EINTR; > > + goto loop_out; > > + } > > + if (mpnt->vm_flags & VM_ACCOUNT) { > > + unsigned long len = vma_pages(mpnt); > > + > > + if (security_vm_enough_memory_mm(oldmm, len)) /* sic */ > > + goto fail_nomem; > > + charge = len; > > + } > > + > > + tmp = vm_area_dup(mpnt); > > + if (!tmp) > > + goto fail_nomem; > > + > > + /* track_pfn_copy() will later take care of copying internal state. */ > > + if (unlikely(tmp->vm_flags & VM_PFNMAP)) > > + untrack_pfn_clear(tmp); > > + > > + retval = vma_dup_policy(mpnt, tmp); > > + if (retval) > > + goto fail_nomem_policy; > > + tmp->vm_mm = mm; > > + retval = dup_userfaultfd(tmp, &uf); > > + if (retval) > > + goto fail_nomem_anon_vma_fork; > > + if (tmp->vm_flags & VM_WIPEONFORK) { > > + /* > > + * VM_WIPEONFORK gets a clean slate in the child. > > + * Don't prepare anon_vma until fault since we don't > > + * copy page for current vma. > > + */ > > + tmp->anon_vma = NULL; > > + } else if (anon_vma_fork(tmp, mpnt)) > > + goto fail_nomem_anon_vma_fork; > > + vm_flags_clear(tmp, VM_LOCKED_MASK); > > + /* > > + * Copy/update hugetlb private vma information. > > + */ > > + if (is_vm_hugetlb_page(tmp)) > > + hugetlb_dup_vma_private(tmp); > > + > > + /* > > + * Link the vma into the MT. After using __mt_dup(), memory > > + * allocation is not necessary here, so it cannot fail. > > + */ > > + vma_iter_bulk_store(&vmi, tmp); > > + > > + mm->map_count++; > > + > > + if (tmp->vm_ops && tmp->vm_ops->open) > > + tmp->vm_ops->open(tmp); > > + > > + file = tmp->vm_file; > > + if (file) { > > + struct address_space *mapping = file->f_mapping; > > + > > + get_file(file); > > + i_mmap_lock_write(mapping); > > + if (vma_is_shared_maywrite(tmp)) > > + mapping_allow_writable(mapping); > > + flush_dcache_mmap_lock(mapping); > > + /* insert tmp into the share list, just after mpnt */ > > + vma_interval_tree_insert_after(tmp, mpnt, > > + &mapping->i_mmap); > > + flush_dcache_mmap_unlock(mapping); > > + i_mmap_unlock_write(mapping); > > + } > > + > > + if (!(tmp->vm_flags & VM_WIPEONFORK)) > > + retval = copy_page_range(tmp, mpnt); > > + > > + if (retval) { > > + mpnt = vma_next(&vmi); > > + goto loop_out; > > + } > > + } > > + /* a new mm has just been created */ > > + retval = arch_dup_mmap(oldmm, mm); > > +loop_out: > > + vma_iter_free(&vmi); > > + if (!retval) { > > + mt_set_in_rcu(vmi.mas.tree); > > + ksm_fork(mm, oldmm); > > + khugepaged_fork(mm, oldmm); > > + } else { > > + > > + /* > > + * The entire maple tree has already been duplicated. If the > > + * mmap duplication fails, mark the failure point with > > + * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, > > + * stop releasing VMAs that have not been duplicated after this > > + * point. 
> > + */ > > + if (mpnt) { > > + mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1); > > + mas_store(&vmi.mas, XA_ZERO_ENTRY); > > + /* Avoid OOM iterating a broken tree */ > > + set_bit(MMF_OOM_SKIP, &mm->flags); > > + } > > + /* > > + * The mm_struct is going to exit, but the locks will be dropped > > + * first. Set the mm_struct as unstable is advisable as it is > > + * not fully initialised. > > + */ > > + set_bit(MMF_UNSTABLE, &mm->flags); > > + } > > +out: > > + mmap_write_unlock(mm); > > + flush_tlb_mm(oldmm); > > + mmap_write_unlock(oldmm); > > + if (!retval) > > + dup_userfaultfd_complete(&uf); > > + else > > + dup_userfaultfd_fail(&uf); > > + return retval; > > + > > +fail_nomem_anon_vma_fork: > > + mpol_put(vma_policy(tmp)); > > +fail_nomem_policy: > > + vm_area_free(tmp); > > +fail_nomem: > > + retval = -ENOMEM; > > + vm_unacct_memory(charge); > > + goto loop_out; > > } > > -#endif > > diff --git a/mm/nommu.c b/mm/nommu.c > > index 2b4d304c6445..a142fc258d39 100644 > > --- a/mm/nommu.c > > +++ b/mm/nommu.c > > @@ -1874,3 +1874,11 @@ static int __meminit init_admin_reserve(void) > > return 0; > > } > > subsys_initcall(init_admin_reserve); > > + > > +int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) > > +{ > > + mmap_write_lock(oldmm); > > + dup_mm_exe_file(mm, oldmm); > > + mmap_write_unlock(oldmm); > > + return 0; > > +} > > -- > > 2.49.0 > > ^ permalink raw reply [flat|nested] 36+ messages in thread
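For completeness, the XA_ZERO_ENTRY marker stored by the failure path above is consumed at teardown time: once the exit path walks into the marker it knows the remaining range was never duplicated and stops releasing VMAs there. The snippet below is only an illustrative sketch of such a consumer loop, with a stub standing in for the real per-VMA teardown; it is not code from this series.

	/* Stand-in for the real per-VMA teardown (unmap, fput, free). */
	static void example_release_one(struct vm_area_struct *vma)
	{
	}

	/* Illustrative teardown loop honouring the dup_mmap() failure marker. */
	static void example_release_vmas(struct mm_struct *mm)
	{
		VMA_ITERATOR(vmi, mm, 0);
		struct vm_area_struct *vma;

		for_each_vma(vmi, vma) {
			/* dup_mmap() stopped here; later ranges were never duplicated. */
			if (xa_is_zero(vma))
				break;

			example_release_one(vma);
		}
	}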
* Re: [PATCH v3 3/4] mm: move dup_mmap() to mm 2025-04-28 15:28 ` [PATCH v3 3/4] mm: move dup_mmap() to mm Lorenzo Stoakes 2025-04-28 19:12 ` Liam R. Howlett @ 2025-04-29 7:12 ` Vlastimil Babka 2025-04-29 16:54 ` Kees Cook 2025-04-29 17:23 ` David Hildenbrand 3 siblings, 0 replies; 36+ messages in thread From: Vlastimil Babka @ 2025-04-29 7:12 UTC (permalink / raw) To: Lorenzo Stoakes, Andrew Morton Cc: Liam R . Howlett, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On 4/28/25 17:28, Lorenzo Stoakes wrote: > This is a key step in our being able to abstract and isolate VMA allocation > and destruction logic. > > This function is the last one where vm_area_free() and vm_area_dup() are > directly referenced outside of mmap, so having this in mm allows us to > isolate these. > > We do the same for the nommu version which is substantially simpler. > > We place the declaration for dup_mmap() in mm/internal.h and have > kernel/fork.c import this in order to prevent improper use of this > functionality elsewhere in the kernel. > > While we're here, we remove the useless #ifdef CONFIG_MMU check around > mmap_read_lock_maybe_expand() in mmap.c, mmap.c is compiled only if > CONFIG_MMU is set. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Suggested-by: Pedro Falcato <pfalcato@suse.de> > Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 3/4] mm: move dup_mmap() to mm 2025-04-28 15:28 ` [PATCH v3 3/4] mm: move dup_mmap() to mm Lorenzo Stoakes 2025-04-28 19:12 ` Liam R. Howlett 2025-04-29 7:12 ` Vlastimil Babka @ 2025-04-29 16:54 ` Kees Cook 2025-04-29 17:23 ` David Hildenbrand 3 siblings, 0 replies; 36+ messages in thread From: Kees Cook @ 2025-04-29 16:54 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Mon, Apr 28, 2025 at 04:28:16PM +0100, Lorenzo Stoakes wrote: > This is a key step in our being able to abstract and isolate VMA allocation > and destruction logic. > > This function is the last one where vm_area_free() and vm_area_dup() are > directly referenced outside of mmap, so having this in mm allows us to > isolate these. > > We do the same for the nommu version which is substantially simpler. > > We place the declaration for dup_mmap() in mm/internal.h and have > kernel/fork.c import this in order to prevent improper use of this > functionality elsewhere in the kernel. > > While we're here, we remove the useless #ifdef CONFIG_MMU check around > mmap_read_lock_maybe_expand() in mmap.c, mmap.c is compiled only if > CONFIG_MMU is set. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Kees Cook <kees@kernel.org> -- Kees Cook ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 3/4] mm: move dup_mmap() to mm 2025-04-28 15:28 ` [PATCH v3 3/4] mm: move dup_mmap() to mm Lorenzo Stoakes ` (2 preceding siblings ...) 2025-04-29 16:54 ` Kees Cook @ 2025-04-29 17:23 ` David Hildenbrand 3 siblings, 0 replies; 36+ messages in thread From: David Hildenbrand @ 2025-04-29 17:23 UTC (permalink / raw) To: Lorenzo Stoakes, Andrew Morton Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On 28.04.25 17:28, Lorenzo Stoakes wrote: > This is a key step in our being able to abstract and isolate VMA allocation > and destruction logic. > > This function is the last one where vm_area_free() and vm_area_dup() are > directly referenced outside of mmap, so having this in mm allows us to > isolate these. > > We do the same for the nommu version which is substantially simpler. > > We place the declaration for dup_mmap() in mm/internal.h and have > kernel/fork.c import this in order to prevent improper use of this > functionality elsewhere in the kernel. > > While we're here, we remove the useless #ifdef CONFIG_MMU check around > mmap_read_lock_maybe_expand() in mmap.c, mmap.c is compiled only if > CONFIG_MMU is set. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Suggested-by: Pedro Falcato <pfalcato@suse.de> > Reviewed-by: Pedro Falcato <pfalcato@suse.de> > --- Reviewed-by: David Hildenbrand <david@redhat.com> -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm 2025-04-28 15:28 [PATCH v3 0/4] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes ` (2 preceding siblings ...) 2025-04-28 15:28 ` [PATCH v3 3/4] mm: move dup_mmap() to mm Lorenzo Stoakes @ 2025-04-28 15:28 ` Lorenzo Stoakes 2025-04-28 19:14 ` Liam R. Howlett ` (6 more replies) 2025-04-29 7:28 ` [PATCH v3 0/4] move all VMA allocation, freeing and duplication logic to mm Vlastimil Babka 4 siblings, 7 replies; 36+ messages in thread From: Lorenzo Stoakes @ 2025-04-28 15:28 UTC (permalink / raw) To: Andrew Morton Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel Right now these are performed in kernel/fork.c which is odd and a violation of separation of concerns, as well as preventing us from integrating this and related logic into userland VMA testing going forward, and perhaps more importantly - enabling us to, in a subsequent commit, make VMA allocation/freeing a purely internal mm operation. There is a fly in the ointment - nommu - mmap.c is not compiled if CONFIG_MMU not set, and neither is vma.c. To square the circle, let's add a new file - vma_init.c. This will be compiled for both CONFIG_MMU and nommu builds, and will also form part of the VMA userland testing. This allows us to de-duplicate code, while maintaining separation of concerns and the ability for us to userland test this logic. Update the VMA userland tests accordingly, additionally adding a detach_free_vma() helper function to correctly detach VMAs before freeing them in test code, as this change was triggering the assert for this. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> --- MAINTAINERS | 1 + kernel/fork.c | 88 ------------------- mm/Makefile | 2 +- mm/mmap.c | 3 +- mm/nommu.c | 4 +- mm/vma.h | 7 ++ mm/vma_init.c | 101 ++++++++++++++++++++++ tools/testing/vma/Makefile | 2 +- tools/testing/vma/vma.c | 26 ++++-- tools/testing/vma/vma_internal.h | 143 +++++++++++++++++++++++++------ 10 files changed, 251 insertions(+), 126 deletions(-) create mode 100644 mm/vma_init.c diff --git a/MAINTAINERS b/MAINTAINERS index 1ee1c22e6e36..d274e6802ba5 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15656,6 +15656,7 @@ F: mm/mseal.c F: mm/vma.c F: mm/vma.h F: mm/vma_exec.c +F: mm/vma_init.c F: mm/vma_internal.h F: tools/testing/selftests/mm/merge.c F: tools/testing/vma/ diff --git a/kernel/fork.c b/kernel/fork.c index ac9f9267a473..9e4616dacd82 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -431,88 +431,9 @@ struct kmem_cache *files_cachep; /* SLAB cache for fs_struct structures (tsk->fs) */ struct kmem_cache *fs_cachep; -/* SLAB cache for vm_area_struct structures */ -static struct kmem_cache *vm_area_cachep; - /* SLAB cache for mm_struct structures (tsk->mm) */ static struct kmem_cache *mm_cachep; -struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) -{ - struct vm_area_struct *vma; - - vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); - if (!vma) - return NULL; - - vma_init(vma, mm); - - return vma; -} - -static void vm_area_init_from(const struct vm_area_struct *src, - struct vm_area_struct *dest) -{ - dest->vm_mm = src->vm_mm; - dest->vm_ops = src->vm_ops; - dest->vm_start = src->vm_start; - dest->vm_end = src->vm_end; - dest->anon_vma = src->anon_vma; - dest->vm_pgoff = src->vm_pgoff; - dest->vm_file = src->vm_file; - dest->vm_private_data = src->vm_private_data; 
- vm_flags_init(dest, src->vm_flags); - memcpy(&dest->vm_page_prot, &src->vm_page_prot, - sizeof(dest->vm_page_prot)); - /* - * src->shared.rb may be modified concurrently when called from - * dup_mmap(), but the clone will reinitialize it. - */ - data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared))); - memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx, - sizeof(dest->vm_userfaultfd_ctx)); -#ifdef CONFIG_ANON_VMA_NAME - dest->anon_name = src->anon_name; -#endif -#ifdef CONFIG_SWAP - memcpy(&dest->swap_readahead_info, &src->swap_readahead_info, - sizeof(dest->swap_readahead_info)); -#endif -#ifndef CONFIG_MMU - dest->vm_region = src->vm_region; -#endif -#ifdef CONFIG_NUMA - dest->vm_policy = src->vm_policy; -#endif -} - -struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) -{ - struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); - - if (!new) - return NULL; - - ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); - ASSERT_EXCLUSIVE_WRITER(orig->vm_file); - vm_area_init_from(orig, new); - vma_lock_init(new, true); - INIT_LIST_HEAD(&new->anon_vma_chain); - vma_numab_state_init(new); - dup_anon_vma_name(orig, new); - - return new; -} - -void vm_area_free(struct vm_area_struct *vma) -{ - /* The vma should be detached while being destroyed. */ - vma_assert_detached(vma); - vma_numab_state_free(vma); - free_anon_vma_name(vma); - kmem_cache_free(vm_area_cachep, vma); -} - static void account_kernel_stack(struct task_struct *tsk, int account) { if (IS_ENABLED(CONFIG_VMAP_STACK)) { @@ -3033,11 +2954,6 @@ void __init mm_cache_init(void) void __init proc_caches_init(void) { - struct kmem_cache_args args = { - .use_freeptr_offset = true, - .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr), - }; - sighand_cachep = kmem_cache_create("sighand_cache", sizeof(struct sighand_struct), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| @@ -3054,10 +2970,6 @@ void __init proc_caches_init(void) sizeof(struct fs_struct), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL); - vm_area_cachep = kmem_cache_create("vm_area_struct", - sizeof(struct vm_area_struct), &args, - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| - SLAB_ACCOUNT); mmap_init(); nsproxy_cache_init(); } diff --git a/mm/Makefile b/mm/Makefile index 15a901bb431a..690ddcf7d9a1 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -55,7 +55,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ mm_init.o percpu.o slab_common.o \ compaction.o show_mem.o \ interval_tree.o list_lru.o workingset.o \ - debug.o gup.o mmap_lock.o $(mmu-y) + debug.o gup.o mmap_lock.o vma_init.o $(mmu-y) # Give 'page_alloc' its own module-parameter namespace page-alloc-y := page_alloc.o diff --git a/mm/mmap.c b/mm/mmap.c index 5259df031e15..81dd962a1cfc 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1554,7 +1554,7 @@ static const struct ctl_table mmap_table[] = { #endif /* CONFIG_SYSCTL */ /* - * initialise the percpu counter for VM + * initialise the percpu counter for VM, initialise VMA state. */ void __init mmap_init(void) { @@ -1565,6 +1565,7 @@ void __init mmap_init(void) #ifdef CONFIG_SYSCTL register_sysctl_init("vm", mmap_table); #endif + vma_state_init(); } /* diff --git a/mm/nommu.c b/mm/nommu.c index a142fc258d39..0bf4849b8204 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -399,7 +399,8 @@ static const struct ctl_table nommu_table[] = { }; /* - * initialise the percpu counter for VM and region record slabs + * initialise the percpu counter for VM and region record slabs, initialise VMA + * state. 
*/ void __init mmap_init(void) { @@ -409,6 +410,7 @@ void __init mmap_init(void) VM_BUG_ON(ret); vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC|SLAB_ACCOUNT); register_sysctl_init("vm", nommu_table); + vma_state_init(); } /* diff --git a/mm/vma.h b/mm/vma.h index 94307a2e4ab6..4a1e1768ca46 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -548,8 +548,15 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address); int __vm_munmap(unsigned long start, size_t len, bool unlock); + int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma); +/* vma_init.h, shared between CONFIG_MMU and nommu. */ +void __init vma_state_init(void); +struct vm_area_struct *vm_area_alloc(struct mm_struct *mm); +struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig); +void vm_area_free(struct vm_area_struct *vma); + /* vma_exec.h */ #ifdef CONFIG_MMU int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap, diff --git a/mm/vma_init.c b/mm/vma_init.c new file mode 100644 index 000000000000..967ca8517986 --- /dev/null +++ b/mm/vma_init.c @@ -0,0 +1,101 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* + * Functions for initialisaing, allocating, freeing and duplicating VMAs. Shared + * between CONFIG_MMU and non-CONFIG_MMU kernel configurations. + */ + +#include "vma_internal.h" +#include "vma.h" + +/* SLAB cache for vm_area_struct structures */ +static struct kmem_cache *vm_area_cachep; + +void __init vma_state_init(void) +{ + struct kmem_cache_args args = { + .use_freeptr_offset = true, + .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr), + }; + + vm_area_cachep = kmem_cache_create("vm_area_struct", + sizeof(struct vm_area_struct), &args, + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| + SLAB_ACCOUNT); +} + +struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) +{ + struct vm_area_struct *vma; + + vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); + if (!vma) + return NULL; + + vma_init(vma, mm); + + return vma; +} + +static void vm_area_init_from(const struct vm_area_struct *src, + struct vm_area_struct *dest) +{ + dest->vm_mm = src->vm_mm; + dest->vm_ops = src->vm_ops; + dest->vm_start = src->vm_start; + dest->vm_end = src->vm_end; + dest->anon_vma = src->anon_vma; + dest->vm_pgoff = src->vm_pgoff; + dest->vm_file = src->vm_file; + dest->vm_private_data = src->vm_private_data; + vm_flags_init(dest, src->vm_flags); + memcpy(&dest->vm_page_prot, &src->vm_page_prot, + sizeof(dest->vm_page_prot)); + /* + * src->shared.rb may be modified concurrently when called from + * dup_mmap(), but the clone will reinitialize it. 
+ */ + data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared))); + memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx, + sizeof(dest->vm_userfaultfd_ctx)); +#ifdef CONFIG_ANON_VMA_NAME + dest->anon_name = src->anon_name; +#endif +#ifdef CONFIG_SWAP + memcpy(&dest->swap_readahead_info, &src->swap_readahead_info, + sizeof(dest->swap_readahead_info)); +#endif +#ifndef CONFIG_MMU + dest->vm_region = src->vm_region; +#endif +#ifdef CONFIG_NUMA + dest->vm_policy = src->vm_policy; +#endif +} + +struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) +{ + struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); + + if (!new) + return NULL; + + ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); + ASSERT_EXCLUSIVE_WRITER(orig->vm_file); + vm_area_init_from(orig, new); + vma_lock_init(new, true); + INIT_LIST_HEAD(&new->anon_vma_chain); + vma_numab_state_init(new); + dup_anon_vma_name(orig, new); + + return new; +} + +void vm_area_free(struct vm_area_struct *vma) +{ + /* The vma should be detached while being destroyed. */ + vma_assert_detached(vma); + vma_numab_state_free(vma); + free_anon_vma_name(vma); + kmem_cache_free(vm_area_cachep, vma); +} diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile index 624040fcf193..66f3831a668f 100644 --- a/tools/testing/vma/Makefile +++ b/tools/testing/vma/Makefile @@ -9,7 +9,7 @@ include ../shared/shared.mk OFILES = $(SHARED_OFILES) vma.o maple-shim.o TARGETS = vma -vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_exec.c ../../../mm/vma.h +vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_init.c ../../../mm/vma_exec.c ../../../mm/vma.h vma: $(OFILES) $(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS) diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 5832ae5d797d..2be7597a2ac2 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -28,6 +28,7 @@ unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT; * Directly import the VMA implementation here. Our vma_internal.h wrapper * provides userland-equivalent functionality for everything vma.c uses. */ +#include "../../../mm/vma_init.c" #include "../../../mm/vma_exec.c" #include "../../../mm/vma.c" @@ -91,6 +92,12 @@ static int attach_vma(struct mm_struct *mm, struct vm_area_struct *vma) return res; } +static void detach_free_vma(struct vm_area_struct *vma) +{ + vma_mark_detached(vma); + vm_area_free(vma); +} + /* Helper function to allocate a VMA and link it to the tree. 
*/ static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm, unsigned long start, @@ -104,7 +111,7 @@ static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm, return NULL; if (attach_vma(mm, vma)) { - vm_area_free(vma); + detach_free_vma(vma); return NULL; } @@ -249,7 +256,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi) vma_iter_set(vmi, 0); for_each_vma(*vmi, vma) { - vm_area_free(vma); + detach_free_vma(vma); count++; } @@ -319,7 +326,7 @@ static bool test_simple_merge(void) ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_EQ(vma->vm_flags, flags); - vm_area_free(vma); + detach_free_vma(vma); mtree_destroy(&mm.mm_mt); return true; @@ -361,7 +368,7 @@ static bool test_simple_modify(void) ASSERT_EQ(vma->vm_end, 0x1000); ASSERT_EQ(vma->vm_pgoff, 0); - vm_area_free(vma); + detach_free_vma(vma); vma_iter_clear(&vmi); vma = vma_next(&vmi); @@ -370,7 +377,7 @@ static bool test_simple_modify(void) ASSERT_EQ(vma->vm_end, 0x2000); ASSERT_EQ(vma->vm_pgoff, 1); - vm_area_free(vma); + detach_free_vma(vma); vma_iter_clear(&vmi); vma = vma_next(&vmi); @@ -379,7 +386,7 @@ static bool test_simple_modify(void) ASSERT_EQ(vma->vm_end, 0x3000); ASSERT_EQ(vma->vm_pgoff, 2); - vm_area_free(vma); + detach_free_vma(vma); mtree_destroy(&mm.mm_mt); return true; @@ -407,7 +414,7 @@ static bool test_simple_expand(void) ASSERT_EQ(vma->vm_end, 0x3000); ASSERT_EQ(vma->vm_pgoff, 0); - vm_area_free(vma); + detach_free_vma(vma); mtree_destroy(&mm.mm_mt); return true; @@ -428,7 +435,7 @@ static bool test_simple_shrink(void) ASSERT_EQ(vma->vm_end, 0x1000); ASSERT_EQ(vma->vm_pgoff, 0); - vm_area_free(vma); + detach_free_vma(vma); mtree_destroy(&mm.mm_mt); return true; @@ -619,7 +626,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); - vm_area_free(vma); + detach_free_vma(vma); count++; } @@ -1668,6 +1675,7 @@ int main(void) int num_tests = 0, num_fail = 0; maple_tree_init(); + vma_state_init(); #define TEST(name) \ do { \ diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 32e990313158..198abe66de5a 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -155,6 +155,10 @@ typedef __bitwise unsigned int vm_fault_t; */ #define pr_warn_once pr_err +#define data_race(expr) expr + +#define ASSERT_EXCLUSIVE_WRITER(x) + struct kref { refcount_t refcount; }; @@ -255,6 +259,8 @@ struct file { #define VMA_LOCK_OFFSET 0x40000000 +typedef struct { unsigned long v; } freeptr_t; + struct vm_area_struct { /* The first cache line has the info for VMA tree walking. */ @@ -264,9 +270,7 @@ struct vm_area_struct { unsigned long vm_start; unsigned long vm_end; }; -#ifdef CONFIG_PER_VMA_LOCK - struct rcu_head vm_rcu; /* Used for deferred freeing. */ -#endif + freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */ }; struct mm_struct *vm_mm; /* The address space we belong to. */ @@ -463,6 +467,65 @@ struct pagetable_move_control { .len_in = len_, \ } +struct kmem_cache_args { + /** + * @align: The required alignment for the objects. + * + * %0 means no specific alignment is requested. + */ + unsigned int align; + /** + * @useroffset: Usercopy region offset. + * + * %0 is a valid offset, when @usersize is non-%0 + */ + unsigned int useroffset; + /** + * @usersize: Usercopy region size. + * + * %0 means no usercopy region is specified. 
+ */ + unsigned int usersize; + /** + * @freeptr_offset: Custom offset for the free pointer + * in &SLAB_TYPESAFE_BY_RCU caches + * + * By default &SLAB_TYPESAFE_BY_RCU caches place the free pointer + * outside of the object. This might cause the object to grow in size. + * Cache creators that have a reason to avoid this can specify a custom + * free pointer offset in their struct where the free pointer will be + * placed. + * + * Note that placing the free pointer inside the object requires the + * caller to ensure that no fields are invalidated that are required to + * guard against object recycling (See &SLAB_TYPESAFE_BY_RCU for + * details). + * + * Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset + * is specified, %use_freeptr_offset must be set %true. + * + * Note that @ctor currently isn't supported with custom free pointers + * as a @ctor requires an external free pointer. + */ + unsigned int freeptr_offset; + /** + * @use_freeptr_offset: Whether a @freeptr_offset is used. + */ + bool use_freeptr_offset; + /** + * @ctor: A constructor for the objects. + * + * The constructor is invoked for each object in a newly allocated slab + * page. It is the cache user's responsibility to free object in the + * same state as after calling the constructor, or deal appropriately + * with any differences between a freshly constructed and a reallocated + * object. + * + * %NULL means no constructor. + */ + void (*ctor)(void *); +}; + static inline void vma_iter_invalidate(struct vma_iterator *vmi) { mas_pause(&vmi->mas); @@ -547,31 +610,38 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) vma->vm_lock_seq = UINT_MAX; } -static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) -{ - struct vm_area_struct *vma = calloc(1, sizeof(struct vm_area_struct)); +struct kmem_cache { + const char *name; + size_t object_size; + struct kmem_cache_args *args; +}; - if (!vma) - return NULL; +static inline struct kmem_cache *__kmem_cache_create(const char *name, + size_t object_size, + struct kmem_cache_args *args) +{ + struct kmem_cache *ret = malloc(sizeof(struct kmem_cache)); - vma_init(vma, mm); + ret->name = name; + ret->object_size = object_size; + ret->args = args; - return vma; + return ret; } -static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) -{ - struct vm_area_struct *new = calloc(1, sizeof(struct vm_area_struct)); +#define kmem_cache_create(__name, __object_size, __args, ...) 
\ + __kmem_cache_create((__name), (__object_size), (__args)) - if (!new) - return NULL; +static inline void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) +{ + (void)gfpflags; - memcpy(new, orig, sizeof(*new)); - refcount_set(&new->vm_refcnt, 0); - new->vm_lock_seq = UINT_MAX; - INIT_LIST_HEAD(&new->anon_vma_chain); + return calloc(s->object_size, 1); +} - return new; +static inline void kmem_cache_free(struct kmem_cache *s, void *x) +{ + free(x); } /* @@ -738,11 +808,6 @@ static inline void mpol_put(struct mempolicy *) { } -static inline void vm_area_free(struct vm_area_struct *vma) -{ - free(vma); -} - static inline void lru_add_drain(void) { } @@ -1312,4 +1377,32 @@ static inline void ksm_exit(struct mm_struct *mm) (void)mm; } +static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) +{ + (void)vma; + (void)reset_refcnt; +} + +static inline void vma_numab_state_init(struct vm_area_struct *vma) +{ + (void)vma; +} + +static inline void vma_numab_state_free(struct vm_area_struct *vma) +{ + (void)vma; +} + +static inline void dup_anon_vma_name(struct vm_area_struct *orig_vma, + struct vm_area_struct *new_vma) +{ + (void)orig_vma; + (void)new_vma; +} + +static inline void free_anon_vma_name(struct vm_area_struct *vma) +{ + (void)vma; +} + #endif /* __MM_VMA_INTERNAL_H */ -- 2.49.0 ^ permalink raw reply related [flat|nested] 36+ messages in thread
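One practical consequence for the userland tests is worth spelling out: vm_area_free() calls vma_assert_detached(), so a test must detach a VMA before freeing it, which is exactly what the new detach_free_vma() helper does, and the slab cache has to be set up once via vma_state_init(), now done in main(). Below is a minimal sketch of a test following that lifecycle, using only the helpers shown in the diff above; the test function itself is hypothetical.

	static bool example_alloc_free_test(void)
	{
		struct mm_struct mm = {};
		struct vm_area_struct *vma;

		/* Assumes main() has already called vma_state_init(). */
		vma = vm_area_alloc(&mm);	/* kmem_cache_alloc() + vma_init() */
		if (!vma)
			return false;

		/* ... exercise the VMA under test ... */

		vma_mark_detached(vma);		/* satisfy vma_assert_detached() */
		vm_area_free(vma);

		return true;
	}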
* Re: [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm 2025-04-28 15:28 ` [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes @ 2025-04-28 19:14 ` Liam R. Howlett 2025-04-28 20:28 ` Lorenzo Stoakes 2025-04-28 20:29 ` Lorenzo Stoakes ` (5 subsequent siblings) 6 siblings, 1 reply; 36+ messages in thread From: Liam R. Howlett @ 2025-04-28 19:14 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250428 11:28]: > Right now these are performed in kernel/fork.c which is odd and a violation > of separation of concerns, as well as preventing us from integrating this > and related logic into userland VMA testing going forward, and perhaps more > importantly - enabling us to, in a subsequent commit, make VMA > allocation/freeing a purely internal mm operation. > > There is a fly in the ointment - nommu - mmap.c is not compiled if > CONFIG_MMU not set, and neither is vma.c. > > To square the circle, let's add a new file - vma_init.c. This will be > compiled for both CONFIG_MMU and nommu builds, and will also form part of > the VMA userland testing. > > This allows us to de-duplicate code, while maintaining separation of > concerns and the ability for us to userland test this logic. > > Update the VMA userland tests accordingly, additionally adding a > detach_free_vma() helper function to correctly detach VMAs before freeing > them in test code, as this change was triggering the assert for this. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> One small nit below. Reviewed-by: Liam R. 
Howlett <Liam.Howlett@oracle.com> > --- > MAINTAINERS | 1 + > kernel/fork.c | 88 ------------------- > mm/Makefile | 2 +- > mm/mmap.c | 3 +- > mm/nommu.c | 4 +- > mm/vma.h | 7 ++ > mm/vma_init.c | 101 ++++++++++++++++++++++ > tools/testing/vma/Makefile | 2 +- > tools/testing/vma/vma.c | 26 ++++-- > tools/testing/vma/vma_internal.h | 143 +++++++++++++++++++++++++------ > 10 files changed, 251 insertions(+), 126 deletions(-) > create mode 100644 mm/vma_init.c > > diff --git a/MAINTAINERS b/MAINTAINERS > index 1ee1c22e6e36..d274e6802ba5 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -15656,6 +15656,7 @@ F: mm/mseal.c > F: mm/vma.c > F: mm/vma.h > F: mm/vma_exec.c > +F: mm/vma_init.c > F: mm/vma_internal.h > F: tools/testing/selftests/mm/merge.c > F: tools/testing/vma/ > diff --git a/kernel/fork.c b/kernel/fork.c > index ac9f9267a473..9e4616dacd82 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -431,88 +431,9 @@ struct kmem_cache *files_cachep; > /* SLAB cache for fs_struct structures (tsk->fs) */ > struct kmem_cache *fs_cachep; > > -/* SLAB cache for vm_area_struct structures */ > -static struct kmem_cache *vm_area_cachep; > - > /* SLAB cache for mm_struct structures (tsk->mm) */ > static struct kmem_cache *mm_cachep; > > -struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) > -{ > - struct vm_area_struct *vma; > - > - vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > - if (!vma) > - return NULL; > - > - vma_init(vma, mm); > - > - return vma; > -} > - > -static void vm_area_init_from(const struct vm_area_struct *src, > - struct vm_area_struct *dest) > -{ > - dest->vm_mm = src->vm_mm; > - dest->vm_ops = src->vm_ops; > - dest->vm_start = src->vm_start; > - dest->vm_end = src->vm_end; > - dest->anon_vma = src->anon_vma; > - dest->vm_pgoff = src->vm_pgoff; > - dest->vm_file = src->vm_file; > - dest->vm_private_data = src->vm_private_data; > - vm_flags_init(dest, src->vm_flags); > - memcpy(&dest->vm_page_prot, &src->vm_page_prot, > - sizeof(dest->vm_page_prot)); > - /* > - * src->shared.rb may be modified concurrently when called from > - * dup_mmap(), but the clone will reinitialize it. > - */ > - data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared))); > - memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx, > - sizeof(dest->vm_userfaultfd_ctx)); > -#ifdef CONFIG_ANON_VMA_NAME > - dest->anon_name = src->anon_name; > -#endif > -#ifdef CONFIG_SWAP > - memcpy(&dest->swap_readahead_info, &src->swap_readahead_info, > - sizeof(dest->swap_readahead_info)); > -#endif > -#ifndef CONFIG_MMU > - dest->vm_region = src->vm_region; > -#endif > -#ifdef CONFIG_NUMA > - dest->vm_policy = src->vm_policy; > -#endif > -} > - > -struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) > -{ > - struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > - > - if (!new) > - return NULL; > - > - ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); > - ASSERT_EXCLUSIVE_WRITER(orig->vm_file); > - vm_area_init_from(orig, new); > - vma_lock_init(new, true); > - INIT_LIST_HEAD(&new->anon_vma_chain); > - vma_numab_state_init(new); > - dup_anon_vma_name(orig, new); > - > - return new; > -} > - > -void vm_area_free(struct vm_area_struct *vma) > -{ > - /* The vma should be detached while being destroyed. 
*/ > - vma_assert_detached(vma); > - vma_numab_state_free(vma); > - free_anon_vma_name(vma); > - kmem_cache_free(vm_area_cachep, vma); > -} > - > static void account_kernel_stack(struct task_struct *tsk, int account) > { > if (IS_ENABLED(CONFIG_VMAP_STACK)) { > @@ -3033,11 +2954,6 @@ void __init mm_cache_init(void) > > void __init proc_caches_init(void) > { > - struct kmem_cache_args args = { > - .use_freeptr_offset = true, > - .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr), > - }; > - > sighand_cachep = kmem_cache_create("sighand_cache", > sizeof(struct sighand_struct), 0, > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| > @@ -3054,10 +2970,6 @@ void __init proc_caches_init(void) > sizeof(struct fs_struct), 0, > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, > NULL); > - vm_area_cachep = kmem_cache_create("vm_area_struct", > - sizeof(struct vm_area_struct), &args, > - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| > - SLAB_ACCOUNT); > mmap_init(); > nsproxy_cache_init(); > } > diff --git a/mm/Makefile b/mm/Makefile > index 15a901bb431a..690ddcf7d9a1 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -55,7 +55,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ > mm_init.o percpu.o slab_common.o \ > compaction.o show_mem.o \ > interval_tree.o list_lru.o workingset.o \ > - debug.o gup.o mmap_lock.o $(mmu-y) > + debug.o gup.o mmap_lock.o vma_init.o $(mmu-y) > > # Give 'page_alloc' its own module-parameter namespace > page-alloc-y := page_alloc.o > diff --git a/mm/mmap.c b/mm/mmap.c > index 5259df031e15..81dd962a1cfc 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1554,7 +1554,7 @@ static const struct ctl_table mmap_table[] = { > #endif /* CONFIG_SYSCTL */ > > /* > - * initialise the percpu counter for VM > + * initialise the percpu counter for VM, initialise VMA state. > */ > void __init mmap_init(void) > { > @@ -1565,6 +1565,7 @@ void __init mmap_init(void) > #ifdef CONFIG_SYSCTL > register_sysctl_init("vm", mmap_table); > #endif > + vma_state_init(); > } > > /* > diff --git a/mm/nommu.c b/mm/nommu.c > index a142fc258d39..0bf4849b8204 100644 > --- a/mm/nommu.c > +++ b/mm/nommu.c > @@ -399,7 +399,8 @@ static const struct ctl_table nommu_table[] = { > }; > > /* > - * initialise the percpu counter for VM and region record slabs > + * initialise the percpu counter for VM and region record slabs, initialise VMA > + * state. > */ > void __init mmap_init(void) > { > @@ -409,6 +410,7 @@ void __init mmap_init(void) > VM_BUG_ON(ret); > vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC|SLAB_ACCOUNT); > register_sysctl_init("vm", nommu_table); > + vma_state_init(); > } > > /* > diff --git a/mm/vma.h b/mm/vma.h > index 94307a2e4ab6..4a1e1768ca46 100644 > --- a/mm/vma.h > +++ b/mm/vma.h > @@ -548,8 +548,15 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address); > > int __vm_munmap(unsigned long start, size_t len, bool unlock); > > + Accidental extra line here? > int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma); > > +/* vma_init.h, shared between CONFIG_MMU and nommu. 
*/ > +void __init vma_state_init(void); > +struct vm_area_struct *vm_area_alloc(struct mm_struct *mm); > +struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig); > +void vm_area_free(struct vm_area_struct *vma); > + > /* vma_exec.h */ > #ifdef CONFIG_MMU > int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap, > diff --git a/mm/vma_init.c b/mm/vma_init.c > new file mode 100644 > index 000000000000..967ca8517986 > --- /dev/null > +++ b/mm/vma_init.c > @@ -0,0 +1,101 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > + > +/* > + * Functions for initialisaing, allocating, freeing and duplicating VMAs. Shared > + * between CONFIG_MMU and non-CONFIG_MMU kernel configurations. > + */ > + > +#include "vma_internal.h" > +#include "vma.h" > + > +/* SLAB cache for vm_area_struct structures */ > +static struct kmem_cache *vm_area_cachep; > + > +void __init vma_state_init(void) > +{ > + struct kmem_cache_args args = { > + .use_freeptr_offset = true, > + .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr), > + }; > + > + vm_area_cachep = kmem_cache_create("vm_area_struct", > + sizeof(struct vm_area_struct), &args, > + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| > + SLAB_ACCOUNT); > +} > + > +struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) > +{ > + struct vm_area_struct *vma; > + > + vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > + if (!vma) > + return NULL; > + > + vma_init(vma, mm); > + > + return vma; > +} > + > +static void vm_area_init_from(const struct vm_area_struct *src, > + struct vm_area_struct *dest) > +{ > + dest->vm_mm = src->vm_mm; > + dest->vm_ops = src->vm_ops; > + dest->vm_start = src->vm_start; > + dest->vm_end = src->vm_end; > + dest->anon_vma = src->anon_vma; > + dest->vm_pgoff = src->vm_pgoff; > + dest->vm_file = src->vm_file; > + dest->vm_private_data = src->vm_private_data; > + vm_flags_init(dest, src->vm_flags); > + memcpy(&dest->vm_page_prot, &src->vm_page_prot, > + sizeof(dest->vm_page_prot)); > + /* > + * src->shared.rb may be modified concurrently when called from > + * dup_mmap(), but the clone will reinitialize it. > + */ > + data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared))); > + memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx, > + sizeof(dest->vm_userfaultfd_ctx)); > +#ifdef CONFIG_ANON_VMA_NAME > + dest->anon_name = src->anon_name; > +#endif > +#ifdef CONFIG_SWAP > + memcpy(&dest->swap_readahead_info, &src->swap_readahead_info, > + sizeof(dest->swap_readahead_info)); > +#endif > +#ifndef CONFIG_MMU > + dest->vm_region = src->vm_region; > +#endif > +#ifdef CONFIG_NUMA > + dest->vm_policy = src->vm_policy; > +#endif > +} > + > +struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) > +{ > + struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > + > + if (!new) > + return NULL; > + > + ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); > + ASSERT_EXCLUSIVE_WRITER(orig->vm_file); > + vm_area_init_from(orig, new); > + vma_lock_init(new, true); > + INIT_LIST_HEAD(&new->anon_vma_chain); > + vma_numab_state_init(new); > + dup_anon_vma_name(orig, new); > + > + return new; > +} > + > +void vm_area_free(struct vm_area_struct *vma) > +{ > + /* The vma should be detached while being destroyed. 
*/ > + vma_assert_detached(vma); > + vma_numab_state_free(vma); > + free_anon_vma_name(vma); > + kmem_cache_free(vm_area_cachep, vma); > +} > diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile > index 624040fcf193..66f3831a668f 100644 > --- a/tools/testing/vma/Makefile > +++ b/tools/testing/vma/Makefile > @@ -9,7 +9,7 @@ include ../shared/shared.mk > OFILES = $(SHARED_OFILES) vma.o maple-shim.o > TARGETS = vma > > -vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_exec.c ../../../mm/vma.h > +vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_init.c ../../../mm/vma_exec.c ../../../mm/vma.h > > vma: $(OFILES) > $(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS) > diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c > index 5832ae5d797d..2be7597a2ac2 100644 > --- a/tools/testing/vma/vma.c > +++ b/tools/testing/vma/vma.c > @@ -28,6 +28,7 @@ unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT; > * Directly import the VMA implementation here. Our vma_internal.h wrapper > * provides userland-equivalent functionality for everything vma.c uses. > */ > +#include "../../../mm/vma_init.c" > #include "../../../mm/vma_exec.c" > #include "../../../mm/vma.c" > > @@ -91,6 +92,12 @@ static int attach_vma(struct mm_struct *mm, struct vm_area_struct *vma) > return res; > } > > +static void detach_free_vma(struct vm_area_struct *vma) > +{ > + vma_mark_detached(vma); > + vm_area_free(vma); > +} > + > /* Helper function to allocate a VMA and link it to the tree. */ > static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm, > unsigned long start, > @@ -104,7 +111,7 @@ static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm, > return NULL; > > if (attach_vma(mm, vma)) { > - vm_area_free(vma); > + detach_free_vma(vma); > return NULL; > } > > @@ -249,7 +256,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi) > > vma_iter_set(vmi, 0); > for_each_vma(*vmi, vma) { > - vm_area_free(vma); > + detach_free_vma(vma); > count++; > } > > @@ -319,7 +326,7 @@ static bool test_simple_merge(void) > ASSERT_EQ(vma->vm_pgoff, 0); > ASSERT_EQ(vma->vm_flags, flags); > > - vm_area_free(vma); > + detach_free_vma(vma); > mtree_destroy(&mm.mm_mt); > > return true; > @@ -361,7 +368,7 @@ static bool test_simple_modify(void) > ASSERT_EQ(vma->vm_end, 0x1000); > ASSERT_EQ(vma->vm_pgoff, 0); > > - vm_area_free(vma); > + detach_free_vma(vma); > vma_iter_clear(&vmi); > > vma = vma_next(&vmi); > @@ -370,7 +377,7 @@ static bool test_simple_modify(void) > ASSERT_EQ(vma->vm_end, 0x2000); > ASSERT_EQ(vma->vm_pgoff, 1); > > - vm_area_free(vma); > + detach_free_vma(vma); > vma_iter_clear(&vmi); > > vma = vma_next(&vmi); > @@ -379,7 +386,7 @@ static bool test_simple_modify(void) > ASSERT_EQ(vma->vm_end, 0x3000); > ASSERT_EQ(vma->vm_pgoff, 2); > > - vm_area_free(vma); > + detach_free_vma(vma); > mtree_destroy(&mm.mm_mt); > > return true; > @@ -407,7 +414,7 @@ static bool test_simple_expand(void) > ASSERT_EQ(vma->vm_end, 0x3000); > ASSERT_EQ(vma->vm_pgoff, 0); > > - vm_area_free(vma); > + detach_free_vma(vma); > mtree_destroy(&mm.mm_mt); > > return true; > @@ -428,7 +435,7 @@ static bool test_simple_shrink(void) > ASSERT_EQ(vma->vm_end, 0x1000); > ASSERT_EQ(vma->vm_pgoff, 0); > > - vm_area_free(vma); > + detach_free_vma(vma); > mtree_destroy(&mm.mm_mt); > > return true; > @@ -619,7 +626,7 @@ static bool test_merge_new(void) > ASSERT_EQ(vma->vm_pgoff, 0); > ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); > > - vm_area_free(vma); > + detach_free_vma(vma); > count++; > } > > 
@@ -1668,6 +1675,7 @@ int main(void) > int num_tests = 0, num_fail = 0; > > maple_tree_init(); > + vma_state_init(); > > #define TEST(name) \ > do { \ > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h > index 32e990313158..198abe66de5a 100644 > --- a/tools/testing/vma/vma_internal.h > +++ b/tools/testing/vma/vma_internal.h > @@ -155,6 +155,10 @@ typedef __bitwise unsigned int vm_fault_t; > */ > #define pr_warn_once pr_err > > +#define data_race(expr) expr > + > +#define ASSERT_EXCLUSIVE_WRITER(x) > + > struct kref { > refcount_t refcount; > }; > @@ -255,6 +259,8 @@ struct file { > > #define VMA_LOCK_OFFSET 0x40000000 > > +typedef struct { unsigned long v; } freeptr_t; > + > struct vm_area_struct { > /* The first cache line has the info for VMA tree walking. */ > > @@ -264,9 +270,7 @@ struct vm_area_struct { > unsigned long vm_start; > unsigned long vm_end; > }; > -#ifdef CONFIG_PER_VMA_LOCK > - struct rcu_head vm_rcu; /* Used for deferred freeing. */ > -#endif > + freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */ > }; > > struct mm_struct *vm_mm; /* The address space we belong to. */ > @@ -463,6 +467,65 @@ struct pagetable_move_control { > .len_in = len_, \ > } > > +struct kmem_cache_args { > + /** > + * @align: The required alignment for the objects. > + * > + * %0 means no specific alignment is requested. > + */ > + unsigned int align; > + /** > + * @useroffset: Usercopy region offset. > + * > + * %0 is a valid offset, when @usersize is non-%0 > + */ > + unsigned int useroffset; > + /** > + * @usersize: Usercopy region size. > + * > + * %0 means no usercopy region is specified. > + */ > + unsigned int usersize; > + /** > + * @freeptr_offset: Custom offset for the free pointer > + * in &SLAB_TYPESAFE_BY_RCU caches > + * > + * By default &SLAB_TYPESAFE_BY_RCU caches place the free pointer > + * outside of the object. This might cause the object to grow in size. > + * Cache creators that have a reason to avoid this can specify a custom > + * free pointer offset in their struct where the free pointer will be > + * placed. > + * > + * Note that placing the free pointer inside the object requires the > + * caller to ensure that no fields are invalidated that are required to > + * guard against object recycling (See &SLAB_TYPESAFE_BY_RCU for > + * details). > + * > + * Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset > + * is specified, %use_freeptr_offset must be set %true. > + * > + * Note that @ctor currently isn't supported with custom free pointers > + * as a @ctor requires an external free pointer. > + */ > + unsigned int freeptr_offset; > + /** > + * @use_freeptr_offset: Whether a @freeptr_offset is used. > + */ > + bool use_freeptr_offset; > + /** > + * @ctor: A constructor for the objects. > + * > + * The constructor is invoked for each object in a newly allocated slab > + * page. It is the cache user's responsibility to free object in the > + * same state as after calling the constructor, or deal appropriately > + * with any differences between a freshly constructed and a reallocated > + * object. > + * > + * %NULL means no constructor. 
> + */ > + void (*ctor)(void *); > +}; > + > static inline void vma_iter_invalidate(struct vma_iterator *vmi) > { > mas_pause(&vmi->mas); > @@ -547,31 +610,38 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) > vma->vm_lock_seq = UINT_MAX; > } > > -static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) > -{ > - struct vm_area_struct *vma = calloc(1, sizeof(struct vm_area_struct)); > +struct kmem_cache { > + const char *name; > + size_t object_size; > + struct kmem_cache_args *args; > +}; > > - if (!vma) > - return NULL; > +static inline struct kmem_cache *__kmem_cache_create(const char *name, > + size_t object_size, > + struct kmem_cache_args *args) > +{ > + struct kmem_cache *ret = malloc(sizeof(struct kmem_cache)); > > - vma_init(vma, mm); > + ret->name = name; > + ret->object_size = object_size; > + ret->args = args; > > - return vma; > + return ret; > } > > -static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) > -{ > - struct vm_area_struct *new = calloc(1, sizeof(struct vm_area_struct)); > +#define kmem_cache_create(__name, __object_size, __args, ...) \ > + __kmem_cache_create((__name), (__object_size), (__args)) > > - if (!new) > - return NULL; > +static inline void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) > +{ > + (void)gfpflags; > > - memcpy(new, orig, sizeof(*new)); > - refcount_set(&new->vm_refcnt, 0); > - new->vm_lock_seq = UINT_MAX; > - INIT_LIST_HEAD(&new->anon_vma_chain); > + return calloc(s->object_size, 1); > +} > > - return new; > +static inline void kmem_cache_free(struct kmem_cache *s, void *x) > +{ > + free(x); > } > > /* > @@ -738,11 +808,6 @@ static inline void mpol_put(struct mempolicy *) > { > } > > -static inline void vm_area_free(struct vm_area_struct *vma) > -{ > - free(vma); > -} > - > static inline void lru_add_drain(void) > { > } > @@ -1312,4 +1377,32 @@ static inline void ksm_exit(struct mm_struct *mm) > (void)mm; > } > > +static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) > +{ > + (void)vma; > + (void)reset_refcnt; > +} > + > +static inline void vma_numab_state_init(struct vm_area_struct *vma) > +{ > + (void)vma; > +} > + > +static inline void vma_numab_state_free(struct vm_area_struct *vma) > +{ > + (void)vma; > +} > + > +static inline void dup_anon_vma_name(struct vm_area_struct *orig_vma, > + struct vm_area_struct *new_vma) > +{ > + (void)orig_vma; > + (void)new_vma; > +} > + > +static inline void free_anon_vma_name(struct vm_area_struct *vma) > +{ > + (void)vma; > +} > + > #endif /* __MM_VMA_INTERNAL_H */ > -- > 2.49.0 > ^ permalink raw reply [flat|nested] 36+ messages in thread
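For context on the slab API the new vma_state_init() leans on: SLAB_TYPESAFE_BY_RCU caches normally place their free pointer outside the object (growing it), and struct kmem_cache_args lets the cache creator point it at a field inside the object instead. Below is a minimal sketch of that pattern; struct foo, foo_cachep and foo_cache_init() are made up for illustration, the patch's real equivalent being vma_state_init() creating the vm_area_struct cache with vm_freeptr as the free pointer.

#include <linux/init.h>
#include <linux/slab.h>
#include <linux/stddef.h>

/* Illustrative only: "foo" stands in for vm_area_struct. */
struct foo {
	unsigned long payload;
	freeptr_t free;	/* slot reused by the allocator once the object is freed */
};

static struct kmem_cache *foo_cachep;

static void __init foo_cache_init(void)
{
	struct kmem_cache_args args = {
		.use_freeptr_offset = true,
		.freeptr_offset = offsetof(struct foo, free),
	};

	foo_cachep = kmem_cache_create("foo", sizeof(struct foo), &args,
				       SLAB_TYPESAFE_BY_RCU | SLAB_ACCOUNT);
}

With SLAB_TYPESAFE_BY_RCU only the backing slab is kept for a grace period; an object's memory may be recycled for a new object of the same type immediately, so RCU readers must revalidate an object after looking it up. Keeping the free pointer at a caller-chosen offset avoids growing the object to accommodate it.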
* Re: [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm 2025-04-28 19:14 ` Liam R. Howlett @ 2025-04-28 20:28 ` Lorenzo Stoakes 0 siblings, 0 replies; 36+ messages in thread From: Lorenzo Stoakes @ 2025-04-28 20:28 UTC (permalink / raw) To: Liam R. Howlett, Andrew Morton, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Mon, Apr 28, 2025 at 03:14:46PM -0400, Liam R. Howlett wrote: > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250428 11:28]: > > Right now these are performed in kernel/fork.c which is odd and a violation > > of separation of concerns, as well as preventing us from integrating this > > and related logic into userland VMA testing going forward, and perhaps more > > importantly - enabling us to, in a subsequent commit, make VMA > > allocation/freeing a purely internal mm operation. > > > > There is a fly in the ointment - nommu - mmap.c is not compiled if > > CONFIG_MMU not set, and neither is vma.c. > > > > To square the circle, let's add a new file - vma_init.c. This will be > > compiled for both CONFIG_MMU and nommu builds, and will also form part of > > the VMA userland testing. > > > > This allows us to de-duplicate code, while maintaining separation of > > concerns and the ability for us to userland test this logic. > > > > Update the VMA userland tests accordingly, additionally adding a > > detach_free_vma() helper function to correctly detach VMAs before freeing > > them in test code, as this change was triggering the assert for this. > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > One small nit below. > > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Thanks! 
I see Andrew already fixed up the nit :) > > > --- > > MAINTAINERS | 1 + > > kernel/fork.c | 88 ------------------- > > mm/Makefile | 2 +- > > mm/mmap.c | 3 +- > > mm/nommu.c | 4 +- > > mm/vma.h | 7 ++ > > mm/vma_init.c | 101 ++++++++++++++++++++++ > > tools/testing/vma/Makefile | 2 +- > > tools/testing/vma/vma.c | 26 ++++-- > > tools/testing/vma/vma_internal.h | 143 +++++++++++++++++++++++++------ > > 10 files changed, 251 insertions(+), 126 deletions(-) > > create mode 100644 mm/vma_init.c > > > > diff --git a/MAINTAINERS b/MAINTAINERS > > index 1ee1c22e6e36..d274e6802ba5 100644 > > --- a/MAINTAINERS > > +++ b/MAINTAINERS > > @@ -15656,6 +15656,7 @@ F: mm/mseal.c > > F: mm/vma.c > > F: mm/vma.h > > F: mm/vma_exec.c > > +F: mm/vma_init.c > > F: mm/vma_internal.h > > F: tools/testing/selftests/mm/merge.c > > F: tools/testing/vma/ > > diff --git a/kernel/fork.c b/kernel/fork.c > > index ac9f9267a473..9e4616dacd82 100644 > > --- a/kernel/fork.c > > +++ b/kernel/fork.c > > @@ -431,88 +431,9 @@ struct kmem_cache *files_cachep; > > /* SLAB cache for fs_struct structures (tsk->fs) */ > > struct kmem_cache *fs_cachep; > > > > -/* SLAB cache for vm_area_struct structures */ > > -static struct kmem_cache *vm_area_cachep; > > - > > /* SLAB cache for mm_struct structures (tsk->mm) */ > > static struct kmem_cache *mm_cachep; > > > > -struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) > > -{ > > - struct vm_area_struct *vma; > > - > > - vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > > - if (!vma) > > - return NULL; > > - > > - vma_init(vma, mm); > > - > > - return vma; > > -} > > - > > -static void vm_area_init_from(const struct vm_area_struct *src, > > - struct vm_area_struct *dest) > > -{ > > - dest->vm_mm = src->vm_mm; > > - dest->vm_ops = src->vm_ops; > > - dest->vm_start = src->vm_start; > > - dest->vm_end = src->vm_end; > > - dest->anon_vma = src->anon_vma; > > - dest->vm_pgoff = src->vm_pgoff; > > - dest->vm_file = src->vm_file; > > - dest->vm_private_data = src->vm_private_data; > > - vm_flags_init(dest, src->vm_flags); > > - memcpy(&dest->vm_page_prot, &src->vm_page_prot, > > - sizeof(dest->vm_page_prot)); > > - /* > > - * src->shared.rb may be modified concurrently when called from > > - * dup_mmap(), but the clone will reinitialize it. > > - */ > > - data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared))); > > - memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx, > > - sizeof(dest->vm_userfaultfd_ctx)); > > -#ifdef CONFIG_ANON_VMA_NAME > > - dest->anon_name = src->anon_name; > > -#endif > > -#ifdef CONFIG_SWAP > > - memcpy(&dest->swap_readahead_info, &src->swap_readahead_info, > > - sizeof(dest->swap_readahead_info)); > > -#endif > > -#ifndef CONFIG_MMU > > - dest->vm_region = src->vm_region; > > -#endif > > -#ifdef CONFIG_NUMA > > - dest->vm_policy = src->vm_policy; > > -#endif > > -} > > - > > -struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) > > -{ > > - struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > > - > > - if (!new) > > - return NULL; > > - > > - ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); > > - ASSERT_EXCLUSIVE_WRITER(orig->vm_file); > > - vm_area_init_from(orig, new); > > - vma_lock_init(new, true); > > - INIT_LIST_HEAD(&new->anon_vma_chain); > > - vma_numab_state_init(new); > > - dup_anon_vma_name(orig, new); > > - > > - return new; > > -} > > - > > -void vm_area_free(struct vm_area_struct *vma) > > -{ > > - /* The vma should be detached while being destroyed. 
*/ > > - vma_assert_detached(vma); > > - vma_numab_state_free(vma); > > - free_anon_vma_name(vma); > > - kmem_cache_free(vm_area_cachep, vma); > > -} > > - > > static void account_kernel_stack(struct task_struct *tsk, int account) > > { > > if (IS_ENABLED(CONFIG_VMAP_STACK)) { > > @@ -3033,11 +2954,6 @@ void __init mm_cache_init(void) > > > > void __init proc_caches_init(void) > > { > > - struct kmem_cache_args args = { > > - .use_freeptr_offset = true, > > - .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr), > > - }; > > - > > sighand_cachep = kmem_cache_create("sighand_cache", > > sizeof(struct sighand_struct), 0, > > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| > > @@ -3054,10 +2970,6 @@ void __init proc_caches_init(void) > > sizeof(struct fs_struct), 0, > > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, > > NULL); > > - vm_area_cachep = kmem_cache_create("vm_area_struct", > > - sizeof(struct vm_area_struct), &args, > > - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| > > - SLAB_ACCOUNT); > > mmap_init(); > > nsproxy_cache_init(); > > } > > diff --git a/mm/Makefile b/mm/Makefile > > index 15a901bb431a..690ddcf7d9a1 100644 > > --- a/mm/Makefile > > +++ b/mm/Makefile > > @@ -55,7 +55,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ > > mm_init.o percpu.o slab_common.o \ > > compaction.o show_mem.o \ > > interval_tree.o list_lru.o workingset.o \ > > - debug.o gup.o mmap_lock.o $(mmu-y) > > + debug.o gup.o mmap_lock.o vma_init.o $(mmu-y) > > > > # Give 'page_alloc' its own module-parameter namespace > > page-alloc-y := page_alloc.o > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 5259df031e15..81dd962a1cfc 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -1554,7 +1554,7 @@ static const struct ctl_table mmap_table[] = { > > #endif /* CONFIG_SYSCTL */ > > > > /* > > - * initialise the percpu counter for VM > > + * initialise the percpu counter for VM, initialise VMA state. > > */ > > void __init mmap_init(void) > > { > > @@ -1565,6 +1565,7 @@ void __init mmap_init(void) > > #ifdef CONFIG_SYSCTL > > register_sysctl_init("vm", mmap_table); > > #endif > > + vma_state_init(); > > } > > > > /* > > diff --git a/mm/nommu.c b/mm/nommu.c > > index a142fc258d39..0bf4849b8204 100644 > > --- a/mm/nommu.c > > +++ b/mm/nommu.c > > @@ -399,7 +399,8 @@ static const struct ctl_table nommu_table[] = { > > }; > > > > /* > > - * initialise the percpu counter for VM and region record slabs > > + * initialise the percpu counter for VM and region record slabs, initialise VMA > > + * state. > > */ > > void __init mmap_init(void) > > { > > @@ -409,6 +410,7 @@ void __init mmap_init(void) > > VM_BUG_ON(ret); > > vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC|SLAB_ACCOUNT); > > register_sysctl_init("vm", nommu_table); > > + vma_state_init(); > > } > > > > /* > > diff --git a/mm/vma.h b/mm/vma.h > > index 94307a2e4ab6..4a1e1768ca46 100644 > > --- a/mm/vma.h > > +++ b/mm/vma.h > > @@ -548,8 +548,15 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address); > > > > int __vm_munmap(unsigned long start, size_t len, bool unlock); > > > > + > > Accidental extra line here? > > > int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma); > > > > +/* vma_init.h, shared between CONFIG_MMU and nommu. 
*/ > > +void __init vma_state_init(void); > > +struct vm_area_struct *vm_area_alloc(struct mm_struct *mm); > > +struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig); > > +void vm_area_free(struct vm_area_struct *vma); > > + > > /* vma_exec.h */ > > #ifdef CONFIG_MMU > > int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap, > > diff --git a/mm/vma_init.c b/mm/vma_init.c > > new file mode 100644 > > index 000000000000..967ca8517986 > > --- /dev/null > > +++ b/mm/vma_init.c > > @@ -0,0 +1,101 @@ > > +// SPDX-License-Identifier: GPL-2.0-or-later > > + > > +/* > > + * Functions for initialisaing, allocating, freeing and duplicating VMAs. Shared > > + * between CONFIG_MMU and non-CONFIG_MMU kernel configurations. > > + */ > > + > > +#include "vma_internal.h" > > +#include "vma.h" > > + > > +/* SLAB cache for vm_area_struct structures */ > > +static struct kmem_cache *vm_area_cachep; > > + > > +void __init vma_state_init(void) > > +{ > > + struct kmem_cache_args args = { > > + .use_freeptr_offset = true, > > + .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr), > > + }; > > + > > + vm_area_cachep = kmem_cache_create("vm_area_struct", > > + sizeof(struct vm_area_struct), &args, > > + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| > > + SLAB_ACCOUNT); > > +} > > + > > +struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) > > +{ > > + struct vm_area_struct *vma; > > + > > + vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > > + if (!vma) > > + return NULL; > > + > > + vma_init(vma, mm); > > + > > + return vma; > > +} > > + > > +static void vm_area_init_from(const struct vm_area_struct *src, > > + struct vm_area_struct *dest) > > +{ > > + dest->vm_mm = src->vm_mm; > > + dest->vm_ops = src->vm_ops; > > + dest->vm_start = src->vm_start; > > + dest->vm_end = src->vm_end; > > + dest->anon_vma = src->anon_vma; > > + dest->vm_pgoff = src->vm_pgoff; > > + dest->vm_file = src->vm_file; > > + dest->vm_private_data = src->vm_private_data; > > + vm_flags_init(dest, src->vm_flags); > > + memcpy(&dest->vm_page_prot, &src->vm_page_prot, > > + sizeof(dest->vm_page_prot)); > > + /* > > + * src->shared.rb may be modified concurrently when called from > > + * dup_mmap(), but the clone will reinitialize it. > > + */ > > + data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared))); > > + memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx, > > + sizeof(dest->vm_userfaultfd_ctx)); > > +#ifdef CONFIG_ANON_VMA_NAME > > + dest->anon_name = src->anon_name; > > +#endif > > +#ifdef CONFIG_SWAP > > + memcpy(&dest->swap_readahead_info, &src->swap_readahead_info, > > + sizeof(dest->swap_readahead_info)); > > +#endif > > +#ifndef CONFIG_MMU > > + dest->vm_region = src->vm_region; > > +#endif > > +#ifdef CONFIG_NUMA > > + dest->vm_policy = src->vm_policy; > > +#endif > > +} > > + > > +struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) > > +{ > > + struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > > + > > + if (!new) > > + return NULL; > > + > > + ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); > > + ASSERT_EXCLUSIVE_WRITER(orig->vm_file); > > + vm_area_init_from(orig, new); > > + vma_lock_init(new, true); > > + INIT_LIST_HEAD(&new->anon_vma_chain); > > + vma_numab_state_init(new); > > + dup_anon_vma_name(orig, new); > > + > > + return new; > > +} > > + > > +void vm_area_free(struct vm_area_struct *vma) > > +{ > > + /* The vma should be detached while being destroyed. 
*/ > > + vma_assert_detached(vma); > > + vma_numab_state_free(vma); > > + free_anon_vma_name(vma); > > + kmem_cache_free(vm_area_cachep, vma); > > +} > > diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile > > index 624040fcf193..66f3831a668f 100644 > > --- a/tools/testing/vma/Makefile > > +++ b/tools/testing/vma/Makefile > > @@ -9,7 +9,7 @@ include ../shared/shared.mk > > OFILES = $(SHARED_OFILES) vma.o maple-shim.o > > TARGETS = vma > > > > -vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_exec.c ../../../mm/vma.h > > +vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_init.c ../../../mm/vma_exec.c ../../../mm/vma.h > > > > vma: $(OFILES) > > $(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS) > > diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c > > index 5832ae5d797d..2be7597a2ac2 100644 > > --- a/tools/testing/vma/vma.c > > +++ b/tools/testing/vma/vma.c > > @@ -28,6 +28,7 @@ unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT; > > * Directly import the VMA implementation here. Our vma_internal.h wrapper > > * provides userland-equivalent functionality for everything vma.c uses. > > */ > > +#include "../../../mm/vma_init.c" > > #include "../../../mm/vma_exec.c" > > #include "../../../mm/vma.c" > > > > @@ -91,6 +92,12 @@ static int attach_vma(struct mm_struct *mm, struct vm_area_struct *vma) > > return res; > > } > > > > +static void detach_free_vma(struct vm_area_struct *vma) > > +{ > > + vma_mark_detached(vma); > > + vm_area_free(vma); > > +} > > + > > /* Helper function to allocate a VMA and link it to the tree. */ > > static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm, > > unsigned long start, > > @@ -104,7 +111,7 @@ static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm, > > return NULL; > > > > if (attach_vma(mm, vma)) { > > - vm_area_free(vma); > > + detach_free_vma(vma); > > return NULL; > > } > > > > @@ -249,7 +256,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi) > > > > vma_iter_set(vmi, 0); > > for_each_vma(*vmi, vma) { > > - vm_area_free(vma); > > + detach_free_vma(vma); > > count++; > > } > > > > @@ -319,7 +326,7 @@ static bool test_simple_merge(void) > > ASSERT_EQ(vma->vm_pgoff, 0); > > ASSERT_EQ(vma->vm_flags, flags); > > > > - vm_area_free(vma); > > + detach_free_vma(vma); > > mtree_destroy(&mm.mm_mt); > > > > return true; > > @@ -361,7 +368,7 @@ static bool test_simple_modify(void) > > ASSERT_EQ(vma->vm_end, 0x1000); > > ASSERT_EQ(vma->vm_pgoff, 0); > > > > - vm_area_free(vma); > > + detach_free_vma(vma); > > vma_iter_clear(&vmi); > > > > vma = vma_next(&vmi); > > @@ -370,7 +377,7 @@ static bool test_simple_modify(void) > > ASSERT_EQ(vma->vm_end, 0x2000); > > ASSERT_EQ(vma->vm_pgoff, 1); > > > > - vm_area_free(vma); > > + detach_free_vma(vma); > > vma_iter_clear(&vmi); > > > > vma = vma_next(&vmi); > > @@ -379,7 +386,7 @@ static bool test_simple_modify(void) > > ASSERT_EQ(vma->vm_end, 0x3000); > > ASSERT_EQ(vma->vm_pgoff, 2); > > > > - vm_area_free(vma); > > + detach_free_vma(vma); > > mtree_destroy(&mm.mm_mt); > > > > return true; > > @@ -407,7 +414,7 @@ static bool test_simple_expand(void) > > ASSERT_EQ(vma->vm_end, 0x3000); > > ASSERT_EQ(vma->vm_pgoff, 0); > > > > - vm_area_free(vma); > > + detach_free_vma(vma); > > mtree_destroy(&mm.mm_mt); > > > > return true; > > @@ -428,7 +435,7 @@ static bool test_simple_shrink(void) > > ASSERT_EQ(vma->vm_end, 0x1000); > > ASSERT_EQ(vma->vm_pgoff, 0); > > > > - vm_area_free(vma); > > + detach_free_vma(vma); > > 
mtree_destroy(&mm.mm_mt); > > > > return true; > > @@ -619,7 +626,7 @@ static bool test_merge_new(void) > > ASSERT_EQ(vma->vm_pgoff, 0); > > ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); > > > > - vm_area_free(vma); > > + detach_free_vma(vma); > > count++; > > } > > > > @@ -1668,6 +1675,7 @@ int main(void) > > int num_tests = 0, num_fail = 0; > > > > maple_tree_init(); > > + vma_state_init(); > > > > #define TEST(name) \ > > do { \ > > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h > > index 32e990313158..198abe66de5a 100644 > > --- a/tools/testing/vma/vma_internal.h > > +++ b/tools/testing/vma/vma_internal.h > > @@ -155,6 +155,10 @@ typedef __bitwise unsigned int vm_fault_t; > > */ > > #define pr_warn_once pr_err > > > > +#define data_race(expr) expr > > + > > +#define ASSERT_EXCLUSIVE_WRITER(x) > > + > > struct kref { > > refcount_t refcount; > > }; > > @@ -255,6 +259,8 @@ struct file { > > > > #define VMA_LOCK_OFFSET 0x40000000 > > > > +typedef struct { unsigned long v; } freeptr_t; > > + > > struct vm_area_struct { > > /* The first cache line has the info for VMA tree walking. */ > > > > @@ -264,9 +270,7 @@ struct vm_area_struct { > > unsigned long vm_start; > > unsigned long vm_end; > > }; > > -#ifdef CONFIG_PER_VMA_LOCK > > - struct rcu_head vm_rcu; /* Used for deferred freeing. */ > > -#endif > > + freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */ > > }; > > > > struct mm_struct *vm_mm; /* The address space we belong to. */ > > @@ -463,6 +467,65 @@ struct pagetable_move_control { > > .len_in = len_, \ > > } > > > > +struct kmem_cache_args { > > + /** > > + * @align: The required alignment for the objects. > > + * > > + * %0 means no specific alignment is requested. > > + */ > > + unsigned int align; > > + /** > > + * @useroffset: Usercopy region offset. > > + * > > + * %0 is a valid offset, when @usersize is non-%0 > > + */ > > + unsigned int useroffset; > > + /** > > + * @usersize: Usercopy region size. > > + * > > + * %0 means no usercopy region is specified. > > + */ > > + unsigned int usersize; > > + /** > > + * @freeptr_offset: Custom offset for the free pointer > > + * in &SLAB_TYPESAFE_BY_RCU caches > > + * > > + * By default &SLAB_TYPESAFE_BY_RCU caches place the free pointer > > + * outside of the object. This might cause the object to grow in size. > > + * Cache creators that have a reason to avoid this can specify a custom > > + * free pointer offset in their struct where the free pointer will be > > + * placed. > > + * > > + * Note that placing the free pointer inside the object requires the > > + * caller to ensure that no fields are invalidated that are required to > > + * guard against object recycling (See &SLAB_TYPESAFE_BY_RCU for > > + * details). > > + * > > + * Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset > > + * is specified, %use_freeptr_offset must be set %true. > > + * > > + * Note that @ctor currently isn't supported with custom free pointers > > + * as a @ctor requires an external free pointer. > > + */ > > + unsigned int freeptr_offset; > > + /** > > + * @use_freeptr_offset: Whether a @freeptr_offset is used. > > + */ > > + bool use_freeptr_offset; > > + /** > > + * @ctor: A constructor for the objects. > > + * > > + * The constructor is invoked for each object in a newly allocated slab > > + * page. 
It is the cache user's responsibility to free object in the > > + * same state as after calling the constructor, or deal appropriately > > + * with any differences between a freshly constructed and a reallocated > > + * object. > > + * > > + * %NULL means no constructor. > > + */ > > + void (*ctor)(void *); > > +}; > > + > > static inline void vma_iter_invalidate(struct vma_iterator *vmi) > > { > > mas_pause(&vmi->mas); > > @@ -547,31 +610,38 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) > > vma->vm_lock_seq = UINT_MAX; > > } > > > > -static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) > > -{ > > - struct vm_area_struct *vma = calloc(1, sizeof(struct vm_area_struct)); > > +struct kmem_cache { > > + const char *name; > > + size_t object_size; > > + struct kmem_cache_args *args; > > +}; > > > > - if (!vma) > > - return NULL; > > +static inline struct kmem_cache *__kmem_cache_create(const char *name, > > + size_t object_size, > > + struct kmem_cache_args *args) > > +{ > > + struct kmem_cache *ret = malloc(sizeof(struct kmem_cache)); > > > > - vma_init(vma, mm); > > + ret->name = name; > > + ret->object_size = object_size; > > + ret->args = args; > > > > - return vma; > > + return ret; > > } > > > > -static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) > > -{ > > - struct vm_area_struct *new = calloc(1, sizeof(struct vm_area_struct)); > > +#define kmem_cache_create(__name, __object_size, __args, ...) \ > > + __kmem_cache_create((__name), (__object_size), (__args)) > > > > - if (!new) > > - return NULL; > > +static inline void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) > > +{ > > + (void)gfpflags; > > > > - memcpy(new, orig, sizeof(*new)); > > - refcount_set(&new->vm_refcnt, 0); > > - new->vm_lock_seq = UINT_MAX; > > - INIT_LIST_HEAD(&new->anon_vma_chain); > > + return calloc(s->object_size, 1); > > +} > > > > - return new; > > +static inline void kmem_cache_free(struct kmem_cache *s, void *x) > > +{ > > + free(x); > > } > > > > /* > > @@ -738,11 +808,6 @@ static inline void mpol_put(struct mempolicy *) > > { > > } > > > > -static inline void vm_area_free(struct vm_area_struct *vma) > > -{ > > - free(vma); > > -} > > - > > static inline void lru_add_drain(void) > > { > > } > > @@ -1312,4 +1377,32 @@ static inline void ksm_exit(struct mm_struct *mm) > > (void)mm; > > } > > > > +static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) > > +{ > > + (void)vma; > > + (void)reset_refcnt; > > +} > > + > > +static inline void vma_numab_state_init(struct vm_area_struct *vma) > > +{ > > + (void)vma; > > +} > > + > > +static inline void vma_numab_state_free(struct vm_area_struct *vma) > > +{ > > + (void)vma; > > +} > > + > > +static inline void dup_anon_vma_name(struct vm_area_struct *orig_vma, > > + struct vm_area_struct *new_vma) > > +{ > > + (void)orig_vma; > > + (void)new_vma; > > +} > > + > > +static inline void free_anon_vma_name(struct vm_area_struct *vma) > > +{ > > + (void)vma; > > +} > > + > > #endif /* __MM_VMA_INTERNAL_H */ > > -- > > 2.49.0 > > ^ permalink raw reply [flat|nested] 36+ messages in thread
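On the testing side, the shims quoted above are deliberately thin: the userland kmem_cache_create() just records name, size and args, kmem_cache_alloc() falls back to a zeroing calloc() (matching the zeroed allocations the old test-only vm_area_alloc() produced), and kmem_cache_free() is free(). That is enough for the real allocation paths in mm/vma_init.c to run inside the harness once main() has called vma_state_init(). A rough sketch of what a test-side caller looks like now; the function name and addresses are illustrative, not from the patch:

/* Sketch only, assuming vma_state_init() has already been called. */
static bool example_alloc_and_free(struct mm_struct *mm)
{
	/* Runs the real vm_area_alloc() from mm/vma_init.c. */
	struct vm_area_struct *vma = vm_area_alloc(mm);

	if (!vma)
		return false;

	vma->vm_start = 0x1000;
	vma->vm_end = 0x2000;

	/*
	 * vm_area_free() asserts the VMA is detached, which is why the tests
	 * gained detach_free_vma() rather than calling vm_area_free() directly.
	 */
	detach_free_vma(vma);
	return true;
}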
* Re: [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm 2025-04-28 15:28 ` [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes 2025-04-28 19:14 ` Liam R. Howlett @ 2025-04-28 20:29 ` Lorenzo Stoakes 2025-04-29 7:22 ` Vlastimil Babka ` (4 subsequent siblings) 6 siblings, 0 replies; 36+ messages in thread From: Lorenzo Stoakes @ 2025-04-28 20:29 UTC (permalink / raw) To: Andrew Morton Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel Andrew - I managed to typo vma_init.c to vma_init.h here in mm/vma.c (at least I was consistent in these silly typos...). Would you prefer a fixpatch or is it ok for you to quickly fix that up? Thanks! On Mon, Apr 28, 2025 at 04:28:17PM +0100, Lorenzo Stoakes wrote: > Right now these are performed in kernel/fork.c which is odd and a violation > of separation of concerns, as well as preventing us from integrating this > and related logic into userland VMA testing going forward, and perhaps more > importantly - enabling us to, in a subsequent commit, make VMA > allocation/freeing a purely internal mm operation. > > There is a fly in the ointment - nommu - mmap.c is not compiled if > CONFIG_MMU not set, and neither is vma.c. > > To square the circle, let's add a new file - vma_init.c. This will be > compiled for both CONFIG_MMU and nommu builds, and will also form part of > the VMA userland testing. > > This allows us to de-duplicate code, while maintaining separation of > concerns and the ability for us to userland test this logic. > > Update the VMA userland tests accordingly, additionally adding a > detach_free_vma() helper function to correctly detach VMAs before freeing > them in test code, as this change was triggering the assert for this. 
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > --- > MAINTAINERS | 1 + > kernel/fork.c | 88 ------------------- > mm/Makefile | 2 +- > mm/mmap.c | 3 +- > mm/nommu.c | 4 +- > mm/vma.h | 7 ++ > mm/vma_init.c | 101 ++++++++++++++++++++++ > tools/testing/vma/Makefile | 2 +- > tools/testing/vma/vma.c | 26 ++++-- > tools/testing/vma/vma_internal.h | 143 +++++++++++++++++++++++++------ > 10 files changed, 251 insertions(+), 126 deletions(-) > create mode 100644 mm/vma_init.c > > diff --git a/MAINTAINERS b/MAINTAINERS > index 1ee1c22e6e36..d274e6802ba5 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -15656,6 +15656,7 @@ F: mm/mseal.c > F: mm/vma.c > F: mm/vma.h > F: mm/vma_exec.c > +F: mm/vma_init.c > F: mm/vma_internal.h > F: tools/testing/selftests/mm/merge.c > F: tools/testing/vma/ > diff --git a/kernel/fork.c b/kernel/fork.c > index ac9f9267a473..9e4616dacd82 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -431,88 +431,9 @@ struct kmem_cache *files_cachep; > /* SLAB cache for fs_struct structures (tsk->fs) */ > struct kmem_cache *fs_cachep; > > -/* SLAB cache for vm_area_struct structures */ > -static struct kmem_cache *vm_area_cachep; > - > /* SLAB cache for mm_struct structures (tsk->mm) */ > static struct kmem_cache *mm_cachep; > > -struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) > -{ > - struct vm_area_struct *vma; > - > - vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > - if (!vma) > - return NULL; > - > - vma_init(vma, mm); > - > - return vma; > -} > - > -static void vm_area_init_from(const struct vm_area_struct *src, > - struct vm_area_struct *dest) > -{ > - dest->vm_mm = src->vm_mm; > - dest->vm_ops = src->vm_ops; > - dest->vm_start = src->vm_start; > - dest->vm_end = src->vm_end; > - dest->anon_vma = src->anon_vma; > - dest->vm_pgoff = src->vm_pgoff; > - dest->vm_file = src->vm_file; > - dest->vm_private_data = src->vm_private_data; > - vm_flags_init(dest, src->vm_flags); > - memcpy(&dest->vm_page_prot, &src->vm_page_prot, > - sizeof(dest->vm_page_prot)); > - /* > - * src->shared.rb may be modified concurrently when called from > - * dup_mmap(), but the clone will reinitialize it. > - */ > - data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared))); > - memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx, > - sizeof(dest->vm_userfaultfd_ctx)); > -#ifdef CONFIG_ANON_VMA_NAME > - dest->anon_name = src->anon_name; > -#endif > -#ifdef CONFIG_SWAP > - memcpy(&dest->swap_readahead_info, &src->swap_readahead_info, > - sizeof(dest->swap_readahead_info)); > -#endif > -#ifndef CONFIG_MMU > - dest->vm_region = src->vm_region; > -#endif > -#ifdef CONFIG_NUMA > - dest->vm_policy = src->vm_policy; > -#endif > -} > - > -struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) > -{ > - struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > - > - if (!new) > - return NULL; > - > - ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); > - ASSERT_EXCLUSIVE_WRITER(orig->vm_file); > - vm_area_init_from(orig, new); > - vma_lock_init(new, true); > - INIT_LIST_HEAD(&new->anon_vma_chain); > - vma_numab_state_init(new); > - dup_anon_vma_name(orig, new); > - > - return new; > -} > - > -void vm_area_free(struct vm_area_struct *vma) > -{ > - /* The vma should be detached while being destroyed. 
*/ > - vma_assert_detached(vma); > - vma_numab_state_free(vma); > - free_anon_vma_name(vma); > - kmem_cache_free(vm_area_cachep, vma); > -} > - > static void account_kernel_stack(struct task_struct *tsk, int account) > { > if (IS_ENABLED(CONFIG_VMAP_STACK)) { > @@ -3033,11 +2954,6 @@ void __init mm_cache_init(void) > > void __init proc_caches_init(void) > { > - struct kmem_cache_args args = { > - .use_freeptr_offset = true, > - .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr), > - }; > - > sighand_cachep = kmem_cache_create("sighand_cache", > sizeof(struct sighand_struct), 0, > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| > @@ -3054,10 +2970,6 @@ void __init proc_caches_init(void) > sizeof(struct fs_struct), 0, > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, > NULL); > - vm_area_cachep = kmem_cache_create("vm_area_struct", > - sizeof(struct vm_area_struct), &args, > - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| > - SLAB_ACCOUNT); > mmap_init(); > nsproxy_cache_init(); > } > diff --git a/mm/Makefile b/mm/Makefile > index 15a901bb431a..690ddcf7d9a1 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -55,7 +55,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ > mm_init.o percpu.o slab_common.o \ > compaction.o show_mem.o \ > interval_tree.o list_lru.o workingset.o \ > - debug.o gup.o mmap_lock.o $(mmu-y) > + debug.o gup.o mmap_lock.o vma_init.o $(mmu-y) > > # Give 'page_alloc' its own module-parameter namespace > page-alloc-y := page_alloc.o > diff --git a/mm/mmap.c b/mm/mmap.c > index 5259df031e15..81dd962a1cfc 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1554,7 +1554,7 @@ static const struct ctl_table mmap_table[] = { > #endif /* CONFIG_SYSCTL */ > > /* > - * initialise the percpu counter for VM > + * initialise the percpu counter for VM, initialise VMA state. > */ > void __init mmap_init(void) > { > @@ -1565,6 +1565,7 @@ void __init mmap_init(void) > #ifdef CONFIG_SYSCTL > register_sysctl_init("vm", mmap_table); > #endif > + vma_state_init(); > } > > /* > diff --git a/mm/nommu.c b/mm/nommu.c > index a142fc258d39..0bf4849b8204 100644 > --- a/mm/nommu.c > +++ b/mm/nommu.c > @@ -399,7 +399,8 @@ static const struct ctl_table nommu_table[] = { > }; > > /* > - * initialise the percpu counter for VM and region record slabs > + * initialise the percpu counter for VM and region record slabs, initialise VMA > + * state. > */ > void __init mmap_init(void) > { > @@ -409,6 +410,7 @@ void __init mmap_init(void) > VM_BUG_ON(ret); > vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC|SLAB_ACCOUNT); > register_sysctl_init("vm", nommu_table); > + vma_state_init(); > } > > /* > diff --git a/mm/vma.h b/mm/vma.h > index 94307a2e4ab6..4a1e1768ca46 100644 > --- a/mm/vma.h > +++ b/mm/vma.h > @@ -548,8 +548,15 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address); > > int __vm_munmap(unsigned long start, size_t len, bool unlock); > > + > int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma); > > +/* vma_init.h, shared between CONFIG_MMU and nommu. 
*/ > +void __init vma_state_init(void); > +struct vm_area_struct *vm_area_alloc(struct mm_struct *mm); > +struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig); > +void vm_area_free(struct vm_area_struct *vma); > + > /* vma_exec.h */ > #ifdef CONFIG_MMU > int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap, > diff --git a/mm/vma_init.c b/mm/vma_init.c > new file mode 100644 > index 000000000000..967ca8517986 > --- /dev/null > +++ b/mm/vma_init.c > @@ -0,0 +1,101 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > + > +/* > + * Functions for initialisaing, allocating, freeing and duplicating VMAs. Shared > + * between CONFIG_MMU and non-CONFIG_MMU kernel configurations. > + */ > + > +#include "vma_internal.h" > +#include "vma.h" > + > +/* SLAB cache for vm_area_struct structures */ > +static struct kmem_cache *vm_area_cachep; > + > +void __init vma_state_init(void) > +{ > + struct kmem_cache_args args = { > + .use_freeptr_offset = true, > + .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr), > + }; > + > + vm_area_cachep = kmem_cache_create("vm_area_struct", > + sizeof(struct vm_area_struct), &args, > + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| > + SLAB_ACCOUNT); > +} > + > +struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) > +{ > + struct vm_area_struct *vma; > + > + vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > + if (!vma) > + return NULL; > + > + vma_init(vma, mm); > + > + return vma; > +} > + > +static void vm_area_init_from(const struct vm_area_struct *src, > + struct vm_area_struct *dest) > +{ > + dest->vm_mm = src->vm_mm; > + dest->vm_ops = src->vm_ops; > + dest->vm_start = src->vm_start; > + dest->vm_end = src->vm_end; > + dest->anon_vma = src->anon_vma; > + dest->vm_pgoff = src->vm_pgoff; > + dest->vm_file = src->vm_file; > + dest->vm_private_data = src->vm_private_data; > + vm_flags_init(dest, src->vm_flags); > + memcpy(&dest->vm_page_prot, &src->vm_page_prot, > + sizeof(dest->vm_page_prot)); > + /* > + * src->shared.rb may be modified concurrently when called from > + * dup_mmap(), but the clone will reinitialize it. > + */ > + data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared))); > + memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx, > + sizeof(dest->vm_userfaultfd_ctx)); > +#ifdef CONFIG_ANON_VMA_NAME > + dest->anon_name = src->anon_name; > +#endif > +#ifdef CONFIG_SWAP > + memcpy(&dest->swap_readahead_info, &src->swap_readahead_info, > + sizeof(dest->swap_readahead_info)); > +#endif > +#ifndef CONFIG_MMU > + dest->vm_region = src->vm_region; > +#endif > +#ifdef CONFIG_NUMA > + dest->vm_policy = src->vm_policy; > +#endif > +} > + > +struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) > +{ > + struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > + > + if (!new) > + return NULL; > + > + ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); > + ASSERT_EXCLUSIVE_WRITER(orig->vm_file); > + vm_area_init_from(orig, new); > + vma_lock_init(new, true); > + INIT_LIST_HEAD(&new->anon_vma_chain); > + vma_numab_state_init(new); > + dup_anon_vma_name(orig, new); > + > + return new; > +} > + > +void vm_area_free(struct vm_area_struct *vma) > +{ > + /* The vma should be detached while being destroyed. 
*/ > + vma_assert_detached(vma); > + vma_numab_state_free(vma); > + free_anon_vma_name(vma); > + kmem_cache_free(vm_area_cachep, vma); > +} > diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile > index 624040fcf193..66f3831a668f 100644 > --- a/tools/testing/vma/Makefile > +++ b/tools/testing/vma/Makefile > @@ -9,7 +9,7 @@ include ../shared/shared.mk > OFILES = $(SHARED_OFILES) vma.o maple-shim.o > TARGETS = vma > > -vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_exec.c ../../../mm/vma.h > +vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_init.c ../../../mm/vma_exec.c ../../../mm/vma.h > > vma: $(OFILES) > $(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS) > diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c > index 5832ae5d797d..2be7597a2ac2 100644 > --- a/tools/testing/vma/vma.c > +++ b/tools/testing/vma/vma.c > @@ -28,6 +28,7 @@ unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT; > * Directly import the VMA implementation here. Our vma_internal.h wrapper > * provides userland-equivalent functionality for everything vma.c uses. > */ > +#include "../../../mm/vma_init.c" > #include "../../../mm/vma_exec.c" > #include "../../../mm/vma.c" > > @@ -91,6 +92,12 @@ static int attach_vma(struct mm_struct *mm, struct vm_area_struct *vma) > return res; > } > > +static void detach_free_vma(struct vm_area_struct *vma) > +{ > + vma_mark_detached(vma); > + vm_area_free(vma); > +} > + > /* Helper function to allocate a VMA and link it to the tree. */ > static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm, > unsigned long start, > @@ -104,7 +111,7 @@ static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm, > return NULL; > > if (attach_vma(mm, vma)) { > - vm_area_free(vma); > + detach_free_vma(vma); > return NULL; > } > > @@ -249,7 +256,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi) > > vma_iter_set(vmi, 0); > for_each_vma(*vmi, vma) { > - vm_area_free(vma); > + detach_free_vma(vma); > count++; > } > > @@ -319,7 +326,7 @@ static bool test_simple_merge(void) > ASSERT_EQ(vma->vm_pgoff, 0); > ASSERT_EQ(vma->vm_flags, flags); > > - vm_area_free(vma); > + detach_free_vma(vma); > mtree_destroy(&mm.mm_mt); > > return true; > @@ -361,7 +368,7 @@ static bool test_simple_modify(void) > ASSERT_EQ(vma->vm_end, 0x1000); > ASSERT_EQ(vma->vm_pgoff, 0); > > - vm_area_free(vma); > + detach_free_vma(vma); > vma_iter_clear(&vmi); > > vma = vma_next(&vmi); > @@ -370,7 +377,7 @@ static bool test_simple_modify(void) > ASSERT_EQ(vma->vm_end, 0x2000); > ASSERT_EQ(vma->vm_pgoff, 1); > > - vm_area_free(vma); > + detach_free_vma(vma); > vma_iter_clear(&vmi); > > vma = vma_next(&vmi); > @@ -379,7 +386,7 @@ static bool test_simple_modify(void) > ASSERT_EQ(vma->vm_end, 0x3000); > ASSERT_EQ(vma->vm_pgoff, 2); > > - vm_area_free(vma); > + detach_free_vma(vma); > mtree_destroy(&mm.mm_mt); > > return true; > @@ -407,7 +414,7 @@ static bool test_simple_expand(void) > ASSERT_EQ(vma->vm_end, 0x3000); > ASSERT_EQ(vma->vm_pgoff, 0); > > - vm_area_free(vma); > + detach_free_vma(vma); > mtree_destroy(&mm.mm_mt); > > return true; > @@ -428,7 +435,7 @@ static bool test_simple_shrink(void) > ASSERT_EQ(vma->vm_end, 0x1000); > ASSERT_EQ(vma->vm_pgoff, 0); > > - vm_area_free(vma); > + detach_free_vma(vma); > mtree_destroy(&mm.mm_mt); > > return true; > @@ -619,7 +626,7 @@ static bool test_merge_new(void) > ASSERT_EQ(vma->vm_pgoff, 0); > ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); > > - vm_area_free(vma); > + detach_free_vma(vma); > count++; > } > > 
@@ -1668,6 +1675,7 @@ int main(void) > int num_tests = 0, num_fail = 0; > > maple_tree_init(); > + vma_state_init(); > > #define TEST(name) \ > do { \ > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h > index 32e990313158..198abe66de5a 100644 > --- a/tools/testing/vma/vma_internal.h > +++ b/tools/testing/vma/vma_internal.h > @@ -155,6 +155,10 @@ typedef __bitwise unsigned int vm_fault_t; > */ > #define pr_warn_once pr_err > > +#define data_race(expr) expr > + > +#define ASSERT_EXCLUSIVE_WRITER(x) > + > struct kref { > refcount_t refcount; > }; > @@ -255,6 +259,8 @@ struct file { > > #define VMA_LOCK_OFFSET 0x40000000 > > +typedef struct { unsigned long v; } freeptr_t; > + > struct vm_area_struct { > /* The first cache line has the info for VMA tree walking. */ > > @@ -264,9 +270,7 @@ struct vm_area_struct { > unsigned long vm_start; > unsigned long vm_end; > }; > -#ifdef CONFIG_PER_VMA_LOCK > - struct rcu_head vm_rcu; /* Used for deferred freeing. */ > -#endif > + freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */ > }; > > struct mm_struct *vm_mm; /* The address space we belong to. */ > @@ -463,6 +467,65 @@ struct pagetable_move_control { > .len_in = len_, \ > } > > +struct kmem_cache_args { > + /** > + * @align: The required alignment for the objects. > + * > + * %0 means no specific alignment is requested. > + */ > + unsigned int align; > + /** > + * @useroffset: Usercopy region offset. > + * > + * %0 is a valid offset, when @usersize is non-%0 > + */ > + unsigned int useroffset; > + /** > + * @usersize: Usercopy region size. > + * > + * %0 means no usercopy region is specified. > + */ > + unsigned int usersize; > + /** > + * @freeptr_offset: Custom offset for the free pointer > + * in &SLAB_TYPESAFE_BY_RCU caches > + * > + * By default &SLAB_TYPESAFE_BY_RCU caches place the free pointer > + * outside of the object. This might cause the object to grow in size. > + * Cache creators that have a reason to avoid this can specify a custom > + * free pointer offset in their struct where the free pointer will be > + * placed. > + * > + * Note that placing the free pointer inside the object requires the > + * caller to ensure that no fields are invalidated that are required to > + * guard against object recycling (See &SLAB_TYPESAFE_BY_RCU for > + * details). > + * > + * Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset > + * is specified, %use_freeptr_offset must be set %true. > + * > + * Note that @ctor currently isn't supported with custom free pointers > + * as a @ctor requires an external free pointer. > + */ > + unsigned int freeptr_offset; > + /** > + * @use_freeptr_offset: Whether a @freeptr_offset is used. > + */ > + bool use_freeptr_offset; > + /** > + * @ctor: A constructor for the objects. > + * > + * The constructor is invoked for each object in a newly allocated slab > + * page. It is the cache user's responsibility to free object in the > + * same state as after calling the constructor, or deal appropriately > + * with any differences between a freshly constructed and a reallocated > + * object. > + * > + * %NULL means no constructor. 
> + */ > + void (*ctor)(void *); > +}; > + > static inline void vma_iter_invalidate(struct vma_iterator *vmi) > { > mas_pause(&vmi->mas); > @@ -547,31 +610,38 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) > vma->vm_lock_seq = UINT_MAX; > } > > -static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) > -{ > - struct vm_area_struct *vma = calloc(1, sizeof(struct vm_area_struct)); > +struct kmem_cache { > + const char *name; > + size_t object_size; > + struct kmem_cache_args *args; > +}; > > - if (!vma) > - return NULL; > +static inline struct kmem_cache *__kmem_cache_create(const char *name, > + size_t object_size, > + struct kmem_cache_args *args) > +{ > + struct kmem_cache *ret = malloc(sizeof(struct kmem_cache)); > > - vma_init(vma, mm); > + ret->name = name; > + ret->object_size = object_size; > + ret->args = args; > > - return vma; > + return ret; > } > > -static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) > -{ > - struct vm_area_struct *new = calloc(1, sizeof(struct vm_area_struct)); > +#define kmem_cache_create(__name, __object_size, __args, ...) \ > + __kmem_cache_create((__name), (__object_size), (__args)) > > - if (!new) > - return NULL; > +static inline void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) > +{ > + (void)gfpflags; > > - memcpy(new, orig, sizeof(*new)); > - refcount_set(&new->vm_refcnt, 0); > - new->vm_lock_seq = UINT_MAX; > - INIT_LIST_HEAD(&new->anon_vma_chain); > + return calloc(s->object_size, 1); > +} > > - return new; > +static inline void kmem_cache_free(struct kmem_cache *s, void *x) > +{ > + free(x); > } > > /* > @@ -738,11 +808,6 @@ static inline void mpol_put(struct mempolicy *) > { > } > > -static inline void vm_area_free(struct vm_area_struct *vma) > -{ > - free(vma); > -} > - > static inline void lru_add_drain(void) > { > } > @@ -1312,4 +1377,32 @@ static inline void ksm_exit(struct mm_struct *mm) > (void)mm; > } > > +static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) > +{ > + (void)vma; > + (void)reset_refcnt; > +} > + > +static inline void vma_numab_state_init(struct vm_area_struct *vma) > +{ > + (void)vma; > +} > + > +static inline void vma_numab_state_free(struct vm_area_struct *vma) > +{ > + (void)vma; > +} > + > +static inline void dup_anon_vma_name(struct vm_area_struct *orig_vma, > + struct vm_area_struct *new_vma) > +{ > + (void)orig_vma; > + (void)new_vma; > +} > + > +static inline void free_anon_vma_name(struct vm_area_struct *vma) > +{ > + (void)vma; > +} > + > #endif /* __MM_VMA_INTERNAL_H */ > -- > 2.49.0 > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm 2025-04-28 15:28 ` [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes 2025-04-28 19:14 ` Liam R. Howlett 2025-04-28 20:29 ` Lorenzo Stoakes @ 2025-04-29 7:22 ` Vlastimil Babka 2025-04-30 9:20 ` Lorenzo Stoakes 2025-04-29 15:04 ` Suren Baghdasaryan ` (3 subsequent siblings) 6 siblings, 1 reply; 36+ messages in thread From: Vlastimil Babka @ 2025-04-29 7:22 UTC (permalink / raw) To: Lorenzo Stoakes, Andrew Morton Cc: Liam R . Howlett, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On 4/28/25 17:28, Lorenzo Stoakes wrote: > Right now these are performed in kernel/fork.c which is odd and a violation > of separation of concerns, as well as preventing us from integrating this > and related logic into userland VMA testing going forward, and perhaps more > importantly - enabling us to, in a subsequent commit, make VMA > allocation/freeing a purely internal mm operation. I wonder if the last part is from an earlier version and now obsolete because there's not subsequent commit in this series and the placement of alloc/freeing in vma_init.c seems making those purely internal mm operations already? Or do you mean some further plans? > There is a fly in the ointment - nommu - mmap.c is not compiled if > CONFIG_MMU not set, and neither is vma.c. > > To square the circle, let's add a new file - vma_init.c. This will be > compiled for both CONFIG_MMU and nommu builds, and will also form part of > the VMA userland testing. > > This allows us to de-duplicate code, while maintaining separation of > concerns and the ability for us to userland test this logic. > > Update the VMA userland tests accordingly, additionally adding a > detach_free_vma() helper function to correctly detach VMAs before freeing > them in test code, as this change was triggering the assert for this. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm 2025-04-29 7:22 ` Vlastimil Babka @ 2025-04-30 9:20 ` Lorenzo Stoakes 2025-04-30 21:42 ` Andrew Morton 0 siblings, 1 reply; 36+ messages in thread From: Lorenzo Stoakes @ 2025-04-30 9:20 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Liam R . Howlett, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Tue, Apr 29, 2025 at 09:22:59AM +0200, Vlastimil Babka wrote: > On 4/28/25 17:28, Lorenzo Stoakes wrote: > > Right now these are performed in kernel/fork.c which is odd and a violation > > of separation of concerns, as well as preventing us from integrating this > > and related logic into userland VMA testing going forward, and perhaps more > > importantly - enabling us to, in a subsequent commit, make VMA > > allocation/freeing a purely internal mm operation. > > I wonder if the last part is from an earlier version and now obsolete > because there's not subsequent commit in this series and the placement of > alloc/freeing in vma_init.c seems making those purely internal mm operations > already? Or do you mean some further plans? > Sorry, missed this! Andrew - could we delete the last part of this sentence so it reads: Right now these are performed in kernel/fork.c which is odd and a violation of separation of concerns, as well as preventing us from integrating this and related logic into userland VMA testing going forward. Thanks! ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm 2025-04-30 9:20 ` Lorenzo Stoakes @ 2025-04-30 21:42 ` Andrew Morton 2025-05-01 10:38 ` Lorenzo Stoakes 0 siblings, 1 reply; 36+ messages in thread From: Andrew Morton @ 2025-04-30 21:42 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Vlastimil Babka, Liam R . Howlett, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Wed, 30 Apr 2025 10:20:10 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > On Tue, Apr 29, 2025 at 09:22:59AM +0200, Vlastimil Babka wrote: > > On 4/28/25 17:28, Lorenzo Stoakes wrote: > > > Right now these are performed in kernel/fork.c which is odd and a violation > > > of separation of concerns, as well as preventing us from integrating this > > > and related logic into userland VMA testing going forward, and perhaps more > > > importantly - enabling us to, in a subsequent commit, make VMA > > > allocation/freeing a purely internal mm operation. > > > > I wonder if the last part is from an earlier version and now obsolete > > because there's not subsequent commit in this series and the placement of > > alloc/freeing in vma_init.c seems making those purely internal mm operations > > already? Or do you mean some further plans? > > > > Sorry, missed this! > > Andrew - could we delete the last part of this sentence so it reads: > > Right now these are performed in kernel/fork.c which is odd and a violation > of separation of concerns, as well as preventing us from integrating this > and related logic into userland VMA testing going forward. Sure. The result: : Right now these are performed in kernel/fork.c which is odd and a : violation of separation of concerns, as well as preventing us from : integrating this and related logic into userland VMA testing going : forward. : : There is a fly in the ointment - nommu - mmap.c is not compiled if : CONFIG_MMU not set, and neither is vma.c. : : To square the circle, let's add a new file - vma_init.c. This will be : compiled for both CONFIG_MMU and nommu builds, and will also form part of : the VMA userland testing. : : This allows us to de-duplicate code, while maintaining separation of : concerns and the ability for us to userland test this logic. : : Update the VMA userland tests accordingly, additionally adding a : detach_free_vma() helper function to correctly detach VMAs before freeing : them in test code, as this change was triggering the assert for this. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm 2025-04-30 21:42 ` Andrew Morton @ 2025-05-01 10:38 ` Lorenzo Stoakes 0 siblings, 0 replies; 36+ messages in thread From: Lorenzo Stoakes @ 2025-05-01 10:38 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Liam R . Howlett, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Wed, Apr 30, 2025 at 02:42:36PM -0700, Andrew Morton wrote: > On Wed, 30 Apr 2025 10:20:10 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > > On Tue, Apr 29, 2025 at 09:22:59AM +0200, Vlastimil Babka wrote: > > > On 4/28/25 17:28, Lorenzo Stoakes wrote: > > > > Right now these are performed in kernel/fork.c which is odd and a violation > > > > of separation of concerns, as well as preventing us from integrating this > > > > and related logic into userland VMA testing going forward, and perhaps more > > > > importantly - enabling us to, in a subsequent commit, make VMA > > > > allocation/freeing a purely internal mm operation. > > > > > > I wonder if the last part is from an earlier version and now obsolete > > > because there's not subsequent commit in this series and the placement of > > > alloc/freeing in vma_init.c seems making those purely internal mm operations > > > already? Or do you mean some further plans? > > > > > > > Sorry, missed this! > > > > Andrew - could we delete the last part of this sentence so it reads: > > > > Right now these are performed in kernel/fork.c which is odd and a violation > > of separation of concerns, as well as preventing us from integrating this > > and related logic into userland VMA testing going forward. > > Sure. The result: > > : Right now these are performed in kernel/fork.c which is odd and a > : violation of separation of concerns, as well as preventing us from > : integrating this and related logic into userland VMA testing going > : forward. > : > : There is a fly in the ointment - nommu - mmap.c is not compiled if > : CONFIG_MMU not set, and neither is vma.c. > : > : To square the circle, let's add a new file - vma_init.c. This will be > : compiled for both CONFIG_MMU and nommu builds, and will also form part of > : the VMA userland testing. > : > : This allows us to de-duplicate code, while maintaining separation of > : concerns and the ability for us to userland test this logic. > : > : Update the VMA userland tests accordingly, additionally adding a > : detach_free_vma() helper function to correctly detach VMAs before freeing > : them in test code, as this change was triggering the assert for this. > Perfect, thanks! ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm 2025-04-28 15:28 ` [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes ` (2 preceding siblings ...) 2025-04-29 7:22 ` Vlastimil Babka @ 2025-04-29 15:04 ` Suren Baghdasaryan 2025-04-29 16:54 ` Kees Cook ` (2 subsequent siblings) 6 siblings, 0 replies; 36+ messages in thread From: Suren Baghdasaryan @ 2025-04-29 15:04 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-fsdevel, linux-kernel On Mon, Apr 28, 2025 at 8:31 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > Right now these are performed in kernel/fork.c which is odd and a violation > of separation of concerns, as well as preventing us from integrating this > and related logic into userland VMA testing going forward, and perhaps more > importantly - enabling us to, in a subsequent commit, make VMA > allocation/freeing a purely internal mm operation. > > There is a fly in the ointment - nommu - mmap.c is not compiled if > CONFIG_MMU not set, and neither is vma.c. > > To square the circle, let's add a new file - vma_init.c. This will be > compiled for both CONFIG_MMU and nommu builds, and will also form part of > the VMA userland testing. > > This allows us to de-duplicate code, while maintaining separation of > concerns and the ability for us to userland test this logic. > > Update the VMA userland tests accordingly, additionally adding a > detach_free_vma() helper function to correctly detach VMAs before freeing > them in test code, as this change was triggering the assert for this. Great! We are getting closer to parity between tests and the kernel code. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> > --- > MAINTAINERS | 1 + > kernel/fork.c | 88 ------------------- > mm/Makefile | 2 +- > mm/mmap.c | 3 +- > mm/nommu.c | 4 +- > mm/vma.h | 7 ++ > mm/vma_init.c | 101 ++++++++++++++++++++++ > tools/testing/vma/Makefile | 2 +- > tools/testing/vma/vma.c | 26 ++++-- > tools/testing/vma/vma_internal.h | 143 +++++++++++++++++++++++++------ > 10 files changed, 251 insertions(+), 126 deletions(-) > create mode 100644 mm/vma_init.c > > diff --git a/MAINTAINERS b/MAINTAINERS > index 1ee1c22e6e36..d274e6802ba5 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -15656,6 +15656,7 @@ F: mm/mseal.c > F: mm/vma.c > F: mm/vma.h > F: mm/vma_exec.c > +F: mm/vma_init.c > F: mm/vma_internal.h > F: tools/testing/selftests/mm/merge.c > F: tools/testing/vma/ > diff --git a/kernel/fork.c b/kernel/fork.c > index ac9f9267a473..9e4616dacd82 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -431,88 +431,9 @@ struct kmem_cache *files_cachep; > /* SLAB cache for fs_struct structures (tsk->fs) */ > struct kmem_cache *fs_cachep; > > -/* SLAB cache for vm_area_struct structures */ > -static struct kmem_cache *vm_area_cachep; > - > /* SLAB cache for mm_struct structures (tsk->mm) */ > static struct kmem_cache *mm_cachep; Maybe at some point we will be able to move mm_cachep out of here as well? 
> > -struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) > -{ > - struct vm_area_struct *vma; > - > - vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > - if (!vma) > - return NULL; > - > - vma_init(vma, mm); > - > - return vma; > -} > - > -static void vm_area_init_from(const struct vm_area_struct *src, > - struct vm_area_struct *dest) > -{ > - dest->vm_mm = src->vm_mm; > - dest->vm_ops = src->vm_ops; > - dest->vm_start = src->vm_start; > - dest->vm_end = src->vm_end; > - dest->anon_vma = src->anon_vma; > - dest->vm_pgoff = src->vm_pgoff; > - dest->vm_file = src->vm_file; > - dest->vm_private_data = src->vm_private_data; > - vm_flags_init(dest, src->vm_flags); > - memcpy(&dest->vm_page_prot, &src->vm_page_prot, > - sizeof(dest->vm_page_prot)); > - /* > - * src->shared.rb may be modified concurrently when called from > - * dup_mmap(), but the clone will reinitialize it. > - */ > - data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared))); > - memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx, > - sizeof(dest->vm_userfaultfd_ctx)); > -#ifdef CONFIG_ANON_VMA_NAME > - dest->anon_name = src->anon_name; > -#endif > -#ifdef CONFIG_SWAP > - memcpy(&dest->swap_readahead_info, &src->swap_readahead_info, > - sizeof(dest->swap_readahead_info)); > -#endif > -#ifndef CONFIG_MMU > - dest->vm_region = src->vm_region; > -#endif > -#ifdef CONFIG_NUMA > - dest->vm_policy = src->vm_policy; > -#endif > -} > - > -struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) > -{ > - struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > - > - if (!new) > - return NULL; > - > - ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); > - ASSERT_EXCLUSIVE_WRITER(orig->vm_file); > - vm_area_init_from(orig, new); > - vma_lock_init(new, true); > - INIT_LIST_HEAD(&new->anon_vma_chain); > - vma_numab_state_init(new); > - dup_anon_vma_name(orig, new); > - > - return new; > -} > - > -void vm_area_free(struct vm_area_struct *vma) > -{ > - /* The vma should be detached while being destroyed. 
*/ > - vma_assert_detached(vma); > - vma_numab_state_free(vma); > - free_anon_vma_name(vma); > - kmem_cache_free(vm_area_cachep, vma); > -} > - > static void account_kernel_stack(struct task_struct *tsk, int account) > { > if (IS_ENABLED(CONFIG_VMAP_STACK)) { > @@ -3033,11 +2954,6 @@ void __init mm_cache_init(void) > > void __init proc_caches_init(void) > { > - struct kmem_cache_args args = { > - .use_freeptr_offset = true, > - .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr), > - }; > - > sighand_cachep = kmem_cache_create("sighand_cache", > sizeof(struct sighand_struct), 0, > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| > @@ -3054,10 +2970,6 @@ void __init proc_caches_init(void) > sizeof(struct fs_struct), 0, > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, > NULL); > - vm_area_cachep = kmem_cache_create("vm_area_struct", > - sizeof(struct vm_area_struct), &args, > - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| > - SLAB_ACCOUNT); > mmap_init(); > nsproxy_cache_init(); > } > diff --git a/mm/Makefile b/mm/Makefile > index 15a901bb431a..690ddcf7d9a1 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -55,7 +55,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ > mm_init.o percpu.o slab_common.o \ > compaction.o show_mem.o \ > interval_tree.o list_lru.o workingset.o \ > - debug.o gup.o mmap_lock.o $(mmu-y) > + debug.o gup.o mmap_lock.o vma_init.o $(mmu-y) > > # Give 'page_alloc' its own module-parameter namespace > page-alloc-y := page_alloc.o > diff --git a/mm/mmap.c b/mm/mmap.c > index 5259df031e15..81dd962a1cfc 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1554,7 +1554,7 @@ static const struct ctl_table mmap_table[] = { > #endif /* CONFIG_SYSCTL */ > > /* > - * initialise the percpu counter for VM > + * initialise the percpu counter for VM, initialise VMA state. > */ > void __init mmap_init(void) > { > @@ -1565,6 +1565,7 @@ void __init mmap_init(void) > #ifdef CONFIG_SYSCTL > register_sysctl_init("vm", mmap_table); > #endif > + vma_state_init(); > } > > /* > diff --git a/mm/nommu.c b/mm/nommu.c > index a142fc258d39..0bf4849b8204 100644 > --- a/mm/nommu.c > +++ b/mm/nommu.c > @@ -399,7 +399,8 @@ static const struct ctl_table nommu_table[] = { > }; > > /* > - * initialise the percpu counter for VM and region record slabs > + * initialise the percpu counter for VM and region record slabs, initialise VMA > + * state. > */ > void __init mmap_init(void) > { > @@ -409,6 +410,7 @@ void __init mmap_init(void) > VM_BUG_ON(ret); > vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC|SLAB_ACCOUNT); > register_sysctl_init("vm", nommu_table); > + vma_state_init(); > } > > /* > diff --git a/mm/vma.h b/mm/vma.h > index 94307a2e4ab6..4a1e1768ca46 100644 > --- a/mm/vma.h > +++ b/mm/vma.h > @@ -548,8 +548,15 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address); > > int __vm_munmap(unsigned long start, size_t len, bool unlock); > > + > int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma); > > +/* vma_init.h, shared between CONFIG_MMU and nommu. 
*/ > +void __init vma_state_init(void); > +struct vm_area_struct *vm_area_alloc(struct mm_struct *mm); > +struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig); > +void vm_area_free(struct vm_area_struct *vma); > + > /* vma_exec.h */ > #ifdef CONFIG_MMU > int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap, > diff --git a/mm/vma_init.c b/mm/vma_init.c > new file mode 100644 > index 000000000000..967ca8517986 > --- /dev/null > +++ b/mm/vma_init.c > @@ -0,0 +1,101 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > + > +/* > + * Functions for initialisaing, allocating, freeing and duplicating VMAs. Shared > + * between CONFIG_MMU and non-CONFIG_MMU kernel configurations. > + */ > + > +#include "vma_internal.h" > +#include "vma.h" > + > +/* SLAB cache for vm_area_struct structures */ > +static struct kmem_cache *vm_area_cachep; > + > +void __init vma_state_init(void) > +{ > + struct kmem_cache_args args = { > + .use_freeptr_offset = true, > + .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr), > + }; > + > + vm_area_cachep = kmem_cache_create("vm_area_struct", > + sizeof(struct vm_area_struct), &args, > + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| > + SLAB_ACCOUNT); > +} > + > +struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) > +{ > + struct vm_area_struct *vma; > + > + vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > + if (!vma) > + return NULL; > + > + vma_init(vma, mm); > + > + return vma; > +} > + > +static void vm_area_init_from(const struct vm_area_struct *src, > + struct vm_area_struct *dest) > +{ > + dest->vm_mm = src->vm_mm; > + dest->vm_ops = src->vm_ops; > + dest->vm_start = src->vm_start; > + dest->vm_end = src->vm_end; > + dest->anon_vma = src->anon_vma; > + dest->vm_pgoff = src->vm_pgoff; > + dest->vm_file = src->vm_file; > + dest->vm_private_data = src->vm_private_data; > + vm_flags_init(dest, src->vm_flags); > + memcpy(&dest->vm_page_prot, &src->vm_page_prot, > + sizeof(dest->vm_page_prot)); > + /* > + * src->shared.rb may be modified concurrently when called from > + * dup_mmap(), but the clone will reinitialize it. > + */ > + data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared))); > + memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx, > + sizeof(dest->vm_userfaultfd_ctx)); > +#ifdef CONFIG_ANON_VMA_NAME > + dest->anon_name = src->anon_name; > +#endif > +#ifdef CONFIG_SWAP > + memcpy(&dest->swap_readahead_info, &src->swap_readahead_info, > + sizeof(dest->swap_readahead_info)); > +#endif > +#ifndef CONFIG_MMU > + dest->vm_region = src->vm_region; > +#endif > +#ifdef CONFIG_NUMA > + dest->vm_policy = src->vm_policy; > +#endif > +} > + > +struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) > +{ > + struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); > + > + if (!new) > + return NULL; > + > + ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); > + ASSERT_EXCLUSIVE_WRITER(orig->vm_file); > + vm_area_init_from(orig, new); > + vma_lock_init(new, true); > + INIT_LIST_HEAD(&new->anon_vma_chain); > + vma_numab_state_init(new); > + dup_anon_vma_name(orig, new); > + > + return new; > +} > + > +void vm_area_free(struct vm_area_struct *vma) > +{ > + /* The vma should be detached while being destroyed. 
*/ > + vma_assert_detached(vma); > + vma_numab_state_free(vma); > + free_anon_vma_name(vma); > + kmem_cache_free(vm_area_cachep, vma); > +} > diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile > index 624040fcf193..66f3831a668f 100644 > --- a/tools/testing/vma/Makefile > +++ b/tools/testing/vma/Makefile > @@ -9,7 +9,7 @@ include ../shared/shared.mk > OFILES = $(SHARED_OFILES) vma.o maple-shim.o > TARGETS = vma > > -vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_exec.c ../../../mm/vma.h > +vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_init.c ../../../mm/vma_exec.c ../../../mm/vma.h > > vma: $(OFILES) > $(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS) > diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c > index 5832ae5d797d..2be7597a2ac2 100644 > --- a/tools/testing/vma/vma.c > +++ b/tools/testing/vma/vma.c > @@ -28,6 +28,7 @@ unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT; > * Directly import the VMA implementation here. Our vma_internal.h wrapper > * provides userland-equivalent functionality for everything vma.c uses. > */ > +#include "../../../mm/vma_init.c" > #include "../../../mm/vma_exec.c" > #include "../../../mm/vma.c" > > @@ -91,6 +92,12 @@ static int attach_vma(struct mm_struct *mm, struct vm_area_struct *vma) > return res; > } > > +static void detach_free_vma(struct vm_area_struct *vma) In case you respin another version, I think this change related to detach_free_vma() would better be done in a separate patch. But I'm totally fine with the way it is now, so don't respin just for that. > +{ > + vma_mark_detached(vma); > + vm_area_free(vma); > +} > + > /* Helper function to allocate a VMA and link it to the tree. */ > static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm, > unsigned long start, > @@ -104,7 +111,7 @@ static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm, > return NULL; > > if (attach_vma(mm, vma)) { > - vm_area_free(vma); > + detach_free_vma(vma); > return NULL; > } > > @@ -249,7 +256,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi) > > vma_iter_set(vmi, 0); > for_each_vma(*vmi, vma) { > - vm_area_free(vma); > + detach_free_vma(vma); > count++; > } > > @@ -319,7 +326,7 @@ static bool test_simple_merge(void) > ASSERT_EQ(vma->vm_pgoff, 0); > ASSERT_EQ(vma->vm_flags, flags); > > - vm_area_free(vma); > + detach_free_vma(vma); > mtree_destroy(&mm.mm_mt); > > return true; > @@ -361,7 +368,7 @@ static bool test_simple_modify(void) > ASSERT_EQ(vma->vm_end, 0x1000); > ASSERT_EQ(vma->vm_pgoff, 0); > > - vm_area_free(vma); > + detach_free_vma(vma); > vma_iter_clear(&vmi); > > vma = vma_next(&vmi); > @@ -370,7 +377,7 @@ static bool test_simple_modify(void) > ASSERT_EQ(vma->vm_end, 0x2000); > ASSERT_EQ(vma->vm_pgoff, 1); > > - vm_area_free(vma); > + detach_free_vma(vma); > vma_iter_clear(&vmi); > > vma = vma_next(&vmi); > @@ -379,7 +386,7 @@ static bool test_simple_modify(void) > ASSERT_EQ(vma->vm_end, 0x3000); > ASSERT_EQ(vma->vm_pgoff, 2); > > - vm_area_free(vma); > + detach_free_vma(vma); > mtree_destroy(&mm.mm_mt); > > return true; > @@ -407,7 +414,7 @@ static bool test_simple_expand(void) > ASSERT_EQ(vma->vm_end, 0x3000); > ASSERT_EQ(vma->vm_pgoff, 0); > > - vm_area_free(vma); > + detach_free_vma(vma); > mtree_destroy(&mm.mm_mt); > > return true; > @@ -428,7 +435,7 @@ static bool test_simple_shrink(void) > ASSERT_EQ(vma->vm_end, 0x1000); > ASSERT_EQ(vma->vm_pgoff, 0); > > - vm_area_free(vma); > + detach_free_vma(vma); > mtree_destroy(&mm.mm_mt); > > return 
true; > @@ -619,7 +626,7 @@ static bool test_merge_new(void) > ASSERT_EQ(vma->vm_pgoff, 0); > ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); > > - vm_area_free(vma); > + detach_free_vma(vma); > count++; > } > > @@ -1668,6 +1675,7 @@ int main(void) > int num_tests = 0, num_fail = 0; > > maple_tree_init(); > + vma_state_init(); > > #define TEST(name) \ > do { \ > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h > index 32e990313158..198abe66de5a 100644 > --- a/tools/testing/vma/vma_internal.h > +++ b/tools/testing/vma/vma_internal.h > @@ -155,6 +155,10 @@ typedef __bitwise unsigned int vm_fault_t; > */ > #define pr_warn_once pr_err > > +#define data_race(expr) expr > + > +#define ASSERT_EXCLUSIVE_WRITER(x) > + > struct kref { > refcount_t refcount; > }; > @@ -255,6 +259,8 @@ struct file { > > #define VMA_LOCK_OFFSET 0x40000000 > > +typedef struct { unsigned long v; } freeptr_t; > + > struct vm_area_struct { > /* The first cache line has the info for VMA tree walking. */ > > @@ -264,9 +270,7 @@ struct vm_area_struct { > unsigned long vm_start; > unsigned long vm_end; > }; > -#ifdef CONFIG_PER_VMA_LOCK > - struct rcu_head vm_rcu; /* Used for deferred freeing. */ > -#endif > + freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */ Oops, I must have missed this when adding SLAB_TYPESAFE_BY_RCU to the vm_area_struct cache. Thanks for fixing it! > }; > > struct mm_struct *vm_mm; /* The address space we belong to. */ > @@ -463,6 +467,65 @@ struct pagetable_move_control { > .len_in = len_, \ > } > > +struct kmem_cache_args { > + /** > + * @align: The required alignment for the objects. > + * > + * %0 means no specific alignment is requested. > + */ > + unsigned int align; > + /** > + * @useroffset: Usercopy region offset. > + * > + * %0 is a valid offset, when @usersize is non-%0 > + */ > + unsigned int useroffset; > + /** > + * @usersize: Usercopy region size. > + * > + * %0 means no usercopy region is specified. > + */ > + unsigned int usersize; > + /** > + * @freeptr_offset: Custom offset for the free pointer > + * in &SLAB_TYPESAFE_BY_RCU caches > + * > + * By default &SLAB_TYPESAFE_BY_RCU caches place the free pointer > + * outside of the object. This might cause the object to grow in size. > + * Cache creators that have a reason to avoid this can specify a custom > + * free pointer offset in their struct where the free pointer will be > + * placed. > + * > + * Note that placing the free pointer inside the object requires the > + * caller to ensure that no fields are invalidated that are required to > + * guard against object recycling (See &SLAB_TYPESAFE_BY_RCU for > + * details). > + * > + * Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset > + * is specified, %use_freeptr_offset must be set %true. > + * > + * Note that @ctor currently isn't supported with custom free pointers > + * as a @ctor requires an external free pointer. > + */ > + unsigned int freeptr_offset; > + /** > + * @use_freeptr_offset: Whether a @freeptr_offset is used. > + */ > + bool use_freeptr_offset; > + /** > + * @ctor: A constructor for the objects. > + * > + * The constructor is invoked for each object in a newly allocated slab > + * page. It is the cache user's responsibility to free object in the > + * same state as after calling the constructor, or deal appropriately > + * with any differences between a freshly constructed and a reallocated > + * object. > + * > + * %NULL means no constructor. 
> + */ > + void (*ctor)(void *); > +}; > + > static inline void vma_iter_invalidate(struct vma_iterator *vmi) > { > mas_pause(&vmi->mas); > @@ -547,31 +610,38 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) > vma->vm_lock_seq = UINT_MAX; > } > > -static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) > -{ > - struct vm_area_struct *vma = calloc(1, sizeof(struct vm_area_struct)); > +struct kmem_cache { > + const char *name; > + size_t object_size; > + struct kmem_cache_args *args; > +}; > > - if (!vma) > - return NULL; > +static inline struct kmem_cache *__kmem_cache_create(const char *name, > + size_t object_size, > + struct kmem_cache_args *args) > +{ > + struct kmem_cache *ret = malloc(sizeof(struct kmem_cache)); > > - vma_init(vma, mm); > + ret->name = name; > + ret->object_size = object_size; > + ret->args = args; > > - return vma; > + return ret; > } > > -static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) > -{ > - struct vm_area_struct *new = calloc(1, sizeof(struct vm_area_struct)); > +#define kmem_cache_create(__name, __object_size, __args, ...) \ > + __kmem_cache_create((__name), (__object_size), (__args)) > > - if (!new) > - return NULL; > +static inline void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) > +{ > + (void)gfpflags; > > - memcpy(new, orig, sizeof(*new)); > - refcount_set(&new->vm_refcnt, 0); > - new->vm_lock_seq = UINT_MAX; > - INIT_LIST_HEAD(&new->anon_vma_chain); > + return calloc(s->object_size, 1); > +} > > - return new; > +static inline void kmem_cache_free(struct kmem_cache *s, void *x) > +{ > + free(x); > } > > /* > @@ -738,11 +808,6 @@ static inline void mpol_put(struct mempolicy *) > { > } > > -static inline void vm_area_free(struct vm_area_struct *vma) > -{ > - free(vma); > -} > - > static inline void lru_add_drain(void) > { > } > @@ -1312,4 +1377,32 @@ static inline void ksm_exit(struct mm_struct *mm) > (void)mm; > } > > +static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) > +{ > + (void)vma; > + (void)reset_refcnt; > +} > + > +static inline void vma_numab_state_init(struct vm_area_struct *vma) > +{ > + (void)vma; > +} > + > +static inline void vma_numab_state_free(struct vm_area_struct *vma) > +{ > + (void)vma; > +} > + > +static inline void dup_anon_vma_name(struct vm_area_struct *orig_vma, > + struct vm_area_struct *new_vma) > +{ > + (void)orig_vma; > + (void)new_vma; > +} > + > +static inline void free_anon_vma_name(struct vm_area_struct *vma) > +{ > + (void)vma; > +} > + > #endif /* __MM_VMA_INTERNAL_H */ > -- > 2.49.0 > ^ permalink raw reply [flat|nested] 36+ messages in thread
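A note on the test-side change reviewed above: once allocation moves into mm/vma_init.c, vm_area_free() runs vma_assert_detached(), so freeing a VMA in the userland harness becomes a two-step operation. The following is a minimal sketch of that pattern only - the function name example_free() is hypothetical, and it simply restates the detach_free_vma() helper the patch adds to tools/testing/vma/vma.c:

	static void example_free(struct vm_area_struct *vma)
	{
		/*
		 * vm_area_free() now asserts the VMA is detached, so test
		 * code must mark it detached before freeing it.
		 */
		vma_mark_detached(vma);
		vm_area_free(vma);
	}

Allocation itself is unchanged from the caller's point of view - vm_area_alloc() is still used - it is merely served from the vm_area_cachep slab cache created by vma_state_init(), which is why main() in the test harness now calls vma_state_init() alongside maple_tree_init().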
* Re: [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm 2025-04-28 15:28 ` [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes ` (3 preceding siblings ...) 2025-04-29 15:04 ` Suren Baghdasaryan @ 2025-04-29 16:54 ` Kees Cook 2025-04-29 17:24 ` David Hildenbrand 2025-04-29 17:47 ` Pedro Falcato 6 siblings, 0 replies; 36+ messages in thread From: Kees Cook @ 2025-04-29 16:54 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Mon, Apr 28, 2025 at 04:28:17PM +0100, Lorenzo Stoakes wrote: > Right now these are performed in kernel/fork.c which is odd and a violation > of separation of concerns, as well as preventing us from integrating this > and related logic into userland VMA testing going forward, and perhaps more > importantly - enabling us to, in a subsequent commit, make VMA > allocation/freeing a purely internal mm operation. > > There is a fly in the ointment - nommu - mmap.c is not compiled if > CONFIG_MMU not set, and neither is vma.c. > > To square the circle, let's add a new file - vma_init.c. This will be > compiled for both CONFIG_MMU and nommu builds, and will also form part of > the VMA userland testing. > > This allows us to de-duplicate code, while maintaining separation of > concerns and the ability for us to userland test this logic. > > Update the VMA userland tests accordingly, additionally adding a > detach_free_vma() helper function to correctly detach VMAs before freeing > them in test code, as this change was triggering the assert for this. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Kees Cook <kees@kernel.org> -- Kees Cook ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm 2025-04-28 15:28 ` [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes ` (4 preceding siblings ...) 2025-04-29 16:54 ` Kees Cook @ 2025-04-29 17:24 ` David Hildenbrand 2025-04-29 17:47 ` Pedro Falcato 6 siblings, 0 replies; 36+ messages in thread From: David Hildenbrand @ 2025-04-29 17:24 UTC (permalink / raw) To: Lorenzo Stoakes, Andrew Morton Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On 28.04.25 17:28, Lorenzo Stoakes wrote: > Right now these are performed in kernel/fork.c which is odd and a violation > of separation of concerns, as well as preventing us from integrating this > and related logic into userland VMA testing going forward, and perhaps more > importantly - enabling us to, in a subsequent commit, make VMA > allocation/freeing a purely internal mm operation. > > There is a fly in the ointment - nommu - mmap.c is not compiled if > CONFIG_MMU not set, and neither is vma.c. > > To square the circle, let's add a new file - vma_init.c. This will be > compiled for both CONFIG_MMU and nommu builds, and will also form part of > the VMA userland testing. > > This allows us to de-duplicate code, while maintaining separation of > concerns and the ability for us to userland test this logic. > > Update the VMA userland tests accordingly, additionally adding a > detach_free_vma() helper function to correctly detach VMAs before freeing > them in test code, as this change was triggering the assert for this. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > --- Reviewed-by: David Hildenbrand <david@redhat.com> -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm 2025-04-28 15:28 ` [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes ` (5 preceding siblings ...) 2025-04-29 17:24 ` David Hildenbrand @ 2025-04-29 17:47 ` Pedro Falcato 6 siblings, 0 replies; 36+ messages in thread From: Pedro Falcato @ 2025-04-29 17:47 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Mon, Apr 28, 2025 at 04:28:17PM +0100, Lorenzo Stoakes wrote: > Right now these are performed in kernel/fork.c which is odd and a violation > of separation of concerns, as well as preventing us from integrating this > and related logic into userland VMA testing going forward, and perhaps more > importantly - enabling us to, in a subsequent commit, make VMA > allocation/freeing a purely internal mm operation. > > There is a fly in the ointment - nommu - mmap.c is not compiled if > CONFIG_MMU not set, and neither is vma.c. > > To square the circle, let's add a new file - vma_init.c. This will be > compiled for both CONFIG_MMU and nommu builds, and will also form part of > the VMA userland testing. > > This allows us to de-duplicate code, while maintaining separation of > concerns and the ability for us to userland test this logic. > > Update the VMA userland tests accordingly, additionally adding a > detach_free_vma() helper function to correctly detach VMAs before freeing > them in test code, as this change was triggering the assert for this. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> -- Pedro ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 0/4] move all VMA allocation, freeing and duplication logic to mm 2025-04-28 15:28 [PATCH v3 0/4] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes ` (3 preceding siblings ...) 2025-04-28 15:28 ` [PATCH v3 4/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes @ 2025-04-29 7:28 ` Vlastimil Babka 2025-04-29 10:23 ` Lorenzo Stoakes 4 siblings, 1 reply; 36+ messages in thread From: Vlastimil Babka @ 2025-04-29 7:28 UTC (permalink / raw) To: Lorenzo Stoakes, Andrew Morton Cc: Liam R . Howlett, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On 4/28/25 17:28, Lorenzo Stoakes wrote: > Currently VMA allocation, freeing and duplication exist in kernel/fork.c, > which is a violation of separation of concerns, and leaves these functions > exposed to the rest of the kernel when they are in fact internal > implementation details. > > Resolve this by moving this logic to mm, and making it internal to vma.c, > vma.h. > > This also allows us, in future, to provide userland testing around this > functionality. > > We additionally abstract dup_mmap() to mm, being careful to ensure > kernel/fork.c acceses this via the mm internal header so it is not exposed > elsewhere in the kernel. > > As part of this change, also abstract initial stack allocation performed in > __bprm_mm_init() out of fs code into mm via the create_init_stack_vma(), as > this code uses vm_area_alloc() and vm_area_free(). > > In order to do so sensibly, we introduce a new mm/vma_exec.c file, which > contains the code that is shared by mm and exec. This file is added to both > memory mapping and exec sections in MAINTAINERS so both sets of maintainers > can maintain oversight. Note that kernel/fork.c itself belongs to no section. Maybe we could put it somewhere too, maybe also multiple subsystems? I'm thinking something between MM, SCHEDULER, EXEC, perhaps PIDFD? ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 0/4] move all VMA allocation, freeing and duplication logic to mm 2025-04-29 7:28 ` [PATCH v3 0/4] move all VMA allocation, freeing and duplication logic to mm Vlastimil Babka @ 2025-04-29 10:23 ` Lorenzo Stoakes 2025-04-29 16:55 ` Kees Cook 0 siblings, 1 reply; 36+ messages in thread From: Lorenzo Stoakes @ 2025-04-29 10:23 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Liam R . Howlett, Jann Horn, Pedro Falcato, David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Tue, Apr 29, 2025 at 09:28:05AM +0200, Vlastimil Babka wrote: > On 4/28/25 17:28, Lorenzo Stoakes wrote: > > Currently VMA allocation, freeing and duplication exist in kernel/fork.c, > > which is a violation of separation of concerns, and leaves these functions > > exposed to the rest of the kernel when they are in fact internal > > implementation details. > > > > Resolve this by moving this logic to mm, and making it internal to vma.c, > > vma.h. > > > > This also allows us, in future, to provide userland testing around this > > functionality. > > > > We additionally abstract dup_mmap() to mm, being careful to ensure > > kernel/fork.c acceses this via the mm internal header so it is not exposed > > elsewhere in the kernel. > > > > As part of this change, also abstract initial stack allocation performed in > > __bprm_mm_init() out of fs code into mm via the create_init_stack_vma(), as > > this code uses vm_area_alloc() and vm_area_free(). > > > > In order to do so sensibly, we introduce a new mm/vma_exec.c file, which > > contains the code that is shared by mm and exec. This file is added to both > > memory mapping and exec sections in MAINTAINERS so both sets of maintainers > > can maintain oversight. > > Note that kernel/fork.c itself belongs to no section. Maybe we could put it > somewhere too, maybe also multiple subsystems? I'm thinking something > between MM, SCHEDULER, EXEC, perhaps PIDFD? Thanks, indeed I was wondering about where this should be, and the fact we can put stuff in multiple places is actually pretty powerful! This is on my todo, will take a look at this. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 0/4] move all VMA allocation, freeing and duplication logic to mm 2025-04-29 10:23 ` Lorenzo Stoakes @ 2025-04-29 16:55 ` Kees Cook 0 siblings, 0 replies; 36+ messages in thread From: Kees Cook @ 2025-04-29 16:55 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Vlastimil Babka, Andrew Morton, Liam R . Howlett, Jann Horn, Pedro Falcato, David Hildenbrand, Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel On Tue, Apr 29, 2025 at 11:23:25AM +0100, Lorenzo Stoakes wrote: > On Tue, Apr 29, 2025 at 09:28:05AM +0200, Vlastimil Babka wrote: > > On 4/28/25 17:28, Lorenzo Stoakes wrote: > > > Currently VMA allocation, freeing and duplication exist in kernel/fork.c, > > > which is a violation of separation of concerns, and leaves these functions > > > exposed to the rest of the kernel when they are in fact internal > > > implementation details. > > > > > > Resolve this by moving this logic to mm, and making it internal to vma.c, > > > vma.h. > > > > > > This also allows us, in future, to provide userland testing around this > > > functionality. > > > > > > We additionally abstract dup_mmap() to mm, being careful to ensure > > > kernel/fork.c acceses this via the mm internal header so it is not exposed > > > elsewhere in the kernel. > > > > > > As part of this change, also abstract initial stack allocation performed in > > > __bprm_mm_init() out of fs code into mm via the create_init_stack_vma(), as > > > this code uses vm_area_alloc() and vm_area_free(). > > > > > > In order to do so sensibly, we introduce a new mm/vma_exec.c file, which > > > contains the code that is shared by mm and exec. This file is added to both > > > memory mapping and exec sections in MAINTAINERS so both sets of maintainers > > > can maintain oversight. > > > > Note that kernel/fork.c itself belongs to no section. Maybe we could put it > > somewhere too, maybe also multiple subsystems? I'm thinking something > > between MM, SCHEDULER, EXEC, perhaps PIDFD? > > Thanks, indeed I was wondering about where this should be, and the fact we can > put stuff in multiple places is actually pretty powerful! > > This is on my todo, will take a look at this. Yeah, I'd be interested in having fork.c multi-maintainer-sectioned with EXEC/BINFMT too, when the time comes. -- Kees Cook ^ permalink raw reply [flat|nested] 36+ messages in thread
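For reference, putting a file under more than one MAINTAINERS section needs no special machinery - the same F: pattern is simply repeated under each section, as this series already does for mm/vma_exec.c. A purely illustrative sketch of what a multi-section entry for kernel/fork.c could look like (section names are indicative only; check the current MAINTAINERS file for the exact headings):

	MEMORY MAPPING
	...
	F:	kernel/fork.c

	EXEC & BINFMT API
	...
	F:	kernel/fork.c

scripts/get_maintainer.pl then reports the maintainers and lists of every section whose file patterns match, so patches touching fork.c would be routed to all of the chosen subsystems.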