* [PATCH v2 0/3] move all VMA allocation, freeing and duplication logic to mm
@ 2025-04-25 14:54 Lorenzo Stoakes
  2025-04-25 14:54 ` [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 14:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
	David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner,
	Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
	linux-kernel

Currently, VMA allocation, freeing and duplication logic lives in
kernel/fork.c. This violates separation of concerns and leaves these
functions exposed to the rest of the kernel when they are in fact internal
implementation details.

Resolve this by moving the logic to mm and making it internal to vma.c and
vma.h.

This also allows us, in future, to provide userland testing around this
functionality.

We additionally abstract dup_mmap() to mm, being careful to ensure
kernel/fork.c accesses this via the mm internal header so it is not exposed
elsewhere in the kernel.

As part of this change, we also abstract the initial stack allocation
performed in __bprm_mm_init() out of fs code and into mm via
create_init_stack_vma(), as this code uses vm_area_alloc() and
vm_area_free().

To allow for this, we must share the VMA allocation, freeing and
duplication code between nommu and MMU-enabled configurations, while also
keeping these functions available to userland VMA testing.

This is achieved by adding a vma_init.c file which is also compiled by the
userland tests.
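
Concretely, the kernel builds the new file unconditionally (via obj-y in
mm/Makefile), while the userland test harness compiles the very same source
directly; both changes are visible in the patch 3 diffs below:

    /* tools/testing/vma/vma.c */
    #include "../../../mm/vma_init.c"
    #include "../../../mm/vma.c"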

v2:
* Moved vma init, alloc, free, dup functions to newly created vma_init.c
  file as per Suren, Liam.
* Added MAINTAINERS entry for vma_init.c, added to Makefile.
* Updated mmap_init() comment.
* Propagated tags (thanks everyone!)
* Added detach_free_vma() helper and correctly detached vmas in userland VMA
  test code.
* Updated userland test code to also compile the vma_init.c file.
* Corrected create_init_stack_vma() comment as per Suren.
* Updated commit message as per Suren.

v1:
https://lore.kernel.org/all/cover.1745528282.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (3):
  mm: abstract initial stack setup to mm subsystem
  mm: move dup_mmap() to mm
  mm: perform VMA allocation, freeing, duplication in mm

 MAINTAINERS                      |   1 +
 fs/exec.c                        |  51 +-----
 include/linux/mm.h               |   2 +
 kernel/fork.c                    | 277 +------------------------------
 mm/Makefile                      |   2 +-
 mm/internal.h                    |   2 +
 mm/mmap.c                        | 253 +++++++++++++++++++++++++++-
 mm/nommu.c                       |  12 +-
 mm/vma.h                         |   6 +
 mm/vma_init.c                    | 101 +++++++++++
 tools/testing/vma/Makefile       |   2 +-
 tools/testing/vma/vma.c          |  26 ++-
 tools/testing/vma/vma_internal.h | 143 +++++++++++++---
 13 files changed, 511 insertions(+), 367 deletions(-)
 create mode 100644 mm/vma_init.c

--
2.49.0


* [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem
  2025-04-25 14:54 [PATCH v2 0/3] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes
@ 2025-04-25 14:54 ` Lorenzo Stoakes
  2025-04-25 17:09   ` Kees Cook
  2025-04-25 14:54 ` [PATCH v2 2/3] mm: move dup_mmap() to mm Lorenzo Stoakes
  2025-04-25 14:54 ` [PATCH v2 3/3] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes
  2 siblings, 1 reply; 8+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 14:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
	David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner,
	Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
	linux-kernel

There are peculiarities within the kernel where what is very clearly mm
code is performed elsewhere arbitrarily.

This violates separation of concerns and makes it harder to refactor code
to make changes to how fundamental initialisation and operation of mm logic
is performed.

One such case is the creation of the VMA containing the initial stack upon
execve()'ing a new process. This is currently performed in __bprm_mm_init()
in fs/exec.c.

Abstract this operation to create_init_stack_vma(). This allows us to limit
use of vma allocation and free code to fork and mm only.

We previously did the same for the step at which we relocate the initial
stack VMA downwards via relocate_vma_down(); now we move the initial VMA
establishment too.
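
With this in place, __bprm_mm_init() reduces to a trivial wrapper, as the
diff below shows:

    static int __bprm_mm_init(struct linux_binprm *bprm)
    {
            return create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
    }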

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 fs/exec.c          | 51 +---------------------------------
 include/linux/mm.h |  2 ++
 mm/mmap.c          | 69 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 72 insertions(+), 50 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 8e4ea5f1e64c..ef34a68ef825 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -244,56 +244,7 @@ static void flush_arg_page(struct linux_binprm *bprm, unsigned long pos,
 
 static int __bprm_mm_init(struct linux_binprm *bprm)
 {
-	int err;
-	struct vm_area_struct *vma = NULL;
-	struct mm_struct *mm = bprm->mm;
-
-	bprm->vma = vma = vm_area_alloc(mm);
-	if (!vma)
-		return -ENOMEM;
-	vma_set_anonymous(vma);
-
-	if (mmap_write_lock_killable(mm)) {
-		err = -EINTR;
-		goto err_free;
-	}
-
-	/*
-	 * Need to be called with mmap write lock
-	 * held, to avoid race with ksmd.
-	 */
-	err = ksm_execve(mm);
-	if (err)
-		goto err_ksm;
-
-	/*
-	 * Place the stack at the largest stack address the architecture
-	 * supports. Later, we'll move this to an appropriate place. We don't
-	 * use STACK_TOP because that can depend on attributes which aren't
-	 * configured yet.
-	 */
-	BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
-	vma->vm_end = STACK_TOP_MAX;
-	vma->vm_start = vma->vm_end - PAGE_SIZE;
-	vm_flags_init(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP);
-	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
-
-	err = insert_vm_struct(mm, vma);
-	if (err)
-		goto err;
-
-	mm->stack_vm = mm->total_vm = 1;
-	mmap_write_unlock(mm);
-	bprm->p = vma->vm_end - sizeof(void *);
-	return 0;
-err:
-	ksm_exit(mm);
-err_ksm:
-	mmap_write_unlock(mm);
-err_free:
-	bprm->vma = NULL;
-	vm_area_free(vma);
-	return err;
+	return create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
 }
 
 static bool valid_arg_len(struct linux_binprm *bprm, long len)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9b701cfbef22..fa84e59a99bb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3223,6 +3223,8 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
 extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
 extern void exit_mmap(struct mm_struct *);
+int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap,
+			  unsigned long *top_mem_p);
 int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift);
 bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
 				 unsigned long addr, bool write);
diff --git a/mm/mmap.c b/mm/mmap.c
index bd210aaf7ebd..ec8572a93418 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1717,6 +1717,75 @@ static int __meminit init_reserve_notifier(void)
 }
 subsys_initcall(init_reserve_notifier);
 
+/*
+ * Establish the stack VMA in an execve'd process, located temporarily at the
+ * maximum stack address provided by the architecture.
+ *
+ * We later relocate this downwards in relocate_vma_down().
+ *
+ * This function is almost certainly NOT what you want for anything other than
+ * early executable initialisation.
+ *
+ * On success, returns 0 and sets *vmap to the stack VMA and *top_mem_p to the
+ * maximum addressable location in the stack (that is capable of storing a
+ * system word of data).
+ */
+int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap,
+			  unsigned long *top_mem_p)
+{
+	int err;
+	struct vm_area_struct *vma = vm_area_alloc(mm);
+
+	if (!vma)
+		return -ENOMEM;
+
+	vma_set_anonymous(vma);
+
+	if (mmap_write_lock_killable(mm)) {
+		err = -EINTR;
+		goto err_free;
+	}
+
+	/*
+	 * Need to be called with mmap write lock
+	 * held, to avoid race with ksmd.
+	 */
+	err = ksm_execve(mm);
+	if (err)
+		goto err_ksm;
+
+	/*
+	 * Place the stack at the largest stack address the architecture
+	 * supports. Later, we'll move this to an appropriate place. We don't
+	 * use STACK_TOP because that can depend on attributes which aren't
+	 * configured yet.
+	 */
+	BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
+	vma->vm_end = STACK_TOP_MAX;
+	vma->vm_start = vma->vm_end - PAGE_SIZE;
+	vm_flags_init(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP);
+	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
+
+	err = insert_vm_struct(mm, vma);
+	if (err)
+		goto err;
+
+	mm->stack_vm = mm->total_vm = 1;
+	mmap_write_unlock(mm);
+	*vmap = vma;
+	*top_mem_p = vma->vm_end - sizeof(void *);
+	return 0;
+
+err:
+	ksm_exit(mm);
+err_ksm:
+	mmap_write_unlock(mm);
+err_free:
+	*vmap = NULL;
+	vm_area_free(vma);
+	return err;
+}
+
 /*
  * Relocate a VMA downwards by shift bytes. There cannot be any VMAs between
  * this VMA and its relocated range, which will now reside at [vma->vm_start -
-- 
2.49.0



* [PATCH v2 2/3] mm: move dup_mmap() to mm
  2025-04-25 14:54 [PATCH v2 0/3] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes
  2025-04-25 14:54 ` [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes
@ 2025-04-25 14:54 ` Lorenzo Stoakes
  2025-04-25 14:54 ` [PATCH v2 3/3] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes
  2 siblings, 0 replies; 8+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 14:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
	David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner,
	Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
	linux-kernel

This is a key step in our being able to abstract and isolate VMA allocation
and destruction logic.

This function is the last one where vm_area_free() and vm_area_dup() are
directly referenced outside of mmap, so having this in mm allows us to
isolate these.

We do the same for the nommu version, which is substantially simpler.

We place the declaration for dup_mmap() in mm/internal.h and have
kernel/fork.c import this in order to prevent improper use of this
functionality elsewhere in the kernel.

While we're here, we remove the useless #ifdef CONFIG_MMU check around
mmap_read_lock_maybe_expand() in mmap.c; mmap.c is compiled only if
CONFIG_MMU is set.
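
The upshot is that the only remaining caller outside mm has to pull the
declaration in explicitly, as the kernel/fork.c hunk below does:

    /* For dup_mmap(). */
    #include "../mm/internal.h"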

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
---
 kernel/fork.c | 189 ++------------------------------------------------
 mm/internal.h |   2 +
 mm/mmap.c     | 181 +++++++++++++++++++++++++++++++++++++++++++++--
 mm/nommu.c    |   8 +++
 4 files changed, 189 insertions(+), 191 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 168681fc4b25..ac9f9267a473 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -112,6 +112,9 @@
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 
+/* For dup_mmap(). */
+#include "../mm/internal.h"
+
 #include <trace/events/sched.h>
 
 #define CREATE_TRACE_POINTS
@@ -589,7 +592,7 @@ void free_task(struct task_struct *tsk)
 }
 EXPORT_SYMBOL(free_task);
 
-static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
+void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
 {
 	struct file *exe_file;
 
@@ -604,183 +607,6 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
 }
 
 #ifdef CONFIG_MMU
-static __latent_entropy int dup_mmap(struct mm_struct *mm,
-					struct mm_struct *oldmm)
-{
-	struct vm_area_struct *mpnt, *tmp;
-	int retval;
-	unsigned long charge = 0;
-	LIST_HEAD(uf);
-	VMA_ITERATOR(vmi, mm, 0);
-
-	if (mmap_write_lock_killable(oldmm))
-		return -EINTR;
-	flush_cache_dup_mm(oldmm);
-	uprobe_dup_mmap(oldmm, mm);
-	/*
-	 * Not linked in yet - no deadlock potential:
-	 */
-	mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);
-
-	/* No ordering required: file already has been exposed. */
-	dup_mm_exe_file(mm, oldmm);
-
-	mm->total_vm = oldmm->total_vm;
-	mm->data_vm = oldmm->data_vm;
-	mm->exec_vm = oldmm->exec_vm;
-	mm->stack_vm = oldmm->stack_vm;
-
-	/* Use __mt_dup() to efficiently build an identical maple tree. */
-	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
-	if (unlikely(retval))
-		goto out;
-
-	mt_clear_in_rcu(vmi.mas.tree);
-	for_each_vma(vmi, mpnt) {
-		struct file *file;
-
-		vma_start_write(mpnt);
-		if (mpnt->vm_flags & VM_DONTCOPY) {
-			retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
-						    mpnt->vm_end, GFP_KERNEL);
-			if (retval)
-				goto loop_out;
-
-			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
-			continue;
-		}
-		charge = 0;
-		/*
-		 * Don't duplicate many vmas if we've been oom-killed (for
-		 * example)
-		 */
-		if (fatal_signal_pending(current)) {
-			retval = -EINTR;
-			goto loop_out;
-		}
-		if (mpnt->vm_flags & VM_ACCOUNT) {
-			unsigned long len = vma_pages(mpnt);
-
-			if (security_vm_enough_memory_mm(oldmm, len)) /* sic */
-				goto fail_nomem;
-			charge = len;
-		}
-		tmp = vm_area_dup(mpnt);
-		if (!tmp)
-			goto fail_nomem;
-
-		/* track_pfn_copy() will later take care of copying internal state. */
-		if (unlikely(tmp->vm_flags & VM_PFNMAP))
-			untrack_pfn_clear(tmp);
-
-		retval = vma_dup_policy(mpnt, tmp);
-		if (retval)
-			goto fail_nomem_policy;
-		tmp->vm_mm = mm;
-		retval = dup_userfaultfd(tmp, &uf);
-		if (retval)
-			goto fail_nomem_anon_vma_fork;
-		if (tmp->vm_flags & VM_WIPEONFORK) {
-			/*
-			 * VM_WIPEONFORK gets a clean slate in the child.
-			 * Don't prepare anon_vma until fault since we don't
-			 * copy page for current vma.
-			 */
-			tmp->anon_vma = NULL;
-		} else if (anon_vma_fork(tmp, mpnt))
-			goto fail_nomem_anon_vma_fork;
-		vm_flags_clear(tmp, VM_LOCKED_MASK);
-		/*
-		 * Copy/update hugetlb private vma information.
-		 */
-		if (is_vm_hugetlb_page(tmp))
-			hugetlb_dup_vma_private(tmp);
-
-		/*
-		 * Link the vma into the MT. After using __mt_dup(), memory
-		 * allocation is not necessary here, so it cannot fail.
-		 */
-		vma_iter_bulk_store(&vmi, tmp);
-
-		mm->map_count++;
-
-		if (tmp->vm_ops && tmp->vm_ops->open)
-			tmp->vm_ops->open(tmp);
-
-		file = tmp->vm_file;
-		if (file) {
-			struct address_space *mapping = file->f_mapping;
-
-			get_file(file);
-			i_mmap_lock_write(mapping);
-			if (vma_is_shared_maywrite(tmp))
-				mapping_allow_writable(mapping);
-			flush_dcache_mmap_lock(mapping);
-			/* insert tmp into the share list, just after mpnt */
-			vma_interval_tree_insert_after(tmp, mpnt,
-					&mapping->i_mmap);
-			flush_dcache_mmap_unlock(mapping);
-			i_mmap_unlock_write(mapping);
-		}
-
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
-			retval = copy_page_range(tmp, mpnt);
-
-		if (retval) {
-			mpnt = vma_next(&vmi);
-			goto loop_out;
-		}
-	}
-	/* a new mm has just been created */
-	retval = arch_dup_mmap(oldmm, mm);
-loop_out:
-	vma_iter_free(&vmi);
-	if (!retval) {
-		mt_set_in_rcu(vmi.mas.tree);
-		ksm_fork(mm, oldmm);
-		khugepaged_fork(mm, oldmm);
-	} else {
-
-		/*
-		 * The entire maple tree has already been duplicated. If the
-		 * mmap duplication fails, mark the failure point with
-		 * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
-		 * stop releasing VMAs that have not been duplicated after this
-		 * point.
-		 */
-		if (mpnt) {
-			mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
-			mas_store(&vmi.mas, XA_ZERO_ENTRY);
-			/* Avoid OOM iterating a broken tree */
-			set_bit(MMF_OOM_SKIP, &mm->flags);
-		}
-		/*
-		 * The mm_struct is going to exit, but the locks will be dropped
-		 * first.  Set the mm_struct as unstable is advisable as it is
-		 * not fully initialised.
-		 */
-		set_bit(MMF_UNSTABLE, &mm->flags);
-	}
-out:
-	mmap_write_unlock(mm);
-	flush_tlb_mm(oldmm);
-	mmap_write_unlock(oldmm);
-	if (!retval)
-		dup_userfaultfd_complete(&uf);
-	else
-		dup_userfaultfd_fail(&uf);
-	return retval;
-
-fail_nomem_anon_vma_fork:
-	mpol_put(vma_policy(tmp));
-fail_nomem_policy:
-	vm_area_free(tmp);
-fail_nomem:
-	retval = -ENOMEM;
-	vm_unacct_memory(charge);
-	goto loop_out;
-}
-
 static inline int mm_alloc_pgd(struct mm_struct *mm)
 {
 	mm->pgd = pgd_alloc(mm);
@@ -794,13 +620,6 @@ static inline void mm_free_pgd(struct mm_struct *mm)
 	pgd_free(mm, mm->pgd);
 }
 #else
-static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
-{
-	mmap_write_lock(oldmm);
-	dup_mm_exe_file(mm, oldmm);
-	mmap_write_unlock(oldmm);
-	return 0;
-}
 #define mm_alloc_pgd(mm)	(0)
 #define mm_free_pgd(mm)
 #endif /* CONFIG_MMU */
diff --git a/mm/internal.h b/mm/internal.h
index 838f840ded83..39067b3117a4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1630,5 +1630,7 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
 }
 #endif /* CONFIG_PT_RECLAIM */
 
+void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
+int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
 
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index ec8572a93418..5ba12aa8be59 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1869,7 +1869,6 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
 	return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff);
 }
 
-#ifdef CONFIG_MMU
 /*
  * Obtain a read lock on mm->mmap_lock, if the specified address is below the
  * start of the VMA, the intent is to perform a write, and it is a
@@ -1913,10 +1912,180 @@ bool mmap_read_lock_maybe_expand(struct mm_struct *mm,
 	mmap_write_downgrade(mm);
 	return true;
 }
-#else
-bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
-				 unsigned long addr, bool write)
+
+__latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 {
-	return false;
+	struct vm_area_struct *mpnt, *tmp;
+	int retval;
+	unsigned long charge = 0;
+	LIST_HEAD(uf);
+	VMA_ITERATOR(vmi, mm, 0);
+
+	if (mmap_write_lock_killable(oldmm))
+		return -EINTR;
+	flush_cache_dup_mm(oldmm);
+	uprobe_dup_mmap(oldmm, mm);
+	/*
+	 * Not linked in yet - no deadlock potential:
+	 */
+	mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);
+
+	/* No ordering required: file already has been exposed. */
+	dup_mm_exe_file(mm, oldmm);
+
+	mm->total_vm = oldmm->total_vm;
+	mm->data_vm = oldmm->data_vm;
+	mm->exec_vm = oldmm->exec_vm;
+	mm->stack_vm = oldmm->stack_vm;
+
+	/* Use __mt_dup() to efficiently build an identical maple tree. */
+	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
+	if (unlikely(retval))
+		goto out;
+
+	mt_clear_in_rcu(vmi.mas.tree);
+	for_each_vma(vmi, mpnt) {
+		struct file *file;
+
+		vma_start_write(mpnt);
+		if (mpnt->vm_flags & VM_DONTCOPY) {
+			retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
+						    mpnt->vm_end, GFP_KERNEL);
+			if (retval)
+				goto loop_out;
+
+			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
+			continue;
+		}
+		charge = 0;
+		/*
+		 * Don't duplicate many vmas if we've been oom-killed (for
+		 * example)
+		 */
+		if (fatal_signal_pending(current)) {
+			retval = -EINTR;
+			goto loop_out;
+		}
+		if (mpnt->vm_flags & VM_ACCOUNT) {
+			unsigned long len = vma_pages(mpnt);
+
+			if (security_vm_enough_memory_mm(oldmm, len)) /* sic */
+				goto fail_nomem;
+			charge = len;
+		}
+
+		tmp = vm_area_dup(mpnt);
+		if (!tmp)
+			goto fail_nomem;
+
+		/* track_pfn_copy() will later take care of copying internal state. */
+		if (unlikely(tmp->vm_flags & VM_PFNMAP))
+			untrack_pfn_clear(tmp);
+
+		retval = vma_dup_policy(mpnt, tmp);
+		if (retval)
+			goto fail_nomem_policy;
+		tmp->vm_mm = mm;
+		retval = dup_userfaultfd(tmp, &uf);
+		if (retval)
+			goto fail_nomem_anon_vma_fork;
+		if (tmp->vm_flags & VM_WIPEONFORK) {
+			/*
+			 * VM_WIPEONFORK gets a clean slate in the child.
+			 * Don't prepare anon_vma until fault since we don't
+			 * copy page for current vma.
+			 */
+			tmp->anon_vma = NULL;
+		} else if (anon_vma_fork(tmp, mpnt))
+			goto fail_nomem_anon_vma_fork;
+		vm_flags_clear(tmp, VM_LOCKED_MASK);
+		/*
+		 * Copy/update hugetlb private vma information.
+		 */
+		if (is_vm_hugetlb_page(tmp))
+			hugetlb_dup_vma_private(tmp);
+
+		/*
+		 * Link the vma into the MT. After using __mt_dup(), memory
+		 * allocation is not necessary here, so it cannot fail.
+		 */
+		vma_iter_bulk_store(&vmi, tmp);
+
+		mm->map_count++;
+
+		if (tmp->vm_ops && tmp->vm_ops->open)
+			tmp->vm_ops->open(tmp);
+
+		file = tmp->vm_file;
+		if (file) {
+			struct address_space *mapping = file->f_mapping;
+
+			get_file(file);
+			i_mmap_lock_write(mapping);
+			if (vma_is_shared_maywrite(tmp))
+				mapping_allow_writable(mapping);
+			flush_dcache_mmap_lock(mapping);
+			/* insert tmp into the share list, just after mpnt */
+			vma_interval_tree_insert_after(tmp, mpnt,
+					&mapping->i_mmap);
+			flush_dcache_mmap_unlock(mapping);
+			i_mmap_unlock_write(mapping);
+		}
+
+		if (!(tmp->vm_flags & VM_WIPEONFORK))
+			retval = copy_page_range(tmp, mpnt);
+
+		if (retval) {
+			mpnt = vma_next(&vmi);
+			goto loop_out;
+		}
+	}
+	/* a new mm has just been created */
+	retval = arch_dup_mmap(oldmm, mm);
+loop_out:
+	vma_iter_free(&vmi);
+	if (!retval) {
+		mt_set_in_rcu(vmi.mas.tree);
+		ksm_fork(mm, oldmm);
+		khugepaged_fork(mm, oldmm);
+	} else {
+
+		/*
+		 * The entire maple tree has already been duplicated. If the
+		 * mmap duplication fails, mark the failure point with
+		 * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
+		 * stop releasing VMAs that have not been duplicated after this
+		 * point.
+		 */
+		if (mpnt) {
+			mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
+			mas_store(&vmi.mas, XA_ZERO_ENTRY);
+			/* Avoid OOM iterating a broken tree */
+			set_bit(MMF_OOM_SKIP, &mm->flags);
+		}
+		/*
+		 * The mm_struct is going to exit, but the locks will be dropped
+		 * first.  Set the mm_struct as unstable is advisable as it is
+		 * not fully initialised.
+		 */
+		set_bit(MMF_UNSTABLE, &mm->flags);
+	}
+out:
+	mmap_write_unlock(mm);
+	flush_tlb_mm(oldmm);
+	mmap_write_unlock(oldmm);
+	if (!retval)
+		dup_userfaultfd_complete(&uf);
+	else
+		dup_userfaultfd_fail(&uf);
+	return retval;
+
+fail_nomem_anon_vma_fork:
+	mpol_put(vma_policy(tmp));
+fail_nomem_policy:
+	vm_area_free(tmp);
+fail_nomem:
+	retval = -ENOMEM;
+	vm_unacct_memory(charge);
+	goto loop_out;
 }
-#endif
diff --git a/mm/nommu.c b/mm/nommu.c
index 2b4d304c6445..a142fc258d39 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1874,3 +1874,11 @@ static int __meminit init_admin_reserve(void)
 	return 0;
 }
 subsys_initcall(init_admin_reserve);
+
+int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+	mmap_write_lock(oldmm);
+	dup_mm_exe_file(mm, oldmm);
+	mmap_write_unlock(oldmm);
+	return 0;
+}
-- 
2.49.0



* [PATCH v2 3/3] mm: perform VMA allocation, freeing, duplication in mm
  2025-04-25 14:54 [PATCH v2 0/3] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes
  2025-04-25 14:54 ` [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes
  2025-04-25 14:54 ` [PATCH v2 2/3] mm: move dup_mmap() to mm Lorenzo Stoakes
@ 2025-04-25 14:54 ` Lorenzo Stoakes
  2 siblings, 0 replies; 8+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 14:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
	David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner,
	Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
	linux-kernel

Right now these are performed in kernel/fork.c, which is odd and a
violation of separation of concerns. It also prevents us from integrating
this and related logic into userland VMA testing going forward and, perhaps
more importantly, from making VMA allocation/freeing a purely internal mm
operation in a subsequent commit.

There is a fly in the ointment: nommu. mmap.c is not compiled if
CONFIG_MMU is not set, and neither is vma.c.

To square the circle, let's add a new file - vma_init.c. This will be
compiled for both CONFIG_MMU and nommu builds, and will also form part of
the VMA userland testing.

This allows us to de-duplicate code, while maintaining separation of
concerns and the ability to test this logic in userland.

Update the VMA userland tests accordingly, additionally adding a
detach_free_vma() helper function to correctly detach VMAs before freeing
them in test code, as this change would otherwise trigger the detachment
assert in vm_area_free().
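
The helper is minimal (from the test diff below); it simply marks the VMA
detached before handing it to vm_area_free(), which now asserts detachment:

    static void detach_free_vma(struct vm_area_struct *vma)
    {
            vma_mark_detached(vma);
            vm_area_free(vma);
    }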

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 MAINTAINERS                      |   1 +
 kernel/fork.c                    |  88 -------------------
 mm/Makefile                      |   2 +-
 mm/mmap.c                        |   3 +-
 mm/nommu.c                       |   4 +-
 mm/vma.h                         |   6 ++
 mm/vma_init.c                    | 101 ++++++++++++++++++++++
 tools/testing/vma/Makefile       |   2 +-
 tools/testing/vma/vma.c          |  26 ++++--
 tools/testing/vma/vma_internal.h | 143 +++++++++++++++++++++++++------
 10 files changed, 250 insertions(+), 126 deletions(-)
 create mode 100644 mm/vma_init.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 4015227645cc..ce422b268cb6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15608,6 +15608,7 @@ F:	mm/mremap.c
 F:	mm/mseal.c
 F:	mm/vma.c
 F:	mm/vma.h
+F:	mm/vma_init.c
 F:	mm/vma_internal.h
 F:	tools/testing/selftests/mm/merge.c
 F:	tools/testing/vma/
diff --git a/kernel/fork.c b/kernel/fork.c
index ac9f9267a473..9e4616dacd82 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -431,88 +431,9 @@ struct kmem_cache *files_cachep;
 /* SLAB cache for fs_struct structures (tsk->fs) */
 struct kmem_cache *fs_cachep;
 
-/* SLAB cache for vm_area_struct structures */
-static struct kmem_cache *vm_area_cachep;
-
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
-{
-	struct vm_area_struct *vma;
-
-	vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
-	if (!vma)
-		return NULL;
-
-	vma_init(vma, mm);
-
-	return vma;
-}
-
-static void vm_area_init_from(const struct vm_area_struct *src,
-			      struct vm_area_struct *dest)
-{
-	dest->vm_mm = src->vm_mm;
-	dest->vm_ops = src->vm_ops;
-	dest->vm_start = src->vm_start;
-	dest->vm_end = src->vm_end;
-	dest->anon_vma = src->anon_vma;
-	dest->vm_pgoff = src->vm_pgoff;
-	dest->vm_file = src->vm_file;
-	dest->vm_private_data = src->vm_private_data;
-	vm_flags_init(dest, src->vm_flags);
-	memcpy(&dest->vm_page_prot, &src->vm_page_prot,
-	       sizeof(dest->vm_page_prot));
-	/*
-	 * src->shared.rb may be modified concurrently when called from
-	 * dup_mmap(), but the clone will reinitialize it.
-	 */
-	data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
-	memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
-	       sizeof(dest->vm_userfaultfd_ctx));
-#ifdef CONFIG_ANON_VMA_NAME
-	dest->anon_name = src->anon_name;
-#endif
-#ifdef CONFIG_SWAP
-	memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
-	       sizeof(dest->swap_readahead_info));
-#endif
-#ifndef CONFIG_MMU
-	dest->vm_region = src->vm_region;
-#endif
-#ifdef CONFIG_NUMA
-	dest->vm_policy = src->vm_policy;
-#endif
-}
-
-struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
-{
-	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
-
-	if (!new)
-		return NULL;
-
-	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
-	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
-	vm_area_init_from(orig, new);
-	vma_lock_init(new, true);
-	INIT_LIST_HEAD(&new->anon_vma_chain);
-	vma_numab_state_init(new);
-	dup_anon_vma_name(orig, new);
-
-	return new;
-}
-
-void vm_area_free(struct vm_area_struct *vma)
-{
-	/* The vma should be detached while being destroyed. */
-	vma_assert_detached(vma);
-	vma_numab_state_free(vma);
-	free_anon_vma_name(vma);
-	kmem_cache_free(vm_area_cachep, vma);
-}
-
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
 	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -3033,11 +2954,6 @@ void __init mm_cache_init(void)
 
 void __init proc_caches_init(void)
 {
-	struct kmem_cache_args args = {
-		.use_freeptr_offset = true,
-		.freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
-	};
-
 	sighand_cachep = kmem_cache_create("sighand_cache",
 			sizeof(struct sighand_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -3054,10 +2970,6 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
-	vm_area_cachep = kmem_cache_create("vm_area_struct",
-			sizeof(struct vm_area_struct), &args,
-			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
-			SLAB_ACCOUNT);
 	mmap_init();
 	nsproxy_cache_init();
 }
diff --git a/mm/Makefile b/mm/Makefile
index 9d7e5b5bb694..88e80df4b539 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -55,7 +55,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o percpu.o slab_common.o \
 			   compaction.o show_mem.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o gup.o mmap_lock.o $(mmu-y)
+			   debug.o gup.o mmap_lock.o vma_init.o $(mmu-y)
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
diff --git a/mm/mmap.c b/mm/mmap.c
index 5ba12aa8be59..99e51d82ac0b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1596,7 +1596,7 @@ static const struct ctl_table mmap_table[] = {
 #endif /* CONFIG_SYSCTL */
 
 /*
- * initialise the percpu counter for VM
+ * initialise the percpu counter for VM, initialise VMA state.
  */
 void __init mmap_init(void)
 {
@@ -1607,6 +1607,7 @@ void __init mmap_init(void)
 #ifdef CONFIG_SYSCTL
 	register_sysctl_init("vm", mmap_table);
 #endif
+	vma_state_init();
 }
 
 /*
diff --git a/mm/nommu.c b/mm/nommu.c
index a142fc258d39..0bf4849b8204 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -399,7 +399,8 @@ static const struct ctl_table nommu_table[] = {
 };
 
 /*
- * initialise the percpu counter for VM and region record slabs
+ * initialise the percpu counter for VM and region record slabs, initialise VMA
+ * state.
  */
 void __init mmap_init(void)
 {
@@ -409,6 +410,7 @@ void __init mmap_init(void)
 	VM_BUG_ON(ret);
 	vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC|SLAB_ACCOUNT);
 	register_sysctl_init("vm", nommu_table);
+	vma_state_init();
 }
 
 /*
diff --git a/mm/vma.h b/mm/vma.h
index 149926e8a6d1..7f476ca3d52e 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -548,4 +548,10 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address);
 
 int __vm_munmap(unsigned long start, size_t len, bool unlock);
 
+/* vma_init.h, shared between CONFIG_MMU and nommu. */
+void __init vma_state_init(void);
+struct vm_area_struct *vm_area_alloc(struct mm_struct *mm);
+struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig);
+void vm_area_free(struct vm_area_struct *vma);
+
 #endif	/* __MM_VMA_H */
diff --git a/mm/vma_init.c b/mm/vma_init.c
new file mode 100644
index 000000000000..967ca8517986
--- /dev/null
+++ b/mm/vma_init.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/*
+ * Functions for initialising, allocating, freeing and duplicating VMAs. Shared
+ * between CONFIG_MMU and non-CONFIG_MMU kernel configurations.
+ */
+
+#include "vma_internal.h"
+#include "vma.h"
+
+/* SLAB cache for vm_area_struct structures */
+static struct kmem_cache *vm_area_cachep;
+
+void __init vma_state_init(void)
+{
+	struct kmem_cache_args args = {
+		.use_freeptr_offset = true,
+		.freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+	};
+
+	vm_area_cachep = kmem_cache_create("vm_area_struct",
+			sizeof(struct vm_area_struct), &args,
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
+			SLAB_ACCOUNT);
+}
+
+struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+
+	vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	if (!vma)
+		return NULL;
+
+	vma_init(vma, mm);
+
+	return vma;
+}
+
+static void vm_area_init_from(const struct vm_area_struct *src,
+			      struct vm_area_struct *dest)
+{
+	dest->vm_mm = src->vm_mm;
+	dest->vm_ops = src->vm_ops;
+	dest->vm_start = src->vm_start;
+	dest->vm_end = src->vm_end;
+	dest->anon_vma = src->anon_vma;
+	dest->vm_pgoff = src->vm_pgoff;
+	dest->vm_file = src->vm_file;
+	dest->vm_private_data = src->vm_private_data;
+	vm_flags_init(dest, src->vm_flags);
+	memcpy(&dest->vm_page_prot, &src->vm_page_prot,
+	       sizeof(dest->vm_page_prot));
+	/*
+	 * src->shared.rb may be modified concurrently when called from
+	 * dup_mmap(), but the clone will reinitialize it.
+	 */
+	data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
+	memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
+	       sizeof(dest->vm_userfaultfd_ctx));
+#ifdef CONFIG_ANON_VMA_NAME
+	dest->anon_name = src->anon_name;
+#endif
+#ifdef CONFIG_SWAP
+	memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
+	       sizeof(dest->swap_readahead_info));
+#endif
+#ifndef CONFIG_MMU
+	dest->vm_region = src->vm_region;
+#endif
+#ifdef CONFIG_NUMA
+	dest->vm_policy = src->vm_policy;
+#endif
+}
+
+struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
+{
+	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+
+	if (!new)
+		return NULL;
+
+	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
+	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
+	vm_area_init_from(orig, new);
+	vma_lock_init(new, true);
+	INIT_LIST_HEAD(&new->anon_vma_chain);
+	vma_numab_state_init(new);
+	dup_anon_vma_name(orig, new);
+
+	return new;
+}
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+	/* The vma should be detached while being destroyed. */
+	vma_assert_detached(vma);
+	vma_numab_state_free(vma);
+	free_anon_vma_name(vma);
+	kmem_cache_free(vm_area_cachep, vma);
+}
diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile
index 860fd2311dcc..4fa5a371e277 100644
--- a/tools/testing/vma/Makefile
+++ b/tools/testing/vma/Makefile
@@ -9,7 +9,7 @@ include ../shared/shared.mk
 OFILES = $(SHARED_OFILES) vma.o maple-shim.o
 TARGETS = vma
 
-vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma.h
+vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma_init.c ../../../mm/vma.h
 
 vma:	$(OFILES)
 	$(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS)
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
index 7cfd6e31db10..98a1a0390583 100644
--- a/tools/testing/vma/vma.c
+++ b/tools/testing/vma/vma.c
@@ -28,6 +28,7 @@ unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT;
  * Directly import the VMA implementation here. Our vma_internal.h wrapper
  * provides userland-equivalent functionality for everything vma.c uses.
  */
+#include "../../../mm/vma_init.c"
 #include "../../../mm/vma.c"
 
 const struct vm_operations_struct vma_dummy_vm_ops;
@@ -90,6 +91,12 @@ static int attach_vma(struct mm_struct *mm, struct vm_area_struct *vma)
 	return res;
 }
 
+static void detach_free_vma(struct vm_area_struct *vma)
+{
+	vma_mark_detached(vma);
+	vm_area_free(vma);
+}
+
 /* Helper function to allocate a VMA and link it to the tree. */
 static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm,
 						 unsigned long start,
@@ -103,7 +110,7 @@ static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm,
 		return NULL;
 
 	if (attach_vma(mm, vma)) {
-		vm_area_free(vma);
+		detach_free_vma(vma);
 		return NULL;
 	}
 
@@ -248,7 +255,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi)
 
 	vma_iter_set(vmi, 0);
 	for_each_vma(*vmi, vma) {
-		vm_area_free(vma);
+		detach_free_vma(vma);
 		count++;
 	}
 
@@ -318,7 +325,7 @@ static bool test_simple_merge(void)
 	ASSERT_EQ(vma->vm_pgoff, 0);
 	ASSERT_EQ(vma->vm_flags, flags);
 
-	vm_area_free(vma);
+	detach_free_vma(vma);
 	mtree_destroy(&mm.mm_mt);
 
 	return true;
@@ -360,7 +367,7 @@ static bool test_simple_modify(void)
 	ASSERT_EQ(vma->vm_end, 0x1000);
 	ASSERT_EQ(vma->vm_pgoff, 0);
 
-	vm_area_free(vma);
+	detach_free_vma(vma);
 	vma_iter_clear(&vmi);
 
 	vma = vma_next(&vmi);
@@ -369,7 +376,7 @@ static bool test_simple_modify(void)
 	ASSERT_EQ(vma->vm_end, 0x2000);
 	ASSERT_EQ(vma->vm_pgoff, 1);
 
-	vm_area_free(vma);
+	detach_free_vma(vma);
 	vma_iter_clear(&vmi);
 
 	vma = vma_next(&vmi);
@@ -378,7 +385,7 @@ static bool test_simple_modify(void)
 	ASSERT_EQ(vma->vm_end, 0x3000);
 	ASSERT_EQ(vma->vm_pgoff, 2);
 
-	vm_area_free(vma);
+	detach_free_vma(vma);
 	mtree_destroy(&mm.mm_mt);
 
 	return true;
@@ -406,7 +413,7 @@ static bool test_simple_expand(void)
 	ASSERT_EQ(vma->vm_end, 0x3000);
 	ASSERT_EQ(vma->vm_pgoff, 0);
 
-	vm_area_free(vma);
+	detach_free_vma(vma);
 	mtree_destroy(&mm.mm_mt);
 
 	return true;
@@ -427,7 +434,7 @@ static bool test_simple_shrink(void)
 	ASSERT_EQ(vma->vm_end, 0x1000);
 	ASSERT_EQ(vma->vm_pgoff, 0);
 
-	vm_area_free(vma);
+	detach_free_vma(vma);
 	mtree_destroy(&mm.mm_mt);
 
 	return true;
@@ -618,7 +625,7 @@ static bool test_merge_new(void)
 		ASSERT_EQ(vma->vm_pgoff, 0);
 		ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 
-		vm_area_free(vma);
+		detach_free_vma(vma);
 		count++;
 	}
 
@@ -1667,6 +1674,7 @@ int main(void)
 	int num_tests = 0, num_fail = 0;
 
 	maple_tree_init();
+	vma_state_init();
 
 #define TEST(name)							\
 	do {								\
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 572ab2cea763..28f778818d3f 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -135,6 +135,10 @@ typedef __bitwise unsigned int vm_fault_t;
  */
 #define pr_warn_once pr_err
 
+#define data_race(expr) expr
+
+#define ASSERT_EXCLUSIVE_WRITER(x)
+
 struct kref {
 	refcount_t refcount;
 };
@@ -235,6 +239,8 @@ struct file {
 
 #define VMA_LOCK_OFFSET	0x40000000
 
+typedef struct { unsigned long v; } freeptr_t;
+
 struct vm_area_struct {
 	/* The first cache line has the info for VMA tree walking. */
 
@@ -244,9 +250,7 @@ struct vm_area_struct {
 			unsigned long vm_start;
 			unsigned long vm_end;
 		};
-#ifdef CONFIG_PER_VMA_LOCK
-		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
-#endif
+		freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
 	};
 
 	struct mm_struct *vm_mm;	/* The address space we belong to. */
@@ -421,6 +425,65 @@ struct vm_unmapped_area_info {
 	unsigned long start_gap;
 };
 
+struct kmem_cache_args {
+	/**
+	 * @align: The required alignment for the objects.
+	 *
+	 * %0 means no specific alignment is requested.
+	 */
+	unsigned int align;
+	/**
+	 * @useroffset: Usercopy region offset.
+	 *
+	 * %0 is a valid offset, when @usersize is non-%0
+	 */
+	unsigned int useroffset;
+	/**
+	 * @usersize: Usercopy region size.
+	 *
+	 * %0 means no usercopy region is specified.
+	 */
+	unsigned int usersize;
+	/**
+	 * @freeptr_offset: Custom offset for the free pointer
+	 * in &SLAB_TYPESAFE_BY_RCU caches
+	 *
+	 * By default &SLAB_TYPESAFE_BY_RCU caches place the free pointer
+	 * outside of the object. This might cause the object to grow in size.
+	 * Cache creators that have a reason to avoid this can specify a custom
+	 * free pointer offset in their struct where the free pointer will be
+	 * placed.
+	 *
+	 * Note that placing the free pointer inside the object requires the
+	 * caller to ensure that no fields are invalidated that are required to
+	 * guard against object recycling (See &SLAB_TYPESAFE_BY_RCU for
+	 * details).
+	 *
+	 * Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset
+	 * is specified, %use_freeptr_offset must be set %true.
+	 *
+	 * Note that @ctor currently isn't supported with custom free pointers
+	 * as a @ctor requires an external free pointer.
+	 */
+	unsigned int freeptr_offset;
+	/**
+	 * @use_freeptr_offset: Whether a @freeptr_offset is used.
+	 */
+	bool use_freeptr_offset;
+	/**
+	 * @ctor: A constructor for the objects.
+	 *
+	 * The constructor is invoked for each object in a newly allocated slab
+	 * page. It is the cache user's responsibility to free object in the
+	 * same state as after calling the constructor, or deal appropriately
+	 * with any differences between a freshly constructed and a reallocated
+	 * object.
+	 *
+	 * %NULL means no constructor.
+	 */
+	void (*ctor)(void *);
+};
+
 static inline void vma_iter_invalidate(struct vma_iterator *vmi)
 {
 	mas_pause(&vmi->mas);
@@ -505,31 +568,38 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_lock_seq = UINT_MAX;
 }
 
-static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
-{
-	struct vm_area_struct *vma = calloc(1, sizeof(struct vm_area_struct));
+struct kmem_cache {
+	const char *name;
+	size_t object_size;
+	struct kmem_cache_args *args;
+};
 
-	if (!vma)
-		return NULL;
+static inline struct kmem_cache *__kmem_cache_create(const char *name,
+						     size_t object_size,
+						     struct kmem_cache_args *args)
+{
+	struct kmem_cache *ret = malloc(sizeof(struct kmem_cache));
 
-	vma_init(vma, mm);
+	ret->name = name;
+	ret->object_size = object_size;
+	ret->args = args;
 
-	return vma;
+	return ret;
 }
 
-static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
-{
-	struct vm_area_struct *new = calloc(1, sizeof(struct vm_area_struct));
+#define kmem_cache_create(__name, __object_size, __args, ...)           \
+	__kmem_cache_create((__name), (__object_size), (__args))
 
-	if (!new)
-		return NULL;
+static inline void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+{
+	(void)gfpflags;
 
-	memcpy(new, orig, sizeof(*new));
-	refcount_set(&new->vm_refcnt, 0);
-	new->vm_lock_seq = UINT_MAX;
-	INIT_LIST_HEAD(&new->anon_vma_chain);
+	return calloc(s->object_size, 1);
+}
 
-	return new;
+static inline void kmem_cache_free(struct kmem_cache *s, void *x)
+{
+	free(x);
 }
 
 /*
@@ -696,11 +766,6 @@ static inline void mpol_put(struct mempolicy *)
 {
 }
 
-static inline void vm_area_free(struct vm_area_struct *vma)
-{
-	free(vma);
-}
-
 static inline void lru_add_drain(void)
 {
 }
@@ -1240,4 +1305,32 @@ static inline int mapping_map_writable(struct address_space *mapping)
 	return 0;
 }
 
+static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
+{
+	(void)vma;
+	(void)reset_refcnt;
+}
+
+static inline void vma_numab_state_init(struct vm_area_struct *vma)
+{
+	(void)vma;
+}
+
+static inline void vma_numab_state_free(struct vm_area_struct *vma)
+{
+	(void)vma;
+}
+
+static inline void dup_anon_vma_name(struct vm_area_struct *orig_vma,
+				     struct vm_area_struct *new_vma)
+{
+	(void)orig_vma;
+	(void)new_vma;
+}
+
+static inline void free_anon_vma_name(struct vm_area_struct *vma)
+{
+	(void)vma;
+}
+
 #endif	/* __MM_VMA_INTERNAL_H */
-- 
2.49.0



* Re: [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem
  2025-04-25 14:54 ` [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes
@ 2025-04-25 17:09   ` Kees Cook
  2025-04-28  8:53     ` Lorenzo Stoakes
  0 siblings, 1 reply; 8+ messages in thread
From: Kees Cook @ 2025-04-25 17:09 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, David Hildenbrand, Alexander Viro,
	Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm,
	linux-fsdevel, linux-kernel

On Fri, Apr 25, 2025 at 03:54:34PM +0100, Lorenzo Stoakes wrote:
> There are peculiarities within the kernel where what is very clearly mm
> code is performed elsewhere arbitrarily.
> 
> This violates separation of concerns and makes it harder to refactor code
> to make changes to how fundamental initialisation and operation of mm logic
> is performed.
> 
> One such case is the creation of the VMA containing the initial stack upon
> execve()'ing a new process. This is currently performed in __bprm_mm_init()
> in fs/exec.c.
> 
> Abstract this operation to create_init_stack_vma(). This allows us to limit
> use of vma allocation and free code to fork and mm only.
> 
> We previously did the same for the step at which we relocate the initial
> stack VMA downwards via relocate_vma_down(); now we move the initial VMA
> establishment too.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  fs/exec.c          | 51 +---------------------------------

I'm kind of on the fence about this. On the one hand, yes, it's all vma
goo, and should live with the rest of vma code, as you suggest. On the
other hand, exec is the only consumer of this behavior, and moving it
out of fs/exec.c means that changes to the code that specifically only
impacts exec are now in a separate file, and will no longer get exec
maintainer/reviewer CCs (based on MAINTAINERS file matching). Exec is
notoriously fragile, so I'm kind of generally paranoid about changes to
its behaviors going unnoticed.

In defense of moving it, yes, this routine has gotten updates over the
many years, but it's relatively stable. But at least one thing has gone in
without exec maintainer review recently (I would have Acked it, but the
point is review): 9e567ca45f ("mm/ksm: fix ksm exec support for prctl").
Everything else was before I took on the role officially (Nov 2022).

So I guess I'm asking, how do we make sure stuff pulled out of exec
still gets exec maintainer review?

> [...]
>  static int __bprm_mm_init(struct linux_binprm *bprm)
>  {
> -	int err;
> [...]
> -	return err;
> +	return create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
>  }

I'd prefer __bprm_mm_init() go away if it's just a 1:1 wrapper now.
However, it doesn't really look like it makes too much sense for the NOMMU
logic to be moved as well, since it explicitly depends on exec-specific
values (MAX_ARG_PAGES), so perhaps something like this:

diff --git a/fs/exec.c b/fs/exec.c
index 8e4ea5f1e64c..313dc70e0012 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -382,9 +382,13 @@ static int bprm_mm_init(struct linux_binprm *bprm)
 	bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];
 	task_unlock(current->group_leader);
 
-	err = __bprm_mm_init(bprm);
+#ifndef CONFIG_MMU
+	bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *);
+#else
+	err = create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
 	if (err)
 		goto err;
+#endif
 
 	return 0;
 


On a related note, I'd like to point out that my claim that exec is
the only consumer here is slightly a lie. Technically this is correct,
but only because this is specifically setting up the _stack_.

The rest of the VMA setup actually surrounds this code (another
reason I remain unhappy about moving it). Specifically the mm_alloc()
before __bprm_mm_init (which is reached through alloc_bprm()). And
then, following alloc_bprm() in do_execveat_common(), is the call to
setup_new_exec(), which does the rest of the VMA setup, specifically
arch_pick_mmap_layout() and related fiddling.

The "create userspace VMA" logic, mostly through mm_alloc(), is
used in a few places (e.g. text poking), but the "bring up a _usable_
userspace VMA" logic (i.e. one also with functional mmap) is repeated in
lib/kunit/alloc_user.c for allowing testing of code that touches userspace
(see kunit_attach_mm() and the kunit_vm_mmap() users). (But these tests
don't actually run userspace code, so no stack is set up.)

I guess what I'm trying to say is that I think we need a more clearly
defined "create usable userspace VMA" API, as we've got at least 3
scattered approaches right now: exec ("everything"), non-mmap-non-stack
users (text poking, et al), and mmap-but-not-stack users (kunit tests).

And the One True User of a full userspace VMA, exec, has the full setup
scattered into several phases, mostly due to needing to separate those
phases because it needs to progressively gather the information needed
to correctly configure each piece:
- set up userspace VMA at all (mm_alloc)
- set up a stack because exec args need to go somewhere (__bprm_mm_init)
- move stack to the right place (depends on executable binary and task bits)
- set up mmap (arch_pick_mmap_layout) to actually load executable binary
  (depends on arch, binary, and task bits)

Hopefully this all explains why I'm uncomfortable to see __bprm_mm_init
get relocated. It'll _probably_ be fine, but I get antsy about changes
to code that only exec uses...

-- 
Kees Cook


* Re: [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem
  2025-04-25 17:09   ` Kees Cook
@ 2025-04-28  8:53     ` Lorenzo Stoakes
  2025-04-28 10:46       ` Lorenzo Stoakes
  0 siblings, 1 reply; 8+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28  8:53 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, David Hildenbrand, Alexander Viro,
	Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm,
	linux-fsdevel, linux-kernel

On Fri, Apr 25, 2025 at 10:09:34AM -0700, Kees Cook wrote:
> On Fri, Apr 25, 2025 at 03:54:34PM +0100, Lorenzo Stoakes wrote:
> > There are peculiarities within the kernel where what is very clearly mm
> > code is performed elsewhere arbitrarily.
> >
> > This violates separation of concerns and makes it harder to refactor code
> > to make changes to how fundamental initialisation and operation of mm logic
> > is performed.
> >
> > One such case is the creation of the VMA containing the initial stack upon
> > execve()'ing a new process. This is currently performed in __bprm_mm_init()
> > in fs/exec.c.
> >
> > Abstract this operation to create_init_stack_vma(). This allows us to limit
> > use of vma allocation and free code to fork and mm only.
> >
> > We previously did the same for the step at which we relocate the initial
> > stack VMA downwards via relocate_vma_down(); now we move the initial VMA
> > establishment too.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  fs/exec.c          | 51 +---------------------------------
>
> I'm kind of on the fence about this. On the one hand, yes, it's all vma
> goo, and should live with the rest of vma code, as you suggest. On the
> other hand, exec is the only consumer of this behavior, and moving it
> out of fs/exec.c means that changes to the code that specifically only
> impacts exec are now in a separate file, and will no longer get exec
> maintainer/reviewer CCs (based on MAINTAINERS file matching). Exec is
> notoriously fragile, so I'm kind of generally paranoid about changes to
> its behaviors going unnoticed.
>
> In defense of moving it, yes, this routine has gotten updates over the
> many years, but it's relatively stable. But at least one thing has gone in
> without exec maintainer review recently (I would have Acked it, but the
> point is review): 9e567ca45f ("mm/ksm: fix ksm exec support for prctl").
> Everything else was before I took on the role officially (Nov 2022).
>
> So I guess I'm asking, how do we make sure stuff pulled out of exec
> still gets exec maintainer review?

I think we have two options here:

1. Separate out this code into mm/vma_exec.c and treat it like
   mm/vma_init.c, then add you as a reviewer, so you have visibility on
   everything that happens there.
2. Add you as a reviewer to memory mapping in general.

I think 1 is preferable actually, as it'll reduce noise for you and then
you'll _always_ get notified about changes to this code.

Note that we have done this previously for similar reasons with
relocate_vma_down(); we could move this function into that file.

For the sake of saving time given time zone differences, let me explore
option 1 in a v3, and obviously if that doesn't work for you let me know! :)

We'll then have to have fs/exec.c include ../mm/vma_exec.h, so that it is
_only_ exec that gets access to this.
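
As a rough sketch (vma_exec.c/vma_exec.h are proposed names only at this
point, not from an actual patch):

    /* fs/exec.c */
    /* For create_init_stack_vma() (and potentially relocate_vma_down()). */
    #include "../mm/vma_exec.h"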

>
> > [...]
> >  static int __bprm_mm_init(struct linux_binprm *bprm)
> >  {
> > -	int err;
> > [...]
> > -	return err;
> > +	return create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
> >  }
>
> I'd prefer __bprm_mm_init() go away if it's just a 1:1 wrapper now.
> However, it doesn't really look like it makes too much sense for the NOMMU
> > logic to be moved as well, since it explicitly depends on exec-specific
> values (MAX_ARG_PAGES), so perhaps something like this:
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 8e4ea5f1e64c..313dc70e0012 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -382,9 +382,13 @@ static int bprm_mm_init(struct linux_binprm *bprm)
>  	bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];
>  	task_unlock(current->group_leader);
>
> -	err = __bprm_mm_init(bprm);
> +#ifndef CONFIG_MMU
> +	bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *);
> +#else
> +	err = create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
>  	if (err)
>  		goto err;
> +#endif
>
>  	return 0;

Sure I can respin with this.

>
>
>
> On a related note, I'd like to point out that my claim that exec is
> the only consumer here is slightly a lie. Technically this is correct,
> but only because this is specifically setting up the _stack_.
>
> The rest of the VMA setup actually surrounds this code (another
> reason I remain unhappy about moving it). Specifically the mm_alloc()
> before __bprm_mm_init (which is reached through alloc_brpm()). And
> then, following alloc_bprm() in do_execveat_common(), is the call to
> setup_new_exec(), which does the rest of the VMA setup, specifically
> arch_pick_mmap_layout() and related fiddling.

No other callers try to allocate/free vmas, which is the issue here.

>
> The "create userspace VMA" logic, mostly through mm_alloc(), is
> used in a few places (e.g. text poking), but the "bring up a _usable_
> userspace VMA" logic (i.e. one also with functional mmap) is repeated in
> lib/kunit/alloc_user.c for allowing testing of code that touches userspace
> (see kunit_attach_mm() and the kunit_vm_mmap() users). (But these tests
> don't actually run userspace code, so no stack is set up.)

Hm, but this seems to be something different? This is using the mm_alloc()
interface?

>
> I guess what I'm trying to say is that I think we need a more clearly
> defined "create usable userspace VMA" API, as we've got at least 3
> scattered approaches right now: exec ("everything"), non-mmap-non-stack
> users (text poking, et al), and mmap-but-not-stack users (kunit tests).

But only exec is actually allocating and essentially 'forcing in' a stack
vma like this?

I mean I'm all for having some nicely abstracted interface rather than
scattering open-coded stuff around, but the one and only objective _here_
is to disallow the use of very clearly mm-specific internal APIs elsewhere.

exec is of course 'special' in this stack behaviour, but I feel we can
express this 'specialness' better by having these kinds of well-defined,
well-described functions exposed by mm, rather than handing over absolutely
fundamental API calls to any part of the kernel that wants them.

>
> And the One True User of a full userspace VMA, exec, has the full setup
> scattered into several phases, mostly due to needing to separate those
> phases because it needs to progressively gather the information needed
> to correctly configure each piece:
> - set up userspace VMA at all (mm_alloc)
> - set up a stack because exec args need to go somewhere (__bprm_mm_init)
> - move stack to the right place (depends on executable binary and task bits)
> - set up mmap (arch_pick_mmap_layout) to actually load executable binary
>   (depends on arch, binary, and task bits)
>
> Hopefully this all explains why I'm uncomfortable to see __bprm_mm_init
> get relocated. It'll _probably_ be fine, but I get antsy about changes
> to code that only exec uses...

Right, I understand and appreciate that, for sure. This is fiddly core
code; you worry about things breaking. I mean, I feel you (don't ask me
about brk(), please please don't ask me about brk() :P)

So hopefully the proposed solution with vma_exec.c should cover this off
nicely?
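
(To be concrete about what I'm proposing - the exec-only VMA logic would
all live in one place, something like the below, assuming the names we've
discussed so far:

	/* mm/vma_exec.c - VMA functionality that exists solely for exec. */
	int create_init_stack_vma(struct mm_struct *mm,
				  struct vm_area_struct **vmap,
				  unsigned long *top_mem_p);
	int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift);

So exec's 'special' stack behaviour stays in a single, reviewable unit.)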


>
> --
> Kees Cook

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem
  2025-04-28  8:53     ` Lorenzo Stoakes
@ 2025-04-28 10:46       ` Lorenzo Stoakes
  2025-04-28 20:29         ` Kees Cook
  0 siblings, 1 reply; 8+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 10:46 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, David Hildenbrand, Alexander Viro,
	Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, Apr 28, 2025 at 09:53:05AM +0100, Lorenzo Stoakes wrote:
> On Fri, Apr 25, 2025 at 10:09:34AM -0700, Kees Cook wrote:
> > On Fri, Apr 25, 2025 at 03:54:34PM +0100, Lorenzo Stoakes wrote:
> > > There are peculiarities within the kernel where what is very clearly mm
> > > code is performed elsewhere arbitrarily.
> > >
> > > This violates separation of concerns and makes it harder to refactor code
> > > to make changes to how fundamental initialisation and operation of mm logic
> > > is performed.
> > >
> > > One such case is the creation of the VMA containing the initial stack upon
> > > execve()'ing a new process. This is currently performed in __bprm_mm_init()
> > > in fs/exec.c.
> > >
> > > Abstract this operation to create_init_stack_vma(). This allows us to limit
> > > use of vma allocation and free code to fork and mm only.
> > >
> > > We previously did the same for the step at which we relocate the initial
> > > stack VMA downwards via relocate_vma_down(); now we move the initial VMA
> > > establishment too.
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Acked-by: David Hildenbrand <david@redhat.com>
> > > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  fs/exec.c          | 51 +---------------------------------
> >
> > I'm kind of on the fence about this. On the one hand, yes, it's all vma
> > goo, and should live with the rest of vma code, as you suggest. On the
> > other hand, exec is the only consumer of this behavior, and moving it
> > out of fs/exec.c means that changes to the code that specifically only
> > impacts exec are now in a separate file, and will no longer get exec
> > maintainer/reviewer CCs (based on MAINTAINERS file matching). Exec is
> > notoriously fragile, so I'm kind of generally paranoid about changes to
> > its behaviors going unnoticed.
> >
> > In defense of moving it, yes, this routine has gotten updates over the
> > many years, but it's relatively stable. But at least one thing has gone in
> > without exec maintainer review recently (I would have Acked it, but the
> > point is review): 9e567ca45f ("mm/ksm: fix ksm exec support for prctl")
> > Everything else was before I took on the role officially (Nov 2022).
> >
> > So I guess I'm asking, how do we make sure stuff pulled out of exec
> > still gets exec maintainer review?
>
> I think we have two options here:
>
> 1. Separate out this code into mm/vma_exec.c and treat it like
>    mm/vma_init.c, then add you as a reviewer, so you have visibility on
>    everything that happens there.
>

Actually, (off-list) Vlastimil made the very good suggestion that we can
just add this new file to both the exec and memory mapping MAINTAINERS
sections - I've tested it and it works!

So I think this should cover off your concerns?
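
To illustrate, the entries would look roughly like this (abridged - the
M:/R:/L: lines are elided and the section titles are from memory, so
treat this as a sketch rather than the exact diff):

	EXEC & BINFMT API, ELF
	...
	F:	fs/exec.c
	F:	mm/vma_exec.c

	MEMORY MAPPING
	...
	F:	mm/vma.c
	F:	mm/vma_exec.c

get_maintainer.pl then CCs both sets of maintainers/reviewers for anything
touching mm/vma_exec.c.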

[snip]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem
  2025-04-28 10:46       ` Lorenzo Stoakes
@ 2025-04-28 20:29         ` Kees Cook
  0 siblings, 0 replies; 8+ messages in thread
From: Kees Cook @ 2025-04-28 20:29 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, David Hildenbrand, Alexander Viro,
	Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm,
	linux-fsdevel, linux-kernel



On April 28, 2025 3:46:06 AM PDT, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>On Mon, Apr 28, 2025 at 09:53:05AM +0100, Lorenzo Stoakes wrote:
>> On Fri, Apr 25, 2025 at 10:09:34AM -0700, Kees Cook wrote:
>> > On Fri, Apr 25, 2025 at 03:54:34PM +0100, Lorenzo Stoakes wrote:
>> > > There are peculiarities within the kernel where what is very clearly mm
>> > > code is performed elsewhere arbitrarily.
>> > >
>> > > This violates separation of concerns and makes it harder to refactor code
>> > > to make changes to how fundamental initialisation and operation of mm logic
>> > > is performed.
>> > >
>> > > One such case is the creation of the VMA containing the initial stack upon
>> > > execve()'ing a new process. This is currently performed in __bprm_mm_init()
>> > > in fs/exec.c.
>> > >
>> > > Abstract this operation to create_init_stack_vma(). This allows us to limit
>> > > use of vma allocation and free code to fork and mm only.
>> > >
>> > > We previously did the same for the step at which we relocate the initial
>> > > stack VMA downwards via relocate_vma_down(); now we move the initial VMA
>> > > establishment too.
>> > >
>> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> > > Acked-by: David Hildenbrand <david@redhat.com>
>> > > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>> > > ---
>> > >  fs/exec.c          | 51 +---------------------------------
>> >
>> > I'm kind of on the fence about this. On the one hand, yes, it's all vma
>> > goo, and should live with the rest of vma code, as you suggest. On the
>> > other hand, exec is the only consumer of this behavior, and moving it
>> > out of fs/exec.c means that changes to the code that specifically only
>> > impacts exec are now in a separate file, and will no longer get exec
>> > maintainer/reviewer CCs (based on MAINTAINERS file matching). Exec is
>> > notoriously fragile, so I'm kind of generally paranoid about changes to
>> > its behaviors going unnoticed.
>> >
>> > In defense of moving it, yes, this routine has gotten updates over the
>> > many years, but it's relatively stable. But at least one thing has gone in
>> > without exec maintainer review recently (I would have Acked it, but the
>> > point is review): 9e567ca45f ("mm/ksm: fix ksm exec support for prctl")
>> > Everything else was before I took on the role officially (Nov 2022).
>> >
>> > So I guess I'm asking, how do we make sure stuff pulled out of exec
>> > still gets exec maintainer review?
>>
>> I think we have two options here:
>>
>> 1. Separate out this code into mm/vma_exec.c and treat it like
>>    mm/vma_init.c, then add you as a reviewer, so you have visibility on
>>    everything that happens there.
>>
>
>Actually, (off-list) Vlastimil made the very good suggestion that we can
>just add this new file to both the exec and memory mapping MAINTAINERS
>sections - I've tested it and it works!
>
>So I think this should cover off your concerns?
>
>[snip]

Yes, this is brilliant! Totally works for me; thank you (and Vlastimil) for finding a good way to do it.

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-04-28 20:29 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-25 14:54 [PATCH v2 0/3] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes
2025-04-25 14:54 ` [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes
2025-04-25 17:09   ` Kees Cook
2025-04-28  8:53     ` Lorenzo Stoakes
2025-04-28 10:46       ` Lorenzo Stoakes
2025-04-28 20:29         ` Kees Cook
2025-04-25 14:54 ` [PATCH v2 2/3] mm: move dup_mmap() to mm Lorenzo Stoakes
2025-04-25 14:54 ` [PATCH v2 3/3] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).