* [PATCH 0/4] move all VMA allocation, freeing and duplication logic to mm
@ 2025-04-24 21:15 Lorenzo Stoakes
2025-04-24 21:15 ` [PATCH 1/4] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes
` (3 more replies)
0 siblings, 4 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-24 21:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
linux-kernel
Currently VMA allocation, freeing and duplication exist in kernel/fork.c,
which is a violation of separation of concerns, and leaves these functions
exposed to the rest of the kernel when they are in fact internal
implementation details.
Resolve this by moving this logic to mm, and making it internal to vma.c,
vma.h.
This also allows us, in future, to provide userland testing around this
functionality.
We additionally abstract dup_mmap() to mm, being careful to ensure
kernel/fork.c accesses this via the mm internal header so it is not
exposed elsewhere in the kernel.
As part of this change, we also abstract the initial stack allocation
performed in __bprm_mm_init() out of fs code into mm via
create_init_stack_vma(), as this code uses vm_area_alloc() and
vm_area_free().
Lorenzo Stoakes (4):
mm: abstract initial stack setup to mm subsystem
mm: perform VMA allocation, freeing, duplication in mm
mm: move dup_mmap() to mm
mm: move vm_area_alloc,dup,free() functions to vma.c
fs/exec.c | 51 +-----
include/linux/mm.h | 15 +-
kernel/fork.c | 277 +------------------------------
mm/debug_vm_pgtable.c | 2 +
mm/internal.h | 2 +
mm/mmap.c | 254 +++++++++++++++++++++++++++-
mm/nommu.c | 75 +++++++++
mm/vma.c | 89 ++++++++++
mm/vma.h | 8 +
tools/testing/vma/vma.c | 1 +
tools/testing/vma/vma_internal.h | 151 +++++++++++++----
11 files changed, 550 insertions(+), 375 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 1/4] mm: abstract initial stack setup to mm subsystem
2025-04-24 21:15 [PATCH 0/4] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes
@ 2025-04-24 21:15 ` Lorenzo Stoakes
2025-04-24 21:30 ` David Hildenbrand
2025-04-24 21:15 ` [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes
` (2 subsequent siblings)
3 siblings, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-24 21:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
linux-kernel
There are peculiarities within the kernel where what is very clearly mm
code is performed elsewhere arbitrarily.
This violates separation of concerns and makes it harder to refactor code
that changes how fundamental mm initialisation and operation are
performed.
One such case is the creation of the VMA containing the initial stack upon
execve()'ing a new process. This is currently performed in __bprm_mm_init()
in fs/exec.c.
Abstract this operation to create_init_stack_vma(). This allows us to limit
use of vma allocation and free code to fork and mm only.
We previously did the same for the step at which we relocate the initial
stack VMA downwards via relocate_vma_down(); now we move the initial VMA
establishment too.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
fs/exec.c | 51 +--------------------------------
include/linux/mm.h | 2 ++
mm/mmap.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 74 insertions(+), 50 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 8e4ea5f1e64c..ef34a68ef825 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -244,56 +244,7 @@ static void flush_arg_page(struct linux_binprm *bprm, unsigned long pos,
static int __bprm_mm_init(struct linux_binprm *bprm)
{
- int err;
- struct vm_area_struct *vma = NULL;
- struct mm_struct *mm = bprm->mm;
-
- bprm->vma = vma = vm_area_alloc(mm);
- if (!vma)
- return -ENOMEM;
- vma_set_anonymous(vma);
-
- if (mmap_write_lock_killable(mm)) {
- err = -EINTR;
- goto err_free;
- }
-
- /*
- * Need to be called with mmap write lock
- * held, to avoid race with ksmd.
- */
- err = ksm_execve(mm);
- if (err)
- goto err_ksm;
-
- /*
- * Place the stack at the largest stack address the architecture
- * supports. Later, we'll move this to an appropriate place. We don't
- * use STACK_TOP because that can depend on attributes which aren't
- * configured yet.
- */
- BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
- vma->vm_end = STACK_TOP_MAX;
- vma->vm_start = vma->vm_end - PAGE_SIZE;
- vm_flags_init(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP);
- vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
-
- err = insert_vm_struct(mm, vma);
- if (err)
- goto err;
-
- mm->stack_vm = mm->total_vm = 1;
- mmap_write_unlock(mm);
- bprm->p = vma->vm_end - sizeof(void *);
- return 0;
-err:
- ksm_exit(mm);
-err_ksm:
- mmap_write_unlock(mm);
-err_free:
- bprm->vma = NULL;
- vm_area_free(vma);
- return err;
+ return create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
}
static bool valid_arg_len(struct linux_binprm *bprm, long len)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9b701cfbef22..fa84e59a99bb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3223,6 +3223,8 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
extern void exit_mmap(struct mm_struct *);
+int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap,
+ unsigned long *top_mem_p);
int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift);
bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, bool write);
diff --git a/mm/mmap.c b/mm/mmap.c
index bd210aaf7ebd..1289c6381419 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1717,6 +1717,77 @@ static int __meminit init_reserve_notifier(void)
}
subsys_initcall(init_reserve_notifier);
+/*
+ * Establish the stack VMA in an execve'd process, located temporarily at the
+ * maximum stack address provided by the architecture.
+ *
+ * We later relocate this downwards in relocate_vma_down().
+ *
+ * This function is almost certainly NOT what you want for anything other than
+ * early executable initialisation.
+ *
+ * On success, returns 0 and sets *vmap to the stack VMA and *top_mem_p to the
+ * maximum addressable location in the stack (that is capable of storing a
+ * system word of data).
+ *
+ * On failure, returns an error code.
+ */
+int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap,
+ unsigned long *top_mem_p)
+{
+ int err;
+ struct vm_area_struct *vma = vm_area_alloc(mm);
+
+ if (!vma)
+ return -ENOMEM;
+
+ vma_set_anonymous(vma);
+
+ if (mmap_write_lock_killable(mm)) {
+ err = -EINTR;
+ goto err_free;
+ }
+
+ /*
+ * Need to be called with mmap write lock
+ * held, to avoid race with ksmd.
+ */
+ err = ksm_execve(mm);
+ if (err)
+ goto err_ksm;
+
+ /*
+ * Place the stack at the largest stack address the architecture
+ * supports. Later, we'll move this to an appropriate place. We don't
+ * use STACK_TOP because that can depend on attributes which aren't
+ * configured yet.
+ */
+ BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
+ vma->vm_end = STACK_TOP_MAX;
+ vma->vm_start = vma->vm_end - PAGE_SIZE;
+ vm_flags_init(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP);
+ vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
+
+ err = insert_vm_struct(mm, vma);
+ if (err)
+ goto err;
+
+ mm->stack_vm = mm->total_vm = 1;
+ mmap_write_unlock(mm);
+ *vmap = vma;
+ *top_mem_p = vma->vm_end - sizeof(void *);
+ return 0;
+
+err:
+ ksm_exit(mm);
+err_ksm:
+ mmap_write_unlock(mm);
+err_free:
+ *vmap = NULL;
+ vm_area_free(vma);
+ return err;
+}
+
/*
* Relocate a VMA downwards by shift bytes. There cannot be any VMAs between
* this VMA and its relocated range, which will now reside at [vma->vm_start -
--
2.49.0
* [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-24 21:15 [PATCH 0/4] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes
2025-04-24 21:15 ` [PATCH 1/4] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes
@ 2025-04-24 21:15 ` Lorenzo Stoakes
2025-04-24 21:22 ` David Hildenbrand
2025-04-25 3:15 ` Kees Cook
2025-04-24 21:15 ` [PATCH 3/4] mm: move dup_mmap() to mm Lorenzo Stoakes
2025-04-24 21:15 ` [PATCH 4/4] mm: move vm_area_alloc,dup,free() functions to vma.c Lorenzo Stoakes
3 siblings, 2 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-24 21:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
linux-kernel
Right now these are performed in kernel/fork.c, which is odd and a
violation of separation of concerns. It also prevents us from integrating
this and related logic into userland VMA testing going forward, and,
perhaps more importantly, from making VMA allocation/freeing a purely
internal mm operation in a subsequent commit.
There is a fly in the ointment: nommu. mmap.c is not compiled if
CONFIG_MMU is not set, and there is no sensible place to put these
functions outside of it, so we are forced to duplicate some logic here.
This isn't ideal, but it seems a worthwhile trade-off: nommu is a niche
use-case, by its nature it already duplicates a great deal of MMU logic,
and we can eliminate code that is not applicable to nommu.
The intent is to move all this logic to vma.c in a subsequent commit,
rendering VMA allocation, freeing and duplication mm-internal-only and
userland testable.
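The alloc/dup/free trio being moved follows one pattern: raw cache allocation, a field-wise copy of shared state, then re-initialisation of per-instance state. A compressed userland sketch of the dup step (stand-in struct, malloc in place of the slab cache; the `*_sketch` names are hypothetical):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct vma_sketch {
	unsigned long vm_start;
	unsigned long vm_end;
	void *anon_chain;	/* per-instance state that must not be shared */
};

static struct vma_sketch *vma_dup_sketch(const struct vma_sketch *orig)
{
	struct vma_sketch *new = malloc(sizeof(*new));

	if (!new)
		return NULL;

	/* Copy the shared fields wholesale, as vm_area_init_from() does... */
	memcpy(new, orig, sizeof(*new));

	/*
	 * ...then reinitialise anything the duplicate must own itself, as
	 * vm_area_dup() does for anon_vma_chain, the VMA lock and NUMA state.
	 */
	new->anon_chain = NULL;
	return new;
}
```

Because the pattern is self-contained apart from the cache pointer, both the MMU and nommu variants of the real functions can live beside their respective mmap_init() cache setup without touching kernel/fork.c.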
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
kernel/fork.c | 88 ---------------------------------------------------
mm/mmap.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
mm/nommu.c | 67 +++++++++++++++++++++++++++++++++++++++
3 files changed, 151 insertions(+), 88 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 168681fc4b25..ebddc51624ec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -428,88 +428,9 @@ struct kmem_cache *files_cachep;
/* SLAB cache for fs_struct structures (tsk->fs) */
struct kmem_cache *fs_cachep;
-/* SLAB cache for vm_area_struct structures */
-static struct kmem_cache *vm_area_cachep;
-
/* SLAB cache for mm_struct structures (tsk->mm) */
static struct kmem_cache *mm_cachep;
-struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
-{
- struct vm_area_struct *vma;
-
- vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
- if (!vma)
- return NULL;
-
- vma_init(vma, mm);
-
- return vma;
-}
-
-static void vm_area_init_from(const struct vm_area_struct *src,
- struct vm_area_struct *dest)
-{
- dest->vm_mm = src->vm_mm;
- dest->vm_ops = src->vm_ops;
- dest->vm_start = src->vm_start;
- dest->vm_end = src->vm_end;
- dest->anon_vma = src->anon_vma;
- dest->vm_pgoff = src->vm_pgoff;
- dest->vm_file = src->vm_file;
- dest->vm_private_data = src->vm_private_data;
- vm_flags_init(dest, src->vm_flags);
- memcpy(&dest->vm_page_prot, &src->vm_page_prot,
- sizeof(dest->vm_page_prot));
- /*
- * src->shared.rb may be modified concurrently when called from
- * dup_mmap(), but the clone will reinitialize it.
- */
- data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
- memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
- sizeof(dest->vm_userfaultfd_ctx));
-#ifdef CONFIG_ANON_VMA_NAME
- dest->anon_name = src->anon_name;
-#endif
-#ifdef CONFIG_SWAP
- memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
- sizeof(dest->swap_readahead_info));
-#endif
-#ifndef CONFIG_MMU
- dest->vm_region = src->vm_region;
-#endif
-#ifdef CONFIG_NUMA
- dest->vm_policy = src->vm_policy;
-#endif
-}
-
-struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
-{
- struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
-
- if (!new)
- return NULL;
-
- ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
- ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
- vm_area_init_from(orig, new);
- vma_lock_init(new, true);
- INIT_LIST_HEAD(&new->anon_vma_chain);
- vma_numab_state_init(new);
- dup_anon_vma_name(orig, new);
-
- return new;
-}
-
-void vm_area_free(struct vm_area_struct *vma)
-{
- /* The vma should be detached while being destroyed. */
- vma_assert_detached(vma);
- vma_numab_state_free(vma);
- free_anon_vma_name(vma);
- kmem_cache_free(vm_area_cachep, vma);
-}
-
static void account_kernel_stack(struct task_struct *tsk, int account)
{
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -3214,11 +3135,6 @@ void __init mm_cache_init(void)
void __init proc_caches_init(void)
{
- struct kmem_cache_args args = {
- .use_freeptr_offset = true,
- .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
- };
-
sighand_cachep = kmem_cache_create("sighand_cache",
sizeof(struct sighand_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -3235,10 +3151,6 @@ void __init proc_caches_init(void)
sizeof(struct fs_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
NULL);
- vm_area_cachep = kmem_cache_create("vm_area_struct",
- sizeof(struct vm_area_struct), &args,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
- SLAB_ACCOUNT);
mmap_init();
nsproxy_cache_init();
}
diff --git a/mm/mmap.c b/mm/mmap.c
index 1289c6381419..fbddcd082a93 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -77,6 +77,82 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
static bool ignore_rlimit_data;
core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
+/* SLAB cache for vm_area_struct structures */
+static struct kmem_cache *vm_area_cachep;
+
+struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
+{
+ struct vm_area_struct *vma;
+
+ vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!vma)
+ return NULL;
+
+ vma_init(vma, mm);
+
+ return vma;
+}
+
+static void vm_area_init_from(const struct vm_area_struct *src,
+ struct vm_area_struct *dest)
+{
+ dest->vm_mm = src->vm_mm;
+ dest->vm_ops = src->vm_ops;
+ dest->vm_start = src->vm_start;
+ dest->vm_end = src->vm_end;
+ dest->anon_vma = src->anon_vma;
+ dest->vm_pgoff = src->vm_pgoff;
+ dest->vm_file = src->vm_file;
+ dest->vm_private_data = src->vm_private_data;
+ vm_flags_init(dest, src->vm_flags);
+ memcpy(&dest->vm_page_prot, &src->vm_page_prot,
+ sizeof(dest->vm_page_prot));
+ /*
+ * src->shared.rb may be modified concurrently when called from
+ * dup_mmap(), but the clone will reinitialize it.
+ */
+ data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
+ memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
+ sizeof(dest->vm_userfaultfd_ctx));
+#ifdef CONFIG_ANON_VMA_NAME
+ dest->anon_name = src->anon_name;
+#endif
+#ifdef CONFIG_SWAP
+ memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
+ sizeof(dest->swap_readahead_info));
+#endif
+#ifdef CONFIG_NUMA
+ dest->vm_policy = src->vm_policy;
+#endif
+}
+
+struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
+{
+ struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+
+ if (!new)
+ return NULL;
+
+ ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
+ ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
+ vm_area_init_from(orig, new);
+ vma_lock_init(new, true);
+ INIT_LIST_HEAD(&new->anon_vma_chain);
+ vma_numab_state_init(new);
+ dup_anon_vma_name(orig, new);
+
+ return new;
+}
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+ /* The vma should be detached while being destroyed. */
+ vma_assert_detached(vma);
+ vma_numab_state_free(vma);
+ free_anon_vma_name(vma);
+ kmem_cache_free(vm_area_cachep, vma);
+}
+
/* Update vma->vm_page_prot to reflect vma->vm_flags. */
void vma_set_page_prot(struct vm_area_struct *vma)
{
@@ -1601,9 +1677,17 @@ static const struct ctl_table mmap_table[] = {
void __init mmap_init(void)
{
int ret;
+ struct kmem_cache_args args = {
+ .use_freeptr_offset = true,
+ .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+ };
ret = percpu_counter_init(&vm_committed_as, 0, GFP_KERNEL);
VM_BUG_ON(ret);
+ vm_area_cachep = kmem_cache_create("vm_area_struct",
+ sizeof(struct vm_area_struct), &args,
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
+ SLAB_ACCOUNT);
#ifdef CONFIG_SYSCTL
register_sysctl_init("vm", mmap_table);
#endif
diff --git a/mm/nommu.c b/mm/nommu.c
index 2b4d304c6445..2b3722266222 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -53,6 +53,65 @@ static struct kmem_cache *vm_region_jar;
struct rb_root nommu_region_tree = RB_ROOT;
DECLARE_RWSEM(nommu_region_sem);
+/* SLAB cache for vm_area_struct structures */
+static struct kmem_cache *vm_area_cachep;
+
+struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
+{
+ struct vm_area_struct *vma;
+
+ vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!vma)
+ return NULL;
+
+ vma_init(vma, mm);
+
+ return vma;
+}
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+ kmem_cache_free(vm_area_cachep, vma);
+}
+
+static void vm_area_init_from(const struct vm_area_struct *src,
+ struct vm_area_struct *dest)
+{
+ dest->vm_mm = src->vm_mm;
+ dest->vm_ops = src->vm_ops;
+ dest->vm_start = src->vm_start;
+ dest->vm_end = src->vm_end;
+ dest->anon_vma = src->anon_vma;
+ dest->vm_pgoff = src->vm_pgoff;
+ dest->vm_file = src->vm_file;
+ dest->vm_private_data = src->vm_private_data;
+ vm_flags_init(dest, src->vm_flags);
+ memcpy(&dest->vm_page_prot, &src->vm_page_prot,
+ sizeof(dest->vm_page_prot));
+ /*
+ * src->shared.rb may be modified concurrently when called from
+ * dup_mmap(), but the clone will reinitialize it.
+ */
+ data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
+
+ dest->vm_region = src->vm_region;
+}
+
+struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
+{
+ struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+
+ if (!new)
+ return NULL;
+
+ ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
+ ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
+ vm_area_init_from(orig, new);
+ INIT_LIST_HEAD(&new->anon_vma_chain);
+
+ return new;
+}
+
const struct vm_operations_struct generic_file_vm_ops = {
};
@@ -404,10 +463,18 @@ static const struct ctl_table nommu_table[] = {
void __init mmap_init(void)
{
int ret;
+ struct kmem_cache_args args = {
+ .use_freeptr_offset = true,
+ .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+ };
ret = percpu_counter_init(&vm_committed_as, 0, GFP_KERNEL);
VM_BUG_ON(ret);
vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC|SLAB_ACCOUNT);
+ vm_area_cachep = kmem_cache_create("vm_area_struct",
+ sizeof(struct vm_area_struct), &args,
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
+ SLAB_ACCOUNT);
register_sysctl_init("vm", nommu_table);
}
--
2.49.0
* [PATCH 3/4] mm: move dup_mmap() to mm
2025-04-24 21:15 [PATCH 0/4] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes
2025-04-24 21:15 ` [PATCH 1/4] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes
2025-04-24 21:15 ` [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes
@ 2025-04-24 21:15 ` Lorenzo Stoakes
2025-04-25 9:13 ` Pedro Falcato
2025-04-24 21:15 ` [PATCH 4/4] mm: move vm_area_alloc,dup,free() functions to vma.c Lorenzo Stoakes
3 siblings, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-24 21:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
linux-kernel
This is a key step in our being able to abstract and isolate VMA allocation
and destruction logic.
This function is the last one where vm_area_free() and vm_area_dup() are
directly referenced outside of mmap, so having this in mm allows us to
isolate these.
We do the same for the nommu version which is substantially simpler.
We place the declaration for dup_mmap() in mm/internal.h and have
kernel/fork.c import this in order to prevent improper use of this
functionality elsewhere in the kernel.
While we're here, we remove the useless #ifdef CONFIG_MMU check around
mmap_read_lock_maybe_expand() in mmap.c; mmap.c is compiled only if
CONFIG_MMU is set.
Suggested-by: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
kernel/fork.c | 189 ++------------------------------------------------
mm/internal.h | 2 +
mm/mmap.c | 181 +++++++++++++++++++++++++++++++++++++++++++++--
mm/nommu.c | 8 +++
4 files changed, 189 insertions(+), 191 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index ebddc51624ec..9e4616dacd82 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -112,6 +112,9 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+/* For dup_mmap(). */
+#include "../mm/internal.h"
+
#include <trace/events/sched.h>
#define CREATE_TRACE_POINTS
@@ -510,7 +513,7 @@ void free_task(struct task_struct *tsk)
}
EXPORT_SYMBOL(free_task);
-static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
+void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
{
struct file *exe_file;
@@ -525,183 +528,6 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
}
#ifdef CONFIG_MMU
-static __latent_entropy int dup_mmap(struct mm_struct *mm,
- struct mm_struct *oldmm)
-{
- struct vm_area_struct *mpnt, *tmp;
- int retval;
- unsigned long charge = 0;
- LIST_HEAD(uf);
- VMA_ITERATOR(vmi, mm, 0);
-
- if (mmap_write_lock_killable(oldmm))
- return -EINTR;
- flush_cache_dup_mm(oldmm);
- uprobe_dup_mmap(oldmm, mm);
- /*
- * Not linked in yet - no deadlock potential:
- */
- mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);
-
- /* No ordering required: file already has been exposed. */
- dup_mm_exe_file(mm, oldmm);
-
- mm->total_vm = oldmm->total_vm;
- mm->data_vm = oldmm->data_vm;
- mm->exec_vm = oldmm->exec_vm;
- mm->stack_vm = oldmm->stack_vm;
-
- /* Use __mt_dup() to efficiently build an identical maple tree. */
- retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
- if (unlikely(retval))
- goto out;
-
- mt_clear_in_rcu(vmi.mas.tree);
- for_each_vma(vmi, mpnt) {
- struct file *file;
-
- vma_start_write(mpnt);
- if (mpnt->vm_flags & VM_DONTCOPY) {
- retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
- mpnt->vm_end, GFP_KERNEL);
- if (retval)
- goto loop_out;
-
- vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
- continue;
- }
- charge = 0;
- /*
- * Don't duplicate many vmas if we've been oom-killed (for
- * example)
- */
- if (fatal_signal_pending(current)) {
- retval = -EINTR;
- goto loop_out;
- }
- if (mpnt->vm_flags & VM_ACCOUNT) {
- unsigned long len = vma_pages(mpnt);
-
- if (security_vm_enough_memory_mm(oldmm, len)) /* sic */
- goto fail_nomem;
- charge = len;
- }
- tmp = vm_area_dup(mpnt);
- if (!tmp)
- goto fail_nomem;
-
- /* track_pfn_copy() will later take care of copying internal state. */
- if (unlikely(tmp->vm_flags & VM_PFNMAP))
- untrack_pfn_clear(tmp);
-
- retval = vma_dup_policy(mpnt, tmp);
- if (retval)
- goto fail_nomem_policy;
- tmp->vm_mm = mm;
- retval = dup_userfaultfd(tmp, &uf);
- if (retval)
- goto fail_nomem_anon_vma_fork;
- if (tmp->vm_flags & VM_WIPEONFORK) {
- /*
- * VM_WIPEONFORK gets a clean slate in the child.
- * Don't prepare anon_vma until fault since we don't
- * copy page for current vma.
- */
- tmp->anon_vma = NULL;
- } else if (anon_vma_fork(tmp, mpnt))
- goto fail_nomem_anon_vma_fork;
- vm_flags_clear(tmp, VM_LOCKED_MASK);
- /*
- * Copy/update hugetlb private vma information.
- */
- if (is_vm_hugetlb_page(tmp))
- hugetlb_dup_vma_private(tmp);
-
- /*
- * Link the vma into the MT. After using __mt_dup(), memory
- * allocation is not necessary here, so it cannot fail.
- */
- vma_iter_bulk_store(&vmi, tmp);
-
- mm->map_count++;
-
- if (tmp->vm_ops && tmp->vm_ops->open)
- tmp->vm_ops->open(tmp);
-
- file = tmp->vm_file;
- if (file) {
- struct address_space *mapping = file->f_mapping;
-
- get_file(file);
- i_mmap_lock_write(mapping);
- if (vma_is_shared_maywrite(tmp))
- mapping_allow_writable(mapping);
- flush_dcache_mmap_lock(mapping);
- /* insert tmp into the share list, just after mpnt */
- vma_interval_tree_insert_after(tmp, mpnt,
- &mapping->i_mmap);
- flush_dcache_mmap_unlock(mapping);
- i_mmap_unlock_write(mapping);
- }
-
- if (!(tmp->vm_flags & VM_WIPEONFORK))
- retval = copy_page_range(tmp, mpnt);
-
- if (retval) {
- mpnt = vma_next(&vmi);
- goto loop_out;
- }
- }
- /* a new mm has just been created */
- retval = arch_dup_mmap(oldmm, mm);
-loop_out:
- vma_iter_free(&vmi);
- if (!retval) {
- mt_set_in_rcu(vmi.mas.tree);
- ksm_fork(mm, oldmm);
- khugepaged_fork(mm, oldmm);
- } else {
-
- /*
- * The entire maple tree has already been duplicated. If the
- * mmap duplication fails, mark the failure point with
- * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
- * stop releasing VMAs that have not been duplicated after this
- * point.
- */
- if (mpnt) {
- mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
- mas_store(&vmi.mas, XA_ZERO_ENTRY);
- /* Avoid OOM iterating a broken tree */
- set_bit(MMF_OOM_SKIP, &mm->flags);
- }
- /*
- * The mm_struct is going to exit, but the locks will be dropped
- * first. Set the mm_struct as unstable is advisable as it is
- * not fully initialised.
- */
- set_bit(MMF_UNSTABLE, &mm->flags);
- }
-out:
- mmap_write_unlock(mm);
- flush_tlb_mm(oldmm);
- mmap_write_unlock(oldmm);
- if (!retval)
- dup_userfaultfd_complete(&uf);
- else
- dup_userfaultfd_fail(&uf);
- return retval;
-
-fail_nomem_anon_vma_fork:
- mpol_put(vma_policy(tmp));
-fail_nomem_policy:
- vm_area_free(tmp);
-fail_nomem:
- retval = -ENOMEM;
- vm_unacct_memory(charge);
- goto loop_out;
-}
-
static inline int mm_alloc_pgd(struct mm_struct *mm)
{
mm->pgd = pgd_alloc(mm);
@@ -715,13 +541,6 @@ static inline void mm_free_pgd(struct mm_struct *mm)
pgd_free(mm, mm->pgd);
}
#else
-static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
-{
- mmap_write_lock(oldmm);
- dup_mm_exe_file(mm, oldmm);
- mmap_write_unlock(oldmm);
- return 0;
-}
#define mm_alloc_pgd(mm) (0)
#define mm_free_pgd(mm)
#endif /* CONFIG_MMU */
diff --git a/mm/internal.h b/mm/internal.h
index 838f840ded83..39067b3117a4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1630,5 +1630,7 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
}
#endif /* CONFIG_PT_RECLAIM */
+void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
+int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
#endif /* __MM_INTERNAL_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index fbddcd082a93..31d2aa690fcc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1955,7 +1955,6 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff);
}
-#ifdef CONFIG_MMU
/*
* Obtain a read lock on mm->mmap_lock, if the specified address is below the
* start of the VMA, the intent is to perform a write, and it is a
@@ -1999,10 +1998,180 @@ bool mmap_read_lock_maybe_expand(struct mm_struct *mm,
mmap_write_downgrade(mm);
return true;
}
-#else
-bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, bool write)
+
+__latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
{
- return false;
+ struct vm_area_struct *mpnt, *tmp;
+ int retval;
+ unsigned long charge = 0;
+ LIST_HEAD(uf);
+ VMA_ITERATOR(vmi, mm, 0);
+
+ if (mmap_write_lock_killable(oldmm))
+ return -EINTR;
+ flush_cache_dup_mm(oldmm);
+ uprobe_dup_mmap(oldmm, mm);
+ /*
+ * Not linked in yet - no deadlock potential:
+ */
+ mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);
+
+ /* No ordering required: file already has been exposed. */
+ dup_mm_exe_file(mm, oldmm);
+
+ mm->total_vm = oldmm->total_vm;
+ mm->data_vm = oldmm->data_vm;
+ mm->exec_vm = oldmm->exec_vm;
+ mm->stack_vm = oldmm->stack_vm;
+
+ /* Use __mt_dup() to efficiently build an identical maple tree. */
+ retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
+ if (unlikely(retval))
+ goto out;
+
+ mt_clear_in_rcu(vmi.mas.tree);
+ for_each_vma(vmi, mpnt) {
+ struct file *file;
+
+ vma_start_write(mpnt);
+ if (mpnt->vm_flags & VM_DONTCOPY) {
+ retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
+ mpnt->vm_end, GFP_KERNEL);
+ if (retval)
+ goto loop_out;
+
+ vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
+ continue;
+ }
+ charge = 0;
+ /*
+ * Don't duplicate many vmas if we've been oom-killed (for
+ * example)
+ */
+ if (fatal_signal_pending(current)) {
+ retval = -EINTR;
+ goto loop_out;
+ }
+ if (mpnt->vm_flags & VM_ACCOUNT) {
+ unsigned long len = vma_pages(mpnt);
+
+ if (security_vm_enough_memory_mm(oldmm, len)) /* sic */
+ goto fail_nomem;
+ charge = len;
+ }
+
+ tmp = vm_area_dup(mpnt);
+ if (!tmp)
+ goto fail_nomem;
+
+ /* track_pfn_copy() will later take care of copying internal state. */
+ if (unlikely(tmp->vm_flags & VM_PFNMAP))
+ untrack_pfn_clear(tmp);
+
+ retval = vma_dup_policy(mpnt, tmp);
+ if (retval)
+ goto fail_nomem_policy;
+ tmp->vm_mm = mm;
+ retval = dup_userfaultfd(tmp, &uf);
+ if (retval)
+ goto fail_nomem_anon_vma_fork;
+ if (tmp->vm_flags & VM_WIPEONFORK) {
+ /*
+ * VM_WIPEONFORK gets a clean slate in the child.
+ * Don't prepare anon_vma until fault since we don't
+ * copy page for current vma.
+ */
+ tmp->anon_vma = NULL;
+ } else if (anon_vma_fork(tmp, mpnt))
+ goto fail_nomem_anon_vma_fork;
+ vm_flags_clear(tmp, VM_LOCKED_MASK);
+ /*
+ * Copy/update hugetlb private vma information.
+ */
+ if (is_vm_hugetlb_page(tmp))
+ hugetlb_dup_vma_private(tmp);
+
+ /*
+ * Link the vma into the MT. After using __mt_dup(), memory
+ * allocation is not necessary here, so it cannot fail.
+ */
+ vma_iter_bulk_store(&vmi, tmp);
+
+ mm->map_count++;
+
+ if (tmp->vm_ops && tmp->vm_ops->open)
+ tmp->vm_ops->open(tmp);
+
+ file = tmp->vm_file;
+ if (file) {
+ struct address_space *mapping = file->f_mapping;
+
+ get_file(file);
+ i_mmap_lock_write(mapping);
+ if (vma_is_shared_maywrite(tmp))
+ mapping_allow_writable(mapping);
+ flush_dcache_mmap_lock(mapping);
+ /* insert tmp into the share list, just after mpnt */
+ vma_interval_tree_insert_after(tmp, mpnt,
+ &mapping->i_mmap);
+ flush_dcache_mmap_unlock(mapping);
+ i_mmap_unlock_write(mapping);
+ }
+
+ if (!(tmp->vm_flags & VM_WIPEONFORK))
+ retval = copy_page_range(tmp, mpnt);
+
+ if (retval) {
+ mpnt = vma_next(&vmi);
+ goto loop_out;
+ }
+ }
+ /* a new mm has just been created */
+ retval = arch_dup_mmap(oldmm, mm);
+loop_out:
+ vma_iter_free(&vmi);
+ if (!retval) {
+ mt_set_in_rcu(vmi.mas.tree);
+ ksm_fork(mm, oldmm);
+ khugepaged_fork(mm, oldmm);
+ } else {
+
+ /*
+ * The entire maple tree has already been duplicated. If the
+ * mmap duplication fails, mark the failure point with
+ * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
+ * stop releasing VMAs that have not been duplicated after this
+ * point.
+ */
+ if (mpnt) {
+ mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
+ mas_store(&vmi.mas, XA_ZERO_ENTRY);
+ /* Avoid OOM iterating a broken tree */
+ set_bit(MMF_OOM_SKIP, &mm->flags);
+ }
+ /*
+ * The mm_struct is going to exit, but the locks will be dropped
+ * first. Set the mm_struct as unstable is advisable as it is
+ * not fully initialised.
+ */
+ set_bit(MMF_UNSTABLE, &mm->flags);
+ }
+out:
+ mmap_write_unlock(mm);
+ flush_tlb_mm(oldmm);
+ mmap_write_unlock(oldmm);
+ if (!retval)
+ dup_userfaultfd_complete(&uf);
+ else
+ dup_userfaultfd_fail(&uf);
+ return retval;
+
+fail_nomem_anon_vma_fork:
+ mpol_put(vma_policy(tmp));
+fail_nomem_policy:
+ vm_area_free(tmp);
+fail_nomem:
+ retval = -ENOMEM;
+ vm_unacct_memory(charge);
+ goto loop_out;
}
-#endif
diff --git a/mm/nommu.c b/mm/nommu.c
index 2b3722266222..79a6b0460622 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1941,3 +1941,11 @@ static int __meminit init_admin_reserve(void)
return 0;
}
subsys_initcall(init_admin_reserve);
+
+int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+ mmap_write_lock(oldmm);
+ dup_mm_exe_file(mm, oldmm);
+ mmap_write_unlock(oldmm);
+ return 0;
+}
--
2.49.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 4/4] mm: move vm_area_alloc,dup,free() functions to vma.c
2025-04-24 21:15 [PATCH 0/4] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes
` (2 preceding siblings ...)
2025-04-24 21:15 ` [PATCH 3/4] mm: move dup_mmap() to mm Lorenzo Stoakes
@ 2025-04-24 21:15 ` Lorenzo Stoakes
3 siblings, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-24 21:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
linux-kernel
Place the core VMA allocation, duplication, and freeing functions in vma.c
where we can isolate them to mm only, providing proper encapsulation, and
have the ability to integrate userland tests.
The only users of these functions are now in mmap.c, vma.c or nommu.c; the
former two are not compiled for nommu, so we need not make any changes for
nommu here.
Update the tools/testing/vma/* testing code appropriately to keep the tests
passing.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
include/linux/mm.h | 13 ---
mm/debug_vm_pgtable.c | 2 +
mm/mmap.c | 86 +-----------------
mm/nommu.c | 2 +-
mm/vma.c | 89 ++++++++++++++++++
mm/vma.h | 8 ++
tools/testing/vma/vma.c | 1 +
tools/testing/vma/vma_internal.h | 151 ++++++++++++++++++++++++-------
8 files changed, 221 insertions(+), 131 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa84e59a99bb..683df06e4a18 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -222,19 +222,6 @@ static inline struct folio *lru_to_folio(struct list_head *head)
void setup_initial_init_mm(void *start_code, void *end_code,
void *end_data, void *brk);
-/*
- * Linux kernel virtual memory manager primitives.
- * The idea being to have a "virtual" mm in the same way
- * we have a virtual fs - giving a cleaner interface to the
- * mm details, and allowing different kinds of memory mappings
- * (from shared memory to executable loading to arbitrary
- * mmap() functions).
- */
-
-struct vm_area_struct *vm_area_alloc(struct mm_struct *);
-struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
-void vm_area_free(struct vm_area_struct *);
-
#ifndef CONFIG_MMU
extern struct rb_root nommu_region_tree;
extern struct rw_semaphore nommu_region_sem;
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 7731b238b534..556c3f204d1b 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -36,6 +36,8 @@
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
+#include "vma.h"
+
/*
* Please refer Documentation/mm/arch_pgtable_helpers.rst for the semantics
* expectations that are being validated here. All future changes in here
diff --git a/mm/mmap.c b/mm/mmap.c
index 31d2aa690fcc..0059a791cb7d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -77,82 +77,6 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
static bool ignore_rlimit_data;
core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
-/* SLAB cache for vm_area_struct structures */
-static struct kmem_cache *vm_area_cachep;
-
-struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
-{
- struct vm_area_struct *vma;
-
- vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
- if (!vma)
- return NULL;
-
- vma_init(vma, mm);
-
- return vma;
-}
-
-static void vm_area_init_from(const struct vm_area_struct *src,
- struct vm_area_struct *dest)
-{
- dest->vm_mm = src->vm_mm;
- dest->vm_ops = src->vm_ops;
- dest->vm_start = src->vm_start;
- dest->vm_end = src->vm_end;
- dest->anon_vma = src->anon_vma;
- dest->vm_pgoff = src->vm_pgoff;
- dest->vm_file = src->vm_file;
- dest->vm_private_data = src->vm_private_data;
- vm_flags_init(dest, src->vm_flags);
- memcpy(&dest->vm_page_prot, &src->vm_page_prot,
- sizeof(dest->vm_page_prot));
- /*
- * src->shared.rb may be modified concurrently when called from
- * dup_mmap(), but the clone will reinitialize it.
- */
- data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
- memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
- sizeof(dest->vm_userfaultfd_ctx));
-#ifdef CONFIG_ANON_VMA_NAME
- dest->anon_name = src->anon_name;
-#endif
-#ifdef CONFIG_SWAP
- memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
- sizeof(dest->swap_readahead_info));
-#endif
-#ifdef CONFIG_NUMA
- dest->vm_policy = src->vm_policy;
-#endif
-}
-
-struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
-{
- struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
-
- if (!new)
- return NULL;
-
- ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
- ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
- vm_area_init_from(orig, new);
- vma_lock_init(new, true);
- INIT_LIST_HEAD(&new->anon_vma_chain);
- vma_numab_state_init(new);
- dup_anon_vma_name(orig, new);
-
- return new;
-}
-
-void vm_area_free(struct vm_area_struct *vma)
-{
- /* The vma should be detached while being destroyed. */
- vma_assert_detached(vma);
- vma_numab_state_free(vma);
- free_anon_vma_name(vma);
- kmem_cache_free(vm_area_cachep, vma);
-}
-
/* Update vma->vm_page_prot to reflect vma->vm_flags. */
void vma_set_page_prot(struct vm_area_struct *vma)
{
@@ -1677,17 +1601,11 @@ static const struct ctl_table mmap_table[] = {
void __init mmap_init(void)
{
int ret;
- struct kmem_cache_args args = {
- .use_freeptr_offset = true,
- .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
- };
ret = percpu_counter_init(&vm_committed_as, 0, GFP_KERNEL);
VM_BUG_ON(ret);
- vm_area_cachep = kmem_cache_create("vm_area_struct",
- sizeof(struct vm_area_struct), &args,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
- SLAB_ACCOUNT);
+
+ vma_state_init();
#ifdef CONFIG_SYSCTL
register_sysctl_init("vm", mmap_table);
#endif
diff --git a/mm/nommu.c b/mm/nommu.c
index 79a6b0460622..7d1ced10e08b 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -97,7 +97,7 @@ static void vm_area_init_from(const struct vm_area_struct *src,
dest->vm_region = src->vm_region;
}
-struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
+static struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
{
struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
diff --git a/mm/vma.c b/mm/vma.c
index 8a6c5e835759..65507f12b8d3 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -57,6 +57,9 @@ struct mmap_state {
.state = VMA_MERGE_START, \
}
+/* SLAB cache for vm_area_struct structures */
+static struct kmem_cache *vm_area_cachep;
+
/*
* If, at any point, the VMA had unCoW'd mappings from parents, it will maintain
* more than one anon_vma_chain connecting it to more than one anon_vma. A merge
@@ -3052,3 +3055,89 @@ int __vm_munmap(unsigned long start, size_t len, bool unlock)
userfaultfd_unmap_complete(mm, &uf);
return ret;
}
+
+struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
+{
+ struct vm_area_struct *vma;
+
+ vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!vma)
+ return NULL;
+
+ vma_init(vma, mm);
+
+ return vma;
+}
+
+static void vm_area_init_from(const struct vm_area_struct *src,
+ struct vm_area_struct *dest)
+{
+ dest->vm_mm = src->vm_mm;
+ dest->vm_ops = src->vm_ops;
+ dest->vm_start = src->vm_start;
+ dest->vm_end = src->vm_end;
+ dest->anon_vma = src->anon_vma;
+ dest->vm_pgoff = src->vm_pgoff;
+ dest->vm_file = src->vm_file;
+ dest->vm_private_data = src->vm_private_data;
+ vm_flags_init(dest, src->vm_flags);
+ memcpy(&dest->vm_page_prot, &src->vm_page_prot,
+ sizeof(dest->vm_page_prot));
+ /*
+ * src->shared.rb may be modified concurrently when called from
+ * dup_mmap(), but the clone will reinitialize it.
+ */
+ data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
+ memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
+ sizeof(dest->vm_userfaultfd_ctx));
+#ifdef CONFIG_ANON_VMA_NAME
+ dest->anon_name = src->anon_name;
+#endif
+#ifdef CONFIG_SWAP
+ memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
+ sizeof(dest->swap_readahead_info));
+#endif
+#ifdef CONFIG_NUMA
+ dest->vm_policy = src->vm_policy;
+#endif
+}
+
+struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
+{
+ struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+
+ if (!new)
+ return NULL;
+
+ ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
+ ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
+ vm_area_init_from(orig, new);
+ vma_lock_init(new, true);
+ INIT_LIST_HEAD(&new->anon_vma_chain);
+ vma_numab_state_init(new);
+ dup_anon_vma_name(orig, new);
+
+ return new;
+}
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+ /* The vma should be detached while being destroyed. */
+ vma_assert_detached(vma);
+ vma_numab_state_free(vma);
+ free_anon_vma_name(vma);
+ kmem_cache_free(vm_area_cachep, vma);
+}
+
+void __init vma_state_init(void)
+{
+ struct kmem_cache_args args = {
+ .use_freeptr_offset = true,
+ .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+ };
+
+ vm_area_cachep = kmem_cache_create("vm_area_struct",
+ sizeof(struct vm_area_struct), &args,
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
+ SLAB_ACCOUNT);
+}
diff --git a/mm/vma.h b/mm/vma.h
index 149926e8a6d1..be8597a85b07 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -548,4 +548,12 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address);
int __vm_munmap(unsigned long start, size_t len, bool unlock);
+#ifdef CONFIG_MMU
+struct vm_area_struct *vm_area_alloc(struct mm_struct *);
+void vm_area_free(struct vm_area_struct *);
+struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig);
+#endif
+
+extern void __init vma_state_init(void);
+
#endif /* __MM_VMA_H */
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
index 7cfd6e31db10..9932ceb2e4b9 100644
--- a/tools/testing/vma/vma.c
+++ b/tools/testing/vma/vma.c
@@ -1667,6 +1667,7 @@ int main(void)
int num_tests = 0, num_fail = 0;
maple_tree_init();
+ vma_state_init();
#define TEST(name) \
do { \
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 572ab2cea763..fbd0412571fb 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -135,6 +135,10 @@ typedef __bitwise unsigned int vm_fault_t;
*/
#define pr_warn_once pr_err
+#define data_race(expr) expr
+
+#define ASSERT_EXCLUSIVE_WRITER(x)
+
struct kref {
refcount_t refcount;
};
@@ -235,6 +239,8 @@ struct file {
#define VMA_LOCK_OFFSET 0x40000000
+typedef struct { unsigned long v; } freeptr_t;
+
struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */
@@ -244,9 +250,7 @@ struct vm_area_struct {
unsigned long vm_start;
unsigned long vm_end;
};
-#ifdef CONFIG_PER_VMA_LOCK
- struct rcu_head vm_rcu; /* Used for deferred freeing. */
-#endif
+ freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
};
struct mm_struct *vm_mm; /* The address space we belong to. */
@@ -421,6 +425,65 @@ struct vm_unmapped_area_info {
unsigned long start_gap;
};
+struct kmem_cache_args {
+ /**
+ * @align: The required alignment for the objects.
+ *
+ * %0 means no specific alignment is requested.
+ */
+ unsigned int align;
+ /**
+ * @useroffset: Usercopy region offset.
+ *
+ * %0 is a valid offset, when @usersize is non-%0
+ */
+ unsigned int useroffset;
+ /**
+ * @usersize: Usercopy region size.
+ *
+ * %0 means no usercopy region is specified.
+ */
+ unsigned int usersize;
+ /**
+ * @freeptr_offset: Custom offset for the free pointer
+ * in &SLAB_TYPESAFE_BY_RCU caches
+ *
+ * By default &SLAB_TYPESAFE_BY_RCU caches place the free pointer
+ * outside of the object. This might cause the object to grow in size.
+ * Cache creators that have a reason to avoid this can specify a custom
+ * free pointer offset in their struct where the free pointer will be
+ * placed.
+ *
+ * Note that placing the free pointer inside the object requires the
+ * caller to ensure that no fields are invalidated that are required to
+ * guard against object recycling (See &SLAB_TYPESAFE_BY_RCU for
+ * details).
+ *
+ * Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset
+ * is specified, %use_freeptr_offset must be set %true.
+ *
+ * Note that @ctor currently isn't supported with custom free pointers
+ * as a @ctor requires an external free pointer.
+ */
+ unsigned int freeptr_offset;
+ /**
+ * @use_freeptr_offset: Whether a @freeptr_offset is used.
+ */
+ bool use_freeptr_offset;
+ /**
+ * @ctor: A constructor for the objects.
+ *
+ * The constructor is invoked for each object in a newly allocated slab
+ * page. It is the cache user's responsibility to free object in the
+ * same state as after calling the constructor, or deal appropriately
+ * with any differences between a freshly constructed and a reallocated
+ * object.
+ *
+ * %NULL means no constructor.
+ */
+ void (*ctor)(void *);
+};
+
static inline void vma_iter_invalidate(struct vma_iterator *vmi)
{
mas_pause(&vmi->mas);
@@ -496,40 +559,39 @@ extern const struct vm_operations_struct vma_dummy_vm_ops;
extern unsigned long rlimit(unsigned int limit);
-static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
-{
- memset(vma, 0, sizeof(*vma));
- vma->vm_mm = mm;
- vma->vm_ops = &vma_dummy_vm_ops;
- INIT_LIST_HEAD(&vma->anon_vma_chain);
- vma->vm_lock_seq = UINT_MAX;
-}
-static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
-{
- struct vm_area_struct *vma = calloc(1, sizeof(struct vm_area_struct));
+struct kmem_cache {
+ const char *name;
+ size_t object_size;
+ struct kmem_cache_args *args;
+};
- if (!vma)
- return NULL;
+static inline struct kmem_cache *__kmem_cache_create(const char *name,
+ size_t object_size,
+ struct kmem_cache_args *args)
+{
+ struct kmem_cache *ret = malloc(sizeof(struct kmem_cache));
- vma_init(vma, mm);
+ ret->name = name;
+ ret->object_size = object_size;
+ ret->args = args;
- return vma;
+ return ret;
}
-static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
-{
- struct vm_area_struct *new = calloc(1, sizeof(struct vm_area_struct));
+#define kmem_cache_create(__name, __object_size, __args, ...) \
+ __kmem_cache_create((__name), (__object_size), (__args))
- if (!new)
- return NULL;
+static inline void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+{
+ (void)gfpflags;
- memcpy(new, orig, sizeof(*new));
- refcount_set(&new->vm_refcnt, 0);
- new->vm_lock_seq = UINT_MAX;
- INIT_LIST_HEAD(&new->anon_vma_chain);
+ return calloc(s->object_size, 1);
+}
- return new;
+static inline void kmem_cache_free(struct kmem_cache *s, void *x)
+{
+ free(x);
}
/*
@@ -696,11 +758,6 @@ static inline void mpol_put(struct mempolicy *)
{
}
-static inline void vm_area_free(struct vm_area_struct *vma)
-{
- free(vma);
-}
-
static inline void lru_add_drain(void)
{
}
@@ -1240,4 +1297,32 @@ static inline int mapping_map_writable(struct address_space *mapping)
return 0;
}
+static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
+{
+ (void)vma;
+ (void)reset_refcnt;
+}
+
+static inline void vma_numab_state_init(struct vm_area_struct *vma)
+{
+ (void)vma;
+}
+
+static inline void vma_numab_state_free(struct vm_area_struct *vma)
+{
+ (void)vma;
+}
+
+static inline void dup_anon_vma_name(struct vm_area_struct *orig_vma,
+ struct vm_area_struct *new_vma)
+{
+ (void)orig_vma;
+ (void)new_vma;
+}
+
+static inline void free_anon_vma_name(struct vm_area_struct *vma)
+{
+ (void)vma;
+}
+
#endif /* __MM_VMA_INTERNAL_H */
--
2.49.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
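The userland test shim added to vma_internal.h above replaces the kernel slab allocator with plain libc allocation. A simplified stand-alone rendering of that pattern follows; it is a sketch of the shim's approach, compilable outside the tree, and drops the kmem_cache_args plumbing the real shim keeps so the kernel-side kmem_cache_create() call type-checks.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Minimal stand-in for the kernel's kmem_cache, as in the test shim:
 * record the object size, then service allocations via calloc() so
 * objects come back zeroed, and ignore slab flags entirely.
 */
struct kmem_cache {
	const char *name;
	size_t object_size;
};

static struct kmem_cache *kmem_cache_create(const char *name, size_t size)
{
	struct kmem_cache *s = malloc(sizeof(*s));

	if (!s)
		return NULL;
	s->name = name;
	s->object_size = size;
	return s;
}

static void *kmem_cache_alloc(struct kmem_cache *s)
{
	return calloc(1, s->object_size); /* zeroed, like the shim */
}

static void kmem_cache_free(struct kmem_cache *s, void *x)
{
	(void)s;
	free(x);
}
```

Flags such as SLAB_TYPESAFE_BY_RCU and the freeptr_offset machinery have no userland equivalent here, which is why the shim can get away with free() while the kernel cache cannot.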
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-24 21:15 ` [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes
@ 2025-04-24 21:22 ` David Hildenbrand
2025-04-25 1:22 ` Suren Baghdasaryan
2025-04-25 10:17 ` Lorenzo Stoakes
2025-04-25 3:15 ` Kees Cook
1 sibling, 2 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-04-24 21:22 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara,
Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel
On 24.04.25 23:15, Lorenzo Stoakes wrote:
> Right now these are performed in kernel/fork.c which is odd and a violation
> of separation of concerns, as well as preventing us from integrating this
> and related logic into userland VMA testing going forward, and perhaps more
> importantly - enabling us to, in a subsequent commit, make VMA
> allocation/freeing a purely internal mm operation.
>
> There is a fly in the ointment - nommu - mmap.c is not compiled if
> CONFIG_MMU is not set, and there is no sensible place to put these outside
> of that, so we are put in the position of having to duplication some logic
> here.
>
> This isn't ideal, but since nommu is a niche use-case, already duplicates a
> great deal of mmu logic by its nature and we can eliminate code that is not
> applicable to nommu, it seems a worthwhile trade-off.
>
> The intent is to move all this logic to vma.c in a subsequent commit,
> rendering VMA allocation, freeing and duplication mm-internal-only and
> userland testable.
I'm pretty sure you tried it, but what's the big blocker to have patch
#3 first, so we can avoid the temporary move of the code to mmap.c ?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/4] mm: abstract initial stack setup to mm subsystem
2025-04-24 21:15 ` [PATCH 1/4] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes
@ 2025-04-24 21:30 ` David Hildenbrand
2025-04-25 0:55 ` Suren Baghdasaryan
2025-04-25 10:11 ` Lorenzo Stoakes
0 siblings, 2 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-04-24 21:30 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara,
Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel
On 24.04.25 23:15, Lorenzo Stoakes wrote:
> There are peculiarities within the kernel where what is very clearly mm
> code is performed elsewhere arbitrarily.
>
> This violates separation of concerns and makes it harder to refactor code
> to make changes to how fundamental initialisation and operation of mm logic
> is performed.
>
> One such case is the creation of the VMA containing the initial stack upon
> execve()'ing a new process. This is currently performed in __bprm_mm_init()
> in fs/exec.c.
>
> Abstract this operation to create_init_stack_vma(). This allows us to limit
> use of vma allocation and free code to fork and mm only.
>
> We previously did the same for the step at which we relocate the initial
> stack VMA downwards via relocate_vma_down(), now we move the initial VMA
> establishment too.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
...
> +/*
> + * Establish the stack VMA in an execve'd process, located temporarily at the
> + * maximum stack address provided by the architecture.
> + *
> + * We later relocate this downwards in relocate_vma_down().
> + *
> + * This function is almost certainly NOT what you want for anything other than
> + * early executable initialisation.
> + *
> + * On success, returns 0 and sets *vmap to the stack VMA and *top_mem_p to the
> + * maximum addressable location in the stack (that is capable of storing a
> + * system word of data).
> + *
> + * on failure, returns an error code.
> + */
I was about to say, if you already write that much documentation, why
not turn it into kerneldoc? :) But this function is clearly not intended
to have more than one caller, so ... :)
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/4] mm: abstract initial stack setup to mm subsystem
2025-04-24 21:30 ` David Hildenbrand
@ 2025-04-25 0:55 ` Suren Baghdasaryan
2025-04-25 10:10 ` Lorenzo Stoakes
2025-04-25 10:11 ` Lorenzo Stoakes
1 sibling, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-04-25 0:55 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, Pedro Falcato, Kees Cook, Alexander Viro,
Christian Brauner, Jan Kara, linux-mm, linux-fsdevel,
linux-kernel
On Thu, Apr 24, 2025 at 2:30 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 24.04.25 23:15, Lorenzo Stoakes wrote:
> > There are peculiarities within the kernel where what is very clearly mm
> > code is performed elsewhere arbitrarily.
> >
> > This violates separation of concerns and makes it harder to refactor code
> > to make changes to how fundamental initialisation and operation of mm logic
> > is performed.
> >
> > One such case is the creation of the VMA containing the initial stack upon
> > execve()'ing a new process. This is currently performed in __bprm_mm_init()
> > in fs/exec.c.
> >
> > Abstract this operation to create_init_stack_vma(). This allows us to limit
> > use of vma allocation and free code to fork and mm only.
> >
> > We previously did the same for the step at which we relocate the initial
> > stack VMA downwards via relocate_vma_down(), now we move the initial VMA
> > establishment too.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> ...
>
> > +/*
> > + * Establish the stack VMA in an execve'd process, located temporarily at the
> > + * maximum stack address provided by the architecture.
> > + *
> > + * We later relocate this downwards in relocate_vma_down().
> > + *
> > + * This function is almost certainly NOT what you want for anything other than
> > + * early executable initialisation.
> > + *
> > + * On success, returns 0 and sets *vmap to the stack VMA and *top_mem_p to the
> > + * maximum addressable location in the stack (that is capable of storing a
> > + * system word of data).
> > + *
> > + * on failure, returns an error code.
nit: s/on/On
You could also skip this sentence altogether since it's kinda obvious
but up to you.
> > + */
>
> I was about to say, if you already write that much documentation, why
> not turn it into kerneldoc? :) But this function is clearly not intended
> to have more than one caller, so ... :)
>
> Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-24 21:22 ` David Hildenbrand
@ 2025-04-25 1:22 ` Suren Baghdasaryan
2025-04-25 1:37 ` Suren Baghdasaryan
2025-04-25 10:09 ` Lorenzo Stoakes
2025-04-25 10:17 ` Lorenzo Stoakes
1 sibling, 2 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-04-25 1:22 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, Pedro Falcato, Kees Cook, Alexander Viro,
Christian Brauner, Jan Kara, linux-mm, linux-fsdevel,
linux-kernel
On Thu, Apr 24, 2025 at 2:22 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 24.04.25 23:15, Lorenzo Stoakes wrote:
> > Right now these are performed in kernel/fork.c which is odd and a violation
> > of separation of concerns, as well as preventing us from integrating this
> > and related logic into userland VMA testing going forward, and perhaps more
> > importantly - enabling us to, in a subsequent commit, make VMA
> > allocation/freeing a purely internal mm operation.
> >
> > There is a fly in the ointment - nommu - mmap.c is not compiled if
> > CONFIG_MMU is not set, and there is no sensible place to put these outside
> > of that, so we are put in the position of having to duplication some logic
s/to duplication/to duplicate
> > here.
> >
> > This isn't ideal, but since nommu is a niche use-case, already duplicates a
> > great deal of mmu logic by its nature and we can eliminate code that is not
> > applicable to nommu, it seems a worthwhile trade-off.
> >
> > The intent is to move all this logic to vma.c in a subsequent commit,
> > rendering VMA allocation, freeing and duplication mm-internal-only and
> > userland testable.
>
> I'm pretty sure you tried it, but what's the big blocker to have patch
> #3 first, so we can avoid the temporary move of the code to mmap.c ?
Completely agree with David.
I peeked into 4/4 and it seems you want to keep vma.c completely
CONFIG_MMU-centric. I know we treat NOMMU as an unwanted child but
IMHO it would be much cleaner to move these functions into vma.c from
the beginning and have an #ifdef CONFIG_MMU there like this:
mm/vma.c
/* Functions identical for MMU/NOMMU */
struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) {...}
void __init vma_state_init(void) {...}
#ifdef CONFIG_MMU
static void vm_area_init_from(const struct vm_area_struct *src,
struct vm_area_struct *dest) {...}
struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) {...}
void vm_area_free(struct vm_area_struct *vma) {...}
#else /* CONFIG_MMU */
static void vm_area_init_from(const struct vm_area_struct *src,
struct vm_area_struct *dest) {...}
struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) {...}
void vm_area_free(struct vm_area_struct *vma) {...}
#endif /* CONFIG_MMU */
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 1:22 ` Suren Baghdasaryan
@ 2025-04-25 1:37 ` Suren Baghdasaryan
2025-04-25 10:10 ` Lorenzo Stoakes
2025-04-25 10:09 ` Lorenzo Stoakes
1 sibling, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-04-25 1:37 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, Pedro Falcato, Kees Cook, Alexander Viro,
Christian Brauner, Jan Kara, linux-mm, linux-fsdevel,
linux-kernel
On Thu, Apr 24, 2025 at 6:22 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Apr 24, 2025 at 2:22 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 24.04.25 23:15, Lorenzo Stoakes wrote:
> > > Right now these are performed in kernel/fork.c which is odd and a violation
> > > of separation of concerns, as well as preventing us from integrating this
> > > and related logic into userland VMA testing going forward, and perhaps more
> > > importantly - enabling us to, in a subsequent commit, make VMA
> > > allocation/freeing a purely internal mm operation.
> > >
> > > There is a fly in the ointment - nommu - mmap.c is not compiled if
> > > CONFIG_MMU is not set, and there is no sensible place to put these outside
> > > of that, so we are put in the position of having to duplication some logic
>
> s/to duplication/to duplicate
>
> > > here.
> > >
> > > This isn't ideal, but since nommu is a niche use-case, already duplicates a
> > > great deal of mmu logic by its nature and we can eliminate code that is not
> > > applicable to nommu, it seems a worthwhile trade-off.
> > >
> > > The intent is to move all this logic to vma.c in a subsequent commit,
> > > rendering VMA allocation, freeing and duplication mm-internal-only and
> > > userland testable.
> >
> > I'm pretty sure you tried it, but what's the big blocker to have patch
> > #3 first, so we can avoid the temporary move of the code to mmap.c ?
>
> Completely agree with David.
> I peeked into 4/4 and it seems you want to keep vma.c completely
> CONFIG_MMU-centric. I know we treat NOMMU as an unwanted child but
> IMHO it would be much cleaner to move these functions into vma.c from
> the beginning and have an #ifdef CONFIG_MMU there like this:
>
> mm/vma.c
>
> /* Functions identical for MMU/NOMMU */
> struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) {...}
> void __init vma_state_init(void) {...}
>
> #ifdef CONFIG_MMU
> static void vm_area_init_from(const struct vm_area_struct *src,
> struct vm_area_struct *dest) {...}
> struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) {...}
> void vm_area_free(struct vm_area_struct *vma) {...}
> #else /* CONFIG_MMU */
> static void vm_area_init_from(const struct vm_area_struct *src,
> struct vm_area_struct *dest) {...}
> struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) {...}
> void vm_area_free(struct vm_area_struct *vma) {...}
> #endif /* CONFIG_MMU */
3/4 and 4/4 look reasonable but they can change substantially
depending on your answer to my suggestion above, so I'll wait for your
answer before moving forward.
Thanks for doing this!
Suren.
>
>
>
>
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-24 21:15 ` [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes
2025-04-24 21:22 ` David Hildenbrand
@ 2025-04-25 3:15 ` Kees Cook
2025-04-25 10:40 ` Lorenzo Stoakes
1 sibling, 1 reply; 31+ messages in thread
From: Kees Cook @ 2025-04-25 3:15 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
David Hildenbrand, Alexander Viro, Christian Brauner, Jan Kara,
Suren Baghdasaryan, linux-mm, linux-fsdevel, linux-kernel
On April 24, 2025 2:15:27 PM PDT, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>+static void vm_area_init_from(const struct vm_area_struct *src,
>+ struct vm_area_struct *dest)
>+{
>+ dest->vm_mm = src->vm_mm;
>+ dest->vm_ops = src->vm_ops;
>+ dest->vm_start = src->vm_start;
>+ dest->vm_end = src->vm_end;
>+ dest->anon_vma = src->anon_vma;
>+ dest->vm_pgoff = src->vm_pgoff;
>+ dest->vm_file = src->vm_file;
>+ dest->vm_private_data = src->vm_private_data;
>+ vm_flags_init(dest, src->vm_flags);
>+ memcpy(&dest->vm_page_prot, &src->vm_page_prot,
>+ sizeof(dest->vm_page_prot));
>+ /*
>+ * src->shared.rb may be modified concurrently when called from
>+ * dup_mmap(), but the clone will reinitialize it.
>+ */
>+ data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
>+ memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
>+ sizeof(dest->vm_userfaultfd_ctx));
>+#ifdef CONFIG_ANON_VMA_NAME
>+ dest->anon_name = src->anon_name;
>+#endif
>+#ifdef CONFIG_SWAP
>+ memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
>+ sizeof(dest->swap_readahead_info));
>+#endif
>+#ifdef CONFIG_NUMA
>+ dest->vm_policy = src->vm_policy;
>+#endif
>+}
I know you're doing a big cut/paste here, but why in the world is this function written this way? Why not just:
*dest = *src;
And then do any one-off cleanups?
--
Kees Cook
^ permalink raw reply [flat|nested] 31+ messages in thread
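Kees's suggestion — copy the whole struct in one assignment, then fix up the handful of fields that must not be inherited — can be contrasted with the field-by-field style in a stand-alone sketch. The struct below is hypothetical, not vm_area_struct; one plausible reason the kernel keeps the field-by-field form is that some destination fields (e.g. a reference count or lock that may be observed concurrently under SLAB_TYPESAFE_BY_RCU) must never be overwritten, which `*dest = *src;` would do — but the thread itself leaves that question open.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA-like object. */
struct area {
	unsigned long start, end;
	unsigned long flags;
	int refcnt;	/* must NOT be inherited by a duplicate */
	void *chain;	/* must be reinitialised, not shared */
};

/* Field-by-field style, as in vm_area_init_from(): only the fields
 * that should transfer are copied; the rest are simply never touched. */
static void area_init_from(const struct area *src, struct area *dest)
{
	dest->start = src->start;
	dest->end = src->end;
	dest->flags = src->flags;
	/* refcnt and chain deliberately not copied */
}

/* Whole-struct copy, then one-off cleanups, as Kees suggests. */
static void area_dup_whole(const struct area *src, struct area *dest)
{
	*dest = *src;		/* copies every field at once */
	dest->refcnt = 0;	/* fixups for non-inherited state */
	dest->chain = NULL;
}
```

The whole-struct form is shorter and cannot silently miss a newly added field, while the explicit form makes every transferred field reviewable and lets individual copies be annotated (e.g. with data_race()).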
* Re: [PATCH 3/4] mm: move dup_mmap() to mm
2025-04-24 21:15 ` [PATCH 3/4] mm: move dup_mmap() to mm Lorenzo Stoakes
@ 2025-04-25 9:13 ` Pedro Falcato
2025-04-25 10:18 ` Lorenzo Stoakes
0 siblings, 1 reply; 31+ messages in thread
From: Pedro Falcato @ 2025-04-25 9:13 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
linux-kernel
On Thu, Apr 24, 2025 at 10:15:28PM +0100, Lorenzo Stoakes wrote:
> This is a key step in our being able to abstract and isolate VMA allocation
> and destruction logic.
>
> This function is the last one where vm_area_free() and vm_area_dup() are
> directly referenced outside of mmap, so having this in mm allows us to
> isolate these.
>
> We do the same for the nommu version which is substantially simpler.
>
> We place the declaration for dup_mmap() in mm/internal.h and have
> kernel/fork.c import this in order to prevent improper use of this
> functionality elsewhere in the kernel.
>
> While we're here, we remove the useless #ifdef CONFIG_MMU check around
> mmap_read_lock_maybe_expand() in mmap.c, mmap.c is compiled only if
> CONFIG_MMU is set.
>
> Suggested-by: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Have I told you how awesome you are? Thank you so much for the series!
--
Pedro
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 1:22 ` Suren Baghdasaryan
2025-04-25 1:37 ` Suren Baghdasaryan
@ 2025-04-25 10:09 ` Lorenzo Stoakes
2025-04-25 10:26 ` Liam R. Howlett
1 sibling, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 10:09 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: David Hildenbrand, Andrew Morton, Liam R . Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Kees Cook,
Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
On Thu, Apr 24, 2025 at 06:22:30PM -0700, Suren Baghdasaryan wrote:
> On Thu, Apr 24, 2025 at 2:22 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 24.04.25 23:15, Lorenzo Stoakes wrote:
> > > Right now these are performed in kernel/fork.c which is odd and a violation
> > > of separation of concerns, as well as preventing us from integrating this
> > > and related logic into userland VMA testing going forward, and perhaps more
> > > importantly - enabling us to, in a subsequent commit, make VMA
> > > allocation/freeing a purely internal mm operation.
> > >
> > > There is a fly in the ointment - nommu - mmap.c is not compiled if
> > > CONFIG_MMU is not set, and there is no sensible place to put these outside
> > > of that, so we are put in the position of having to duplication some logic
>
> s/to duplication/to duplicate
Ack will fix!
>
> > > here.
> > >
> > > This isn't ideal, but since nommu is a niche use-case, already duplicates a
> > > great deal of mmu logic by its nature and we can eliminate code that is not
> > > applicable to nommu, it seems a worthwhile trade-off.
> > >
> > > The intent is to move all this logic to vma.c in a subsequent commit,
> > > rendering VMA allocation, freeing and duplication mm-internal-only and
> > > userland testable.
> >
> > I'm pretty sure you tried it, but what's the big blocker to have patch
> > #3 first, so we can avoid the temporary move of the code to mmap.c ?
>
> Completely agree with David.
Ack! Yes this was a little bit of a silly one :P
> I peeked into 4/4 and it seems you want to keep vma.c completely
> CONFIG_MMU-centric. I know we treat NOMMU as an unwanted child but
> IMHO it would be much cleaner to move these functions into vma.c from
> the beginning and have an #ifdef CONFIG_MMU there like this:
>
> mm/vma.c
This isn't really workable, as the _entire file_ basically contains
CONFIG_MMU-specific stuff, so it'd be one huge #ifdef CONFIG_MMU block with
one small #else block. It'd also be asking for bugs and issues in nommu.
I think doing it this way fits the patterns we have established for
nommu/mmap separation, and I would say nommu is enough of a niche edge case
for us to really not want to have to go to great lengths to find ways of
sharing code.
I am quite concerned about us having to consider it and deal with issues
around it so often, so want to try to avoid that as much as we can,
ideally.
>
> /* Functions identical for MMU/NOMMU */
> struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) {...}
> void __init vma_state_init(void) {...}
>
> #ifdef CONFIG_MMU
> static void vm_area_init_from(const struct vm_area_struct *src,
> struct vm_area_struct *dest) {...}
> struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) {...}
> void vm_area_free(struct vm_area_struct *vma) {...}
> #else /* CONFIG_MMU */
> static void vm_area_init_from(const struct vm_area_struct *src,
> struct vm_area_struct *dest) {...}
> struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) {...}
> void vm_area_free(struct vm_area_struct *vma) {...}
> #endif /* CONFIG_MMU */
>
>
>
>
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 1:37 ` Suren Baghdasaryan
@ 2025-04-25 10:10 ` Lorenzo Stoakes
2025-04-25 11:04 ` Lorenzo Stoakes
0 siblings, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 10:10 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: David Hildenbrand, Andrew Morton, Liam R . Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Kees Cook,
Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
On Thu, Apr 24, 2025 at 06:37:39PM -0700, Suren Baghdasaryan wrote:
> On Thu, Apr 24, 2025 at 6:22 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Apr 24, 2025 at 2:22 PM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 24.04.25 23:15, Lorenzo Stoakes wrote:
> > > > Right now these are performed in kernel/fork.c which is odd and a violation
> > > > of separation of concerns, as well as preventing us from integrating this
> > > > and related logic into userland VMA testing going forward, and perhaps more
> > > > importantly - enabling us to, in a subsequent commit, make VMA
> > > > allocation/freeing a purely internal mm operation.
> > > >
> > > > There is a fly in the ointment - nommu - mmap.c is not compiled if
> > > > CONFIG_MMU is not set, and there is no sensible place to put these outside
> > > > of that, so we are put in the position of having to duplication some logic
> >
> > s/to duplication/to duplicate
> >
> > > > here.
> > > >
> > > > This isn't ideal, but since nommu is a niche use-case, already duplicates a
> > > > great deal of mmu logic by its nature and we can eliminate code that is not
> > > > applicable to nommu, it seems a worthwhile trade-off.
> > > >
> > > > The intent is to move all this logic to vma.c in a subsequent commit,
> > > > rendering VMA allocation, freeing and duplication mm-internal-only and
> > > > userland testable.
> > >
> > > I'm pretty sure you tried it, but what's the big blocker to have patch
> > > #3 first, so we can avoid the temporary move of the code to mmap.c ?
> >
> > Completely agree with David.
> > I peeked into 4/4 and it seems you want to keep vma.c completely
> > CONFIG_MMU-centric. I know we treat NOMMU as an unwanted child but
> > IMHO it would be much cleaner to move these functions into vma.c from
> > the beginning and have an #ifdef CONFIG_MMU there like this:
> >
> > mm/vma.c
> >
> > /* Functions identical for MMU/NOMMU */
> > struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) {...}
> > void __init vma_state_init(void) {...}
> >
> > #ifdef CONFIG_MMU
> > static void vm_area_init_from(const struct vm_area_struct *src,
> > struct vm_area_struct *dest) {...}
> > struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) {...}
> > void vm_area_free(struct vm_area_struct *vma) {...}
> > #else /* CONFIG_MMU */
> > static void vm_area_init_from(const struct vm_area_struct *src,
> > struct vm_area_struct *dest) {...}
> > struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) {...}
> > void vm_area_free(struct vm_area_struct *vma) {...}
> > #endif /* CONFIG_MMU */
>
> 3/4 and 4/4 look reasonable but they can change substantially
> depending on your answer to my suggestion above, so I'll wait for your
> answer before moving forward.
> Thanks for doing this!
> Suren.
You're welcome :)
Well I will be fixing the issue David raised of course :) but as stated in
previous email, I don't feel it makes sense to put nommu stuff in vma.c really.
>
> >
> >
> >
> >
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
* Re: [PATCH 1/4] mm: abstract initial stack setup to mm subsystem
2025-04-25 0:55 ` Suren Baghdasaryan
@ 2025-04-25 10:10 ` Lorenzo Stoakes
0 siblings, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 10:10 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: David Hildenbrand, Andrew Morton, Liam R . Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Kees Cook,
Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
On Thu, Apr 24, 2025 at 05:55:20PM -0700, Suren Baghdasaryan wrote:
> On Thu, Apr 24, 2025 at 2:30 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 24.04.25 23:15, Lorenzo Stoakes wrote:
> > > There are peculiarities within the kernel where what is very clearly mm
> > > code is performed elsewhere arbitrarily.
> > >
> > > This violates separation of concerns and makes it harder to refactor code
> > > to make changes to how fundamental initialisation and operation of mm logic
> > > is performed.
> > >
> > > One such case is the creation of the VMA containing the initial stack upon
> > > execve()'ing a new process. This is currently performed in __bprm_mm_init()
> > > in fs/exec.c.
> > >
> > > Abstract this operation to create_init_stack_vma(). This allows us to limit
> > > use of vma allocation and free code to fork and mm only.
> > >
> > > We previously did the same for the step at which we relocate the initial
> > > stack VMA downwards via relocate_vma_down(), now we move the initial VMA
> > > establishment too.
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > ---
> > ...
> >
> > > +/*
> > > + * Establish the stack VMA in an execve'd process, located temporarily at the
> > > + * maximum stack address provided by the architecture.
> > > + *
> > > + * We later relocate this downwards in relocate_vma_down().
> > > + *
> > > + * This function is almost certainly NOT what you want for anything other than
> > > + * early executable initialisation.
> > > + *
> > > + * On success, returns 0 and sets *vmap to the stack VMA and *top_mem_p to the
> > > + * maximum addressable location in the stack (that is capable of storing a
> > > + * system word of data).
> > > + *
> > > + * on failure, returns an error code.
>
> nit: s/on/On
> You could also skip this sentence altogether since it's kinda obvious
> but up to you.
Ack, and yeah probably best to just drop tbh :)
>
> > > + */
> >
> > I was about to say, if you already write that much documentation, why
> > not turn it into kerneldoc? :) But this function is clearly not intended
> > to have more than one caller, so ... :)
> >
> > Acked-by: David Hildenbrand <david@redhat.com>
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
* Re: [PATCH 1/4] mm: abstract initial stack setup to mm subsystem
2025-04-24 21:30 ` David Hildenbrand
2025-04-25 0:55 ` Suren Baghdasaryan
@ 2025-04-25 10:11 ` Lorenzo Stoakes
1 sibling, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 10:11 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
linux-kernel
On Thu, Apr 24, 2025 at 11:30:35PM +0200, David Hildenbrand wrote:
> On 24.04.25 23:15, Lorenzo Stoakes wrote:
> > There are peculiarities within the kernel where what is very clearly mm
> > code is performed elsewhere arbitrarily.
> >
> > This violates separation of concerns and makes it harder to refactor code
> > to make changes to how fundamental initialisation and operation of mm logic
> > is performed.
> >
> > One such case is the creation of the VMA containing the initial stack upon
> > execve()'ing a new process. This is currently performed in __bprm_mm_init()
> > in fs/exec.c.
> >
> > Abstract this operation to create_init_stack_vma(). This allows us to limit
> > use of vma allocation and free code to fork and mm only.
> >
> > We previously did the same for the step at which we relocate the initial
> > stack VMA downwards via relocate_vma_down(), now we move the initial VMA
> > establishment too.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> ...
>
> > +/*
> > + * Establish the stack VMA in an execve'd process, located temporarily at the
> > + * maximum stack address provided by the architecture.
> > + *
> > + * We later relocate this downwards in relocate_vma_down().
> > + *
> > + * This function is almost certainly NOT what you want for anything other than
> > + * early executable initialisation.
> > + *
> > + * On success, returns 0 and sets *vmap to the stack VMA and *top_mem_p to the
> > + * maximum addressable location in the stack (that is capable of storing a
> > + * system word of data).
> > + *
> > + * on failure, returns an error code.
> > + */
>
> I was about to say, if you already write that much documentation, why not
> turn it into kerneldoc? :) But this function is clearly not intended to have
> more than one caller, so ... :)
Haha yeah, I felt for this case it's probably not necessary, bit of a blurry
line on this but as a one-off thing probably ok :P
>
> Acked-by: David Hildenbrand <david@redhat.com>
Thanks! Sorry I forgot to say thanks also to Suren for his tag in other email,
so will say here - also thanks Suren :)
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-24 21:22 ` David Hildenbrand
2025-04-25 1:22 ` Suren Baghdasaryan
@ 2025-04-25 10:17 ` Lorenzo Stoakes
1 sibling, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 10:17 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
linux-kernel
On Thu, Apr 24, 2025 at 11:22:26PM +0200, David Hildenbrand wrote:
> On 24.04.25 23:15, Lorenzo Stoakes wrote:
> > Right now these are performed in kernel/fork.c which is odd and a violation
> > of separation of concerns, as well as preventing us from integrating this
> > and related logic into userland VMA testing going forward, and perhaps more
> > importantly - enabling us to, in a subsequent commit, make VMA
> > allocation/freeing a purely internal mm operation.
> >
> > There is a fly in the ointment - nommu - mmap.c is not compiled if
> > CONFIG_MMU is not set, and there is no sensible place to put these outside
> > of that, so we are put in the position of having to duplication some logic
> > here.
> >
> > This isn't ideal, but since nommu is a niche use-case, already duplicates a
> > great deal of mmu logic by its nature and we can eliminate code that is not
> > applicable to nommu, it seems a worthwhile trade-off.
> >
> > The intent is to move all this logic to vma.c in a subsequent commit,
> > rendering VMA allocation, freeing and duplication mm-internal-only and
> > userland testable.
>
> I'm pretty sure you tried it, but what's the big blocker to have patch #3
> first, so we can avoid the temporary move of the code to mmap.c ?
You know, I didn't :P it was late etc. ;)
But you're right, will rejig accordingly!
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [PATCH 3/4] mm: move dup_mmap() to mm
2025-04-25 9:13 ` Pedro Falcato
@ 2025-04-25 10:18 ` Lorenzo Stoakes
0 siblings, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 10:18 UTC (permalink / raw)
To: Pedro Falcato
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
David Hildenbrand, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
linux-kernel
On Fri, Apr 25, 2025 at 10:13:35AM +0100, Pedro Falcato wrote:
> On Thu, Apr 24, 2025 at 10:15:28PM +0100, Lorenzo Stoakes wrote:
> > This is a key step in our being able to abstract and isolate VMA allocation
> > and destruction logic.
> >
> > This function is the last one where vm_area_free() and vm_area_dup() are
> > directly referenced outside of mmap, so having this in mm allows us to
> > isolate these.
> >
> > We do the same for the nommu version which is substantially simpler.
> >
> > We place the declaration for dup_mmap() in mm/internal.h and have
> > kernel/fork.c import this in order to prevent improper use of this
> > functionality elsewhere in the kernel.
> >
> > While we're here, we remove the useless #ifdef CONFIG_MMU check around
> > mmap_read_lock_maybe_expand() in mmap.c, mmap.c is compiled only if
> > CONFIG_MMU is set.
> >
> > Suggested-by: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
>
> Have I told you how awesome you are? Thank you so much for the series!
:) Thanks! Some much-needed positivity this week :)
>
> --
> Pedro
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 10:09 ` Lorenzo Stoakes
@ 2025-04-25 10:26 ` Liam R. Howlett
2025-04-25 10:31 ` Lorenzo Stoakes
0 siblings, 1 reply; 31+ messages in thread
From: Liam R. Howlett @ 2025-04-25 10:26 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Suren Baghdasaryan, David Hildenbrand, Andrew Morton,
Vlastimil Babka, Jann Horn, Pedro Falcato, Kees Cook,
Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250425 06:09]:
> On Thu, Apr 24, 2025 at 06:22:30PM -0700, Suren Baghdasaryan wrote:
> > On Thu, Apr 24, 2025 at 2:22 PM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 24.04.25 23:15, Lorenzo Stoakes wrote:
> > > > Right now these are performed in kernel/fork.c which is odd and a violation
> > > > of separation of concerns, as well as preventing us from integrating this
> > > > and related logic into userland VMA testing going forward, and perhaps more
> > > > importantly - enabling us to, in a subsequent commit, make VMA
> > > > allocation/freeing a purely internal mm operation.
> > > >
> > > > There is a fly in the ointment - nommu - mmap.c is not compiled if
> > > > CONFIG_MMU is not set, and there is no sensible place to put these outside
> > > > of that, so we are put in the position of having to duplication some logic
> >
> > s/to duplication/to duplicate
>
> Ack will fix!
>
> >
> > > > here.
> > > >
> > > > This isn't ideal, but since nommu is a niche use-case, already duplicates a
> > > > great deal of mmu logic by its nature and we can eliminate code that is not
> > > > applicable to nommu, it seems a worthwhile trade-off.
> > > >
> > > > The intent is to move all this logic to vma.c in a subsequent commit,
> > > > rendering VMA allocation, freeing and duplication mm-internal-only and
> > > > userland testable.
> > >
> > > I'm pretty sure you tried it, but what's the big blocker to have patch
> > > #3 first, so we can avoid the temporary move of the code to mmap.c ?
> >
> > Completely agree with David.
>
> Ack! Yes this was a little bit of a silly one :P
>
> > I peeked into 4/4 and it seems you want to keep vma.c completely
> > CONFIG_MMU-centric. I know we treat NOMMU as an unwanted child but
> > IMHO it would be much cleaner to move these functions into vma.c from
> > the beginning and have an #ifdef CONFIG_MMU there like this:
> >
> > mm/vma.c
>
> This isn't really workable, as the _entire file_ basically contains
> CONFIG_MMU-specific stuff. so it'd be one huge #ifdef CONFIG_MMU block with
> one small #else block. It'd also be asking for bugs and issues in nommu.
>
> I think doing it this way fits the patterns we have established for
> nommu/mmap separation, and I would say nommu is enough of a niche edge case
> for us to really not want to have to go to great lengths to find ways of
> sharing code.
>
> I am quite concerned about us having to consider it and deal with issues
> around it so often, so want to try to avoid that as much as we can,
> ideally.
I think you're asking for more issues the way you have it now. It could
be a very long time until someone sees that nommu isn't working,
probably an entire stable kernel cycle. Basically the longest time it
can go before being deemed unnecessary to fix.
It could also be worse, it could end up like the arch code with bugs
over a decade old not being noticed because it was forked off into
another file.
Could we create another file for the small section of common code and
achieve your goals?
Thanks,
Liam
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 10:26 ` Liam R. Howlett
@ 2025-04-25 10:31 ` Lorenzo Stoakes
2025-04-25 10:45 ` Lorenzo Stoakes
0 siblings, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 10:31 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, David Hildenbrand,
Andrew Morton, Vlastimil Babka, Jann Horn, Pedro Falcato,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
On Fri, Apr 25, 2025 at 06:26:00AM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250425 06:09]:
> > On Thu, Apr 24, 2025 at 06:22:30PM -0700, Suren Baghdasaryan wrote:
> > > On Thu, Apr 24, 2025 at 2:22 PM David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > On 24.04.25 23:15, Lorenzo Stoakes wrote:
> > > > > Right now these are performed in kernel/fork.c which is odd and a violation
> > > > > of separation of concerns, as well as preventing us from integrating this
> > > > > and related logic into userland VMA testing going forward, and perhaps more
> > > > > importantly - enabling us to, in a subsequent commit, make VMA
> > > > > allocation/freeing a purely internal mm operation.
> > > > >
> > > > > There is a fly in the ointment - nommu - mmap.c is not compiled if
> > > > > CONFIG_MMU is not set, and there is no sensible place to put these outside
> > > > > of that, so we are put in the position of having to duplication some logic
> > >
> > > s/to duplication/to duplicate
> >
> > Ack will fix!
> >
> > >
> > > > > here.
> > > > >
> > > > > This isn't ideal, but since nommu is a niche use-case, already duplicates a
> > > > > great deal of mmu logic by its nature and we can eliminate code that is not
> > > > > applicable to nommu, it seems a worthwhile trade-off.
> > > > >
> > > > > The intent is to move all this logic to vma.c in a subsequent commit,
> > > > > rendering VMA allocation, freeing and duplication mm-internal-only and
> > > > > userland testable.
> > > >
> > > > I'm pretty sure you tried it, but what's the big blocker to have patch
> > > > #3 first, so we can avoid the temporary move of the code to mmap.c ?
> > >
> > > Completely agree with David.
> >
> > Ack! Yes this was a little bit of a silly one :P
> >
> > > I peeked into 4/4 and it seems you want to keep vma.c completely
> > > CONFIG_MMU-centric. I know we treat NOMMU as an unwanted child but
> > > IMHO it would be much cleaner to move these functions into vma.c from
> > > the beginning and have an #ifdef CONFIG_MMU there like this:
> > >
> > > mm/vma.c
> >
> > This isn't really workable, as the _entire file_ basically contains
> > CONFIG_MMU-specific stuff. so it'd be one huge #ifdef CONFIG_MMU block with
> > one small #else block. It'd also be asking for bugs and issues in nommu.
> >
> > I think doing it this way fits the patterns we have established for
> > nommu/mmap separation, and I would say nommu is enough of a niche edge case
> > for us to really not want to have to go to great lengths to find ways of
> > sharing code.
> >
> > I am quite concerned about us having to consider it and deal with issues
> > around it so often, so want to try to avoid that as much as we can,
> > ideally.
>
> I think you're asking for more issues the way you have it now. It could
> be a very long time until someone sees that nommu isn't working,
> probably an entire stable kernel cycle. Basically the longest time it
> can go before being deemed unnecessary to fix.
>
> It could also be worse, it could end up like the arch code with bugs
> over a decade old not being noticed because it was forked off into
> another file.
>
> Could we create another file for the small section of common code and
> achieve your goals?
That'd completely defeat the purpose of isolating core functions to vma.c.
Again, I don't believe that bending over backwards to support this niche
use is appropriate.
And if you're making a change to vm_area_alloc(), vm_area_free(),
vm_area_init_from(), vm_area_dup() it'd seem like an oversight not to check
nommu right?
There's already a large amount of duplicated logic there specific to nommu
for which precisely the same could be said, including entirely parallel
brk(), mmap() implementations.
So this isn't a change in how we handle nommu.
>
> Thanks,
> Liam
>
>
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 3:15 ` Kees Cook
@ 2025-04-25 10:40 ` Lorenzo Stoakes
2025-04-25 10:53 ` Pedro Falcato
2025-04-25 13:54 ` Liam R. Howlett
0 siblings, 2 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 10:40 UTC (permalink / raw)
To: Kees Cook
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, David Hildenbrand, Alexander Viro,
Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm,
linux-fsdevel, linux-kernel
On Thu, Apr 24, 2025 at 08:15:26PM -0700, Kees Cook wrote:
>
>
> On April 24, 2025 2:15:27 PM PDT, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >+static void vm_area_init_from(const struct vm_area_struct *src,
> >+ struct vm_area_struct *dest)
> >+{
> >+ dest->vm_mm = src->vm_mm;
> >+ dest->vm_ops = src->vm_ops;
> >+ dest->vm_start = src->vm_start;
> >+ dest->vm_end = src->vm_end;
> >+ dest->anon_vma = src->anon_vma;
> >+ dest->vm_pgoff = src->vm_pgoff;
> >+ dest->vm_file = src->vm_file;
> >+ dest->vm_private_data = src->vm_private_data;
> >+ vm_flags_init(dest, src->vm_flags);
> >+ memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> >+ sizeof(dest->vm_page_prot));
> >+ /*
> >+ * src->shared.rb may be modified concurrently when called from
> >+ * dup_mmap(), but the clone will reinitialize it.
> >+ */
> >+ data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> >+ memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> >+ sizeof(dest->vm_userfaultfd_ctx));
> >+#ifdef CONFIG_ANON_VMA_NAME
> >+ dest->anon_name = src->anon_name;
> >+#endif
> >+#ifdef CONFIG_SWAP
> >+ memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> >+ sizeof(dest->swap_readahead_info));
> >+#endif
> >+#ifdef CONFIG_NUMA
> >+ dest->vm_policy = src->vm_policy;
> >+#endif
> >+}
>
> I know you're doing a big cut/paste here, but why in the world is this function written this way? Why not just:
>
> *dest = *src;
>
> And then do any one-off cleanups?
Yup I find it odd, and error prone to be honest. We'll end up with uninitialised
state for some fields if we miss them here, seems unwise...
Presumably for performance?
This is, as you say, me simply propagating what exists, but I do wonder.
>
>
> --
> Kees Cook
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 10:31 ` Lorenzo Stoakes
@ 2025-04-25 10:45 ` Lorenzo Stoakes
2025-04-25 11:00 ` Liam R. Howlett
0 siblings, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 10:45 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, David Hildenbrand,
Andrew Morton, Vlastimil Babka, Jann Horn, Pedro Falcato,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
On Fri, Apr 25, 2025 at 11:31:12AM +0100, Lorenzo Stoakes wrote:
> On Fri, Apr 25, 2025 at 06:26:00AM -0400, Liam R. Howlett wrote:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250425 06:09]:
> > > On Thu, Apr 24, 2025 at 06:22:30PM -0700, Suren Baghdasaryan wrote:
> > > > On Thu, Apr 24, 2025 at 2:22 PM David Hildenbrand <david@redhat.com> wrote:
> > > > >
> > > > > On 24.04.25 23:15, Lorenzo Stoakes wrote:
> > > > > > Right now these are performed in kernel/fork.c which is odd and a violation
> > > > > > of separation of concerns, as well as preventing us from integrating this
> > > > > > and related logic into userland VMA testing going forward, and perhaps more
> > > > > > importantly - enabling us to, in a subsequent commit, make VMA
> > > > > > allocation/freeing a purely internal mm operation.
> > > > > >
> > > > > > There is a fly in the ointment - nommu - mmap.c is not compiled if
> > > > > > CONFIG_MMU is not set, and there is no sensible place to put these outside
> > > > > > of that, so we are put in the position of having to duplication some logic
> > > >
> > > > s/to duplication/to duplicate
> > >
> > > Ack will fix!
> > >
> > > >
> > > > > > here.
> > > > > >
> > > > > > This isn't ideal, but since nommu is a niche use-case, already duplicates a
> > > > > > great deal of mmu logic by its nature and we can eliminate code that is not
> > > > > > applicable to nommu, it seems a worthwhile trade-off.
> > > > > >
> > > > > > The intent is to move all this logic to vma.c in a subsequent commit,
> > > > > > rendering VMA allocation, freeing and duplication mm-internal-only and
> > > > > > userland testable.
> > > > >
> > > > > I'm pretty sure you tried it, but what's the big blocker to have patch
> > > > > #3 first, so we can avoid the temporary move of the code to mmap.c ?
> > > >
> > > > Completely agree with David.
> > >
> > > Ack! Yes this was a little bit of a silly one :P
> > >
> > > > I peeked into 4/4 and it seems you want to keep vma.c completely
> > > > CONFIG_MMU-centric. I know we treat NOMMU as an unwanted child but
> > > > IMHO it would be much cleaner to move these functions into vma.c from
> > > > the beginning and have an #ifdef CONFIG_MMU there like this:
> > > >
> > > > mm/vma.c
> > >
> > > This isn't really workable, as the _entire file_ basically contains
> > > CONFIG_MMU-specific stuff. so it'd be one huge #ifdef CONFIG_MMU block with
> > > one small #else block. It'd also be asking for bugs and issues in nommu.
> > >
> > > I think doing it this way fits the patterns we have established for
> > > nommu/mmap separation, and I would say nommu is enough of a niche edge case
> > > for us to really not want to have to go to great lengths to find ways of
> > > sharing code.
> > >
> > > I am quite concerned about us having to consider it and deal with issues
> > > around it so often, so want to try to avoid that as much as we can,
> > > ideally.
> >
> > I think you're asking for more issues the way you have it now. It could
> > be a very long time until someone sees that nommu isn't working,
> > probably an entire stable kernel cycle. Basically the longest time it
> > can go before being deemed unnecessary to fix.
> >
> > It could also be worse, it could end up like the arch code with bugs
> > over a decade old not being noticed because it was forked off into
> > another file.
> >
> > Could we create another file for the small section of common code and
> > achieve your goals?
>
> That'd completely defeat the purpose of isolating core functions to vma.c.
>
> Again, I don't believe that bending over backwards to support this niche
> use is appropriate.
>
> And if you're making a change to vm_area_alloc(), vm_area_free(),
> vm_area_init_from(), vm_area_dup() it'd seem like an oversight not to check
> nommu right?
>
> There's already a large amount of duplicated logic there specific to nommu
> for which precisely the same could be said, including entirely parallel
> brk(), mmap() implementations.
>
> So this isn't a change in how we handle nommu.
I guess an alternative is to introduce a new vma_shared.c, vma_shared.h
pair of files here, that we try to allow userland isolation for, so vma.c
can still use them for userland testing.
This then aligns with your requirement, and keeps it vma-centric like
Suren's suggestion.
Or perhaps it could even be vma_init.c, vma_init.h? To denote that it
references the initialisation and allocation, etc. of VMAs?
Any way we do that, we share it across all, and it solves all
problems... gives us the isolation for userland testing and also isolation
in mm, while also ensuring no code duplication with nommu.
That work?
>
> >
> > Thanks,
> > Liam
> >
> >
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 10:40 ` Lorenzo Stoakes
@ 2025-04-25 10:53 ` Pedro Falcato
2025-04-25 13:54 ` Liam R. Howlett
1 sibling, 0 replies; 31+ messages in thread
From: Pedro Falcato @ 2025-04-25 10:53 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Kees Cook, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, David Hildenbrand, Alexander Viro, Christian Brauner,
Jan Kara, Suren Baghdasaryan, linux-mm, linux-fsdevel,
linux-kernel
On Fri, Apr 25, 2025 at 11:40:00AM +0100, Lorenzo Stoakes wrote:
> On Thu, Apr 24, 2025 at 08:15:26PM -0700, Kees Cook wrote:
> >
> >
> > On April 24, 2025 2:15:27 PM PDT, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >+static void vm_area_init_from(const struct vm_area_struct *src,
> > >+ struct vm_area_struct *dest)
> > >+{
> > >+ dest->vm_mm = src->vm_mm;
> > >+ dest->vm_ops = src->vm_ops;
> > >+ dest->vm_start = src->vm_start;
> > >+ dest->vm_end = src->vm_end;
> > >+ dest->anon_vma = src->anon_vma;
> > >+ dest->vm_pgoff = src->vm_pgoff;
> > >+ dest->vm_file = src->vm_file;
> > >+ dest->vm_private_data = src->vm_private_data;
> > >+ vm_flags_init(dest, src->vm_flags);
> > >+ memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > >+ sizeof(dest->vm_page_prot));
> > >+ /*
> > >+ * src->shared.rb may be modified concurrently when called from
> > >+ * dup_mmap(), but the clone will reinitialize it.
> > >+ */
> > >+ data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> > >+ memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > >+ sizeof(dest->vm_userfaultfd_ctx));
> > >+#ifdef CONFIG_ANON_VMA_NAME
> > >+ dest->anon_name = src->anon_name;
> > >+#endif
> > >+#ifdef CONFIG_SWAP
> > >+ memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > >+ sizeof(dest->swap_readahead_info));
> > >+#endif
> > >+#ifdef CONFIG_NUMA
> > >+ dest->vm_policy = src->vm_policy;
> > >+#endif
> > >+}
> >
> > I know you're doing a big cut/paste here, but why in the world is this function written this way? Why not just:
> >
> > *dest = *src;
> >
> > And then do any one-off cleanups?
>
> Yup I find it odd, and error prone to be honest. We'll end up with uninitialised
> state for some fields if we miss them here, seems unwise...
>
> Presumably for performance?
>
> This is, as you say, me simply propagating what exists, but I do wonder.
There's a particular advantage here: KMSAN will light up in all sorts of ways
if you forget to copy something explicitly, instead of silently working but also
possibly being silently broken.
Anyway, it came from here: https://lore.kernel.org/all/CAJuCfpFO3Hj+7f10e0Pnvf0U7-dHeYgvjK+4AFD8V=kmG4JA=w@mail.gmail.com/
--
Pedro
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 10:45 ` Lorenzo Stoakes
@ 2025-04-25 11:00 ` Liam R. Howlett
2025-04-25 11:03 ` Lorenzo Stoakes
0 siblings, 1 reply; 31+ messages in thread
From: Liam R. Howlett @ 2025-04-25 11:00 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Suren Baghdasaryan, David Hildenbrand, Andrew Morton,
Vlastimil Babka, Jann Horn, Pedro Falcato, Kees Cook,
Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250425 06:45]:
...
> > > >
> > > > I think doing it this way fits the patterns we have established for
> > > > nommu/mmap separation, and I would say nommu is enough of a niche edge case
> > > > for us to really not want to have to go to great lengths to find ways of
> > > > sharing code.
> > > >
> > > > I am quite concerned about us having to consider it and deal with issues
> > > > around it so often, so want to try to avoid that as much as we can,
> > > > ideally.
> > >
> > > I think you're asking for more issues the way you have it now. It could
> > > be a very long time until someone sees that nommu isn't working,
> > > probably an entire stable kernel cycle. Basically the longest time it
> > > can go before being deemed unnecessary to fix.
> > >
> > > It could also be worse, it could end up like the arch code with bugs
> > > over a decade old not being noticed because it was forked off into
> > > another file.
> > >
> > > Could we create another file for the small section of common code and
> > > achieve your goals?
> >
> > That'd completely defeat the purpose of isolating core functions to vma.c.
> >
> > Again, I don't believe that bending over backwards to support this niche
> > use is appropriate.
> >
> > And if you're making a change to vm_area_alloc(), vm_area_free(),
> > vm_area_init_from(), vm_area_dup() it'd seem like an oversight not to check
> > nommu right?
> >
> > There's already a large amount of duplicated logic there specific to nommu
> > for which precisely the same could be said, including entirely parallel
> > brk(), mmap() implementations.
> >
> > So this isn't a change in how we handle nommu.
>
> I guess an alternative is to introduce a new vma_shared.c, vma_shared.h
> pair of files here, that we try to allow userland isolation for, so vma.c
> can still use them for userland testing.
>
> This then aligns with your requirement, and keeps it vma-centric like
> Suren's suggestion.
>
> Or perhaps it could even be vma_init.c, vma_init.h? To denote that it
> references the initialisation and allocation, etc. of VMAs?
Sure, the name isn't as important as the concept, but I like vma_init
better than vma_shared.
>
> Any way we do that, we share it across all, and it solves all
> problems... gives us the isolation for userland testing and also isolation
> in mm, while also ensuring no code duplication with nommu.
>
> That work?
Yes, this is what I was suggesting.
I really think this is the least painful way to deal with nommu.
Otherwise we will waste more time later trying to fix what was
overlooked.
Thanks,
Liam
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 11:00 ` Liam R. Howlett
@ 2025-04-25 11:03 ` Lorenzo Stoakes
0 siblings, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 11:03 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, David Hildenbrand,
Andrew Morton, Vlastimil Babka, Jann Horn, Pedro Falcato,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
On Fri, Apr 25, 2025 at 07:00:27AM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250425 06:45]:
> ...
>
> > > > >
> > > > > I think doing it this way fits the patterns we have established for
> > > > > nommu/mmap separation, and I would say nommu is enough of a niche edge case
> > > > > for us to really not want to have to go to great lengths to find ways of
> > > > > sharing code.
> > > > >
> > > > > I am quite concerned about us having to consider it and deal with issues
> > > > > around it so often, so want to try to avoid that as much as we can,
> > > > > ideally.
> > > >
> > > > I think you're asking for more issues the way you have it now. It could
> > > > be a very long time until someone sees that nommu isn't working,
> > > > probably an entire stable kernel cycle. Basically the longest time it
> > > > can go before being deemed unnecessary to fix.
> > > >
> > > > It could also be worse, it could end up like the arch code with bugs
> > > > over a decade old not being noticed because it was forked off into
> > > > another file.
> > > >
> > > > Could we create another file for the small section of common code and
> > > > achieve your goals?
> > >
> > > That'd completely defeat the purpose of isolating core functions to vma.c.
> > >
> > > Again, I don't believe that bending over backwards to support this niche
> > > use is appropriate.
> > >
> > > And if you're making a change to vm_area_alloc(), vm_area_free(),
> > > vm_area_init_from(), vm_area_dup() it'd seem like an oversight not to check
> > > nommu right?
> > >
> > > There's already a large amount of duplicated logic there specific to nommu
> > > for which precisely the same could be said, including entirely parallel
> > > brk(), mmap() implementations.
> > >
> > > So this isn't a change in how we handle nommu.
> >
> > I guess an alternative is to introduce a new vma_shared.c, vma_shared.h
> > pair of files here, that we try to allow userland isolation for, so vma.c
> > can still use them for userland testing.
> >
> > This then aligns with your requirement, and keeps it vma-centric like
> > Suren's suggestion.
> >
> > Or perhaps it could even be vma_init.c, vma_init.h? To denote that it
> > references the initialisation and allocation, etc. of VMAs?
>
> Sure, the name isn't as important as the concept, but I like vma_init
> better than vma_shared.
>
> >
> > Any way we do that, we share it across all, and it solves all
> > problems... gives us the isolation for userland testing and also isolation
> > in mm, while also ensuring no code duplication with nommu.
> >
> > That work?
>
> Yes, this is what I was suggesting.
Ack, I wondered if you meant placing it in some existing shared file such
as mm/util.c (hence my objection), but this works better :)
>
> I really think this is the least painful way to deal with nommu.
> Otherwise we will waste more time later trying to fix what was
> overlooked.
Sure, I definitely understand the motivation. I resent that we have to
think about this, but we do have to live with it, so agreed this is the
best approach.
>
> Thanks,
> Liam
>
Will rework series to do this and send a v2.
Cheers, Lorenzo
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 10:10 ` Lorenzo Stoakes
@ 2025-04-25 11:04 ` Lorenzo Stoakes
0 siblings, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-04-25 11:04 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: David Hildenbrand, Andrew Morton, Liam R . Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Kees Cook,
Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
On Fri, Apr 25, 2025 at 11:10:00AM +0100, Lorenzo Stoakes wrote:
> On Thu, Apr 24, 2025 at 06:37:39PM -0700, Suren Baghdasaryan wrote:
> > On Thu, Apr 24, 2025 at 6:22 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Thu, Apr 24, 2025 at 2:22 PM David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > On 24.04.25 23:15, Lorenzo Stoakes wrote:
> > > > > Right now these are performed in kernel/fork.c which is odd and a violation
> > > > > of separation of concerns, as well as preventing us from integrating this
> > > > > and related logic into userland VMA testing going forward, and perhaps more
> > > > > importantly - enabling us to, in a subsequent commit, make VMA
> > > > > allocation/freeing a purely internal mm operation.
> > > > >
> > > > > There is a fly in the ointment - nommu - mmap.c is not compiled if
> > > > > CONFIG_MMU is not set, and there is no sensible place to put these outside
> > > > > of that, so we are put in the position of having to duplication some logic
> > >
> > > s/to duplication/to duplicate
> > >
> > > > > here.
> > > > >
> > > > > This isn't ideal, but since nommu is a niche use-case, already duplicates a
> > > > > great deal of mmu logic by its nature and we can eliminate code that is not
> > > > > applicable to nommu, it seems a worthwhile trade-off.
> > > > >
> > > > > The intent is to move all this logic to vma.c in a subsequent commit,
> > > > > rendering VMA allocation, freeing and duplication mm-internal-only and
> > > > > userland testable.
> > > >
> > > > I'm pretty sure you tried it, but what's the big blocker to have patch
> > > > #3 first, so we can avoid the temporary move of the code to mmap.c ?
> > >
> > > Completely agree with David.
> > > I peeked into 4/4 and it seems you want to keep vma.c completely
> > > CONFIG_MMU-centric. I know we treat NOMMU as an unwanted child but
> > > IMHO it would be much cleaner to move these functions into vma.c from
> > > the beginning and have an #ifdef CONFIG_MMU there like this:
> > >
> > > mm/vma.c
> > >
> > > /* Functions identical for MMU/NOMMU */
> > > struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) {...}
> > > void __init vma_state_init(void) {...}
> > >
> > > #ifdef CONFIG_MMU
> > > static void vm_area_init_from(const struct vm_area_struct *src,
> > > struct vm_area_struct *dest) {...}
> > > struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) {...}
> > > void vm_area_free(struct vm_area_struct *vma) {...}
> > > #else /* CONFIG_MMU */
> > > static void vm_area_init_from(const struct vm_area_struct *src,
> > > struct vm_area_struct *dest) {...}
> > > struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) {...}
> > > void vm_area_free(struct vm_area_struct *vma) {...}
> > > #endif /* CONFIG_MMU */
> >
> > 3/4 and 4/4 look reasonable but they can change substantially
> > depending on your answer to my suggestion above, so I'll wait for your
> > answer before moving forward.
> > Thanks for doing this!
> > Suren.
>
> You're welcome :)
>
> Well I will be fixing the issue David raised of course :) but as stated in
> previous email, I don't feel it makes sense to put nommu stuff in vma.c really.
UPDATE: As per discussions with Liam, will be going ahead with an alternative
but equivalent approach.
Thanks to both of you for your suggestions on this!
Cheers, Lorenzo
>
> >
> > >
> > >
> > >
> > >
> > >
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > David / dhildenb
> > > >
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 10:40 ` Lorenzo Stoakes
2025-04-25 10:53 ` Pedro Falcato
@ 2025-04-25 13:54 ` Liam R. Howlett
2025-04-25 15:32 ` Suren Baghdasaryan
1 sibling, 1 reply; 31+ messages in thread
From: Liam R. Howlett @ 2025-04-25 13:54 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Kees Cook, Andrew Morton, Vlastimil Babka, Jann Horn,
Pedro Falcato, David Hildenbrand, Alexander Viro,
Christian Brauner, Jan Kara, Suren Baghdasaryan, linux-mm,
linux-fsdevel, linux-kernel
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250425 06:40]:
> On Thu, Apr 24, 2025 at 08:15:26PM -0700, Kees Cook wrote:
> >
> >
> > On April 24, 2025 2:15:27 PM PDT, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >+static void vm_area_init_from(const struct vm_area_struct *src,
> > >+ struct vm_area_struct *dest)
> > >+{
> > >+ dest->vm_mm = src->vm_mm;
> > >+ dest->vm_ops = src->vm_ops;
> > >+ dest->vm_start = src->vm_start;
> > >+ dest->vm_end = src->vm_end;
> > >+ dest->anon_vma = src->anon_vma;
> > >+ dest->vm_pgoff = src->vm_pgoff;
> > >+ dest->vm_file = src->vm_file;
> > >+ dest->vm_private_data = src->vm_private_data;
> > >+ vm_flags_init(dest, src->vm_flags);
> > >+ memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > >+ sizeof(dest->vm_page_prot));
> > >+ /*
> > >+ * src->shared.rb may be modified concurrently when called from
> > >+ * dup_mmap(), but the clone will reinitialize it.
> > >+ */
> > >+ data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> > >+ memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > >+ sizeof(dest->vm_userfaultfd_ctx));
> > >+#ifdef CONFIG_ANON_VMA_NAME
> > >+ dest->anon_name = src->anon_name;
> > >+#endif
> > >+#ifdef CONFIG_SWAP
> > >+ memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > >+ sizeof(dest->swap_readahead_info));
> > >+#endif
> > >+#ifdef CONFIG_NUMA
> > >+ dest->vm_policy = src->vm_policy;
> > >+#endif
> > >+}
> >
> > I know you're doing a big cut/paste here, but why in the world is this function written this way? Why not just:
> >
> > *dest = *src;
> >
> > And then do any one-off cleanups?
>
> Yup I find it odd, and error prone to be honest. We'll end up with uninitialised
> state for some fields if we miss them here, seems unwise...
>
> Presumably for performance?
>
> This is, as you say, me simply propagating what exists, but I do wonder.
Two things come to mind:
1. How ctors are done. (v3 of Suren's RCU safe patch series, willy made
a comment.. I think)
2. Some race that Vlastimil came up with the copy and the RCU safeness.
IIRC it had to do with the ordering of the setting of things?
Also, looking at it again...
How is it safe to do dest->anon_name = src->anon_name? Isn't that ref
counted?
Pretty sure it's okay, but Suren would know for sure on all of this.
Suren, maybe you could send a patch with comments on this stuff?
Thanks,
Liam
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 13:54 ` Liam R. Howlett
@ 2025-04-25 15:32 ` Suren Baghdasaryan
2025-04-25 15:34 ` Suren Baghdasaryan
2025-04-25 17:12 ` Kees Cook
0 siblings, 2 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-04-25 15:32 UTC (permalink / raw)
To: Liam R. Howlett, Lorenzo Stoakes, Kees Cook, Andrew Morton,
Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand,
Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan,
linux-mm, linux-fsdevel, linux-kernel
On Fri, Apr 25, 2025 at 6:55 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250425 06:40]:
> > On Thu, Apr 24, 2025 at 08:15:26PM -0700, Kees Cook wrote:
> > >
> > >
> > > On April 24, 2025 2:15:27 PM PDT, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > > >+static void vm_area_init_from(const struct vm_area_struct *src,
> > > >+ struct vm_area_struct *dest)
> > > >+{
> > > >+ dest->vm_mm = src->vm_mm;
> > > >+ dest->vm_ops = src->vm_ops;
> > > >+ dest->vm_start = src->vm_start;
> > > >+ dest->vm_end = src->vm_end;
> > > >+ dest->anon_vma = src->anon_vma;
> > > >+ dest->vm_pgoff = src->vm_pgoff;
> > > >+ dest->vm_file = src->vm_file;
> > > >+ dest->vm_private_data = src->vm_private_data;
> > > >+ vm_flags_init(dest, src->vm_flags);
> > > >+ memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > > >+ sizeof(dest->vm_page_prot));
> > > >+ /*
> > > >+ * src->shared.rb may be modified concurrently when called from
> > > >+ * dup_mmap(), but the clone will reinitialize it.
> > > >+ */
> > > >+ data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> > > >+ memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > > >+ sizeof(dest->vm_userfaultfd_ctx));
> > > >+#ifdef CONFIG_ANON_VMA_NAME
> > > >+ dest->anon_name = src->anon_name;
> > > >+#endif
> > > >+#ifdef CONFIG_SWAP
> > > >+ memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > > >+ sizeof(dest->swap_readahead_info));
> > > >+#endif
> > > >+#ifdef CONFIG_NUMA
> > > >+ dest->vm_policy = src->vm_policy;
> > > >+#endif
> > > >+}
> > >
> > > I know you're doing a big cut/paste here, but why in the world is this function written this way? Why not just:
> > >
> > > *dest = *src;
> > >
> > > And then do any one-off cleanups?
> >
> > Yup I find it odd, and error prone to be honest. We'll end up with uninitialised
> > state for some fields if we miss them here, seems unwise...
> >
> > Presumably for performance?
> >
> > This is, as you say, me simply propagating what exists, but I do wonder.
>
> Two things come to mind:
>
> 1. How ctors are done. (v3 of Suren's RCU safe patch series, willy made
> a comment.. I think)
>
> 2. Some race that Vlastimil came up with the copy and the RCU safeness.
> IIRC it had to do with the ordering of the setting of things?
>
> Also, looking at it again...
>
> How is it safe to do dest->anon_name = src->anon_name? Isn't that ref
> counted?
dest->anon_name = src->anon_name is fine here because right after
vm_area_init_from() we call dup_anon_vma_name() which will bump up the
refcount. I don't recall why this is done this way but now looking at
it I wonder if I could call dup_anon_vma_name() directly instead of
this assignment. Might be just an overlooked legacy from the time we
memcpy'd the entire structure. I'll need to double-check.
>
> Pretty sure it's okay, but Suren would know for sure on all of this.
>
> Suren, maybe you could send a patch with comments on this stuff?
Yeah, I think I need to add some comments in this code for
clarification. We do not copy the entire vm_area_struct because we
have to preserve vma->vm_refcnt field of the dest vma. Since these
structures are allocated from a cache with SLAB_TYPESAFE_BY_RCU,
another thread might be concurrently checking the state of the dest
object by reading dest->vm_refcnt. Therefore it's important here not
to overwrite the vm_refcnt. Changelog in
https://lore.kernel.org/all/20250213224655.1680278-18-surenb@google.com/
touches on it but a comment in the code would be indeed helpful. Will
add it but will wait for Lorenzo's refactoring to land into linux-mm
first to avoid adding merge conflicts.
>
> Thanks,
> Liam
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 15:32 ` Suren Baghdasaryan
@ 2025-04-25 15:34 ` Suren Baghdasaryan
2025-04-25 17:12 ` Kees Cook
1 sibling, 0 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-04-25 15:34 UTC (permalink / raw)
To: Liam R. Howlett, Lorenzo Stoakes, Kees Cook, Andrew Morton,
Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand,
Alexander Viro, Christian Brauner, Jan Kara, Suren Baghdasaryan,
linux-mm, linux-fsdevel, linux-kernel
On Fri, Apr 25, 2025 at 8:32 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Apr 25, 2025 at 6:55 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250425 06:40]:
> > > On Thu, Apr 24, 2025 at 08:15:26PM -0700, Kees Cook wrote:
> > > >
> > > >
> > > > On April 24, 2025 2:15:27 PM PDT, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > > > >+static void vm_area_init_from(const struct vm_area_struct *src,
> > > > >+ struct vm_area_struct *dest)
> > > > >+{
> > > > >+ dest->vm_mm = src->vm_mm;
> > > > >+ dest->vm_ops = src->vm_ops;
> > > > >+ dest->vm_start = src->vm_start;
> > > > >+ dest->vm_end = src->vm_end;
> > > > >+ dest->anon_vma = src->anon_vma;
> > > > >+ dest->vm_pgoff = src->vm_pgoff;
> > > > >+ dest->vm_file = src->vm_file;
> > > > >+ dest->vm_private_data = src->vm_private_data;
> > > > >+ vm_flags_init(dest, src->vm_flags);
> > > > >+ memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > > > >+ sizeof(dest->vm_page_prot));
> > > > >+ /*
> > > > >+ * src->shared.rb may be modified concurrently when called from
> > > > >+ * dup_mmap(), but the clone will reinitialize it.
> > > > >+ */
> > > > >+ data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> > > > >+ memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > > > >+ sizeof(dest->vm_userfaultfd_ctx));
> > > > >+#ifdef CONFIG_ANON_VMA_NAME
> > > > >+ dest->anon_name = src->anon_name;
> > > > >+#endif
> > > > >+#ifdef CONFIG_SWAP
> > > > >+ memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > > > >+ sizeof(dest->swap_readahead_info));
> > > > >+#endif
> > > > >+#ifdef CONFIG_NUMA
> > > > >+ dest->vm_policy = src->vm_policy;
> > > > >+#endif
> > > > >+}
> > > >
> > > > I know you're doing a big cut/paste here, but why in the world is this function written this way? Why not just:
> > > >
> > > > *dest = *src;
> > > >
> > > > And then do any one-off cleanups?
> > >
> > > Yup I find it odd, and error prone to be honest. We'll end up with uninitialised
> > > state for some fields if we miss them here, seems unwise...
> > >
> > > Presumably for performance?
> > >
> > > This is, as you say, me simply propagating what exists, but I do wonder.
> >
> > Two things come to mind:
> >
> > 1. How ctors are done. (v3 of Suren's RCU safe patch series, willy made
> > a comment.. I think)
> >
> > 2. Some race that Vlastimil came up with the copy and the RCU safeness.
> > IIRC it had to do with the ordering of the setting of things?
> >
> > Also, looking at it again...
> >
> > How is it safe to do dest->anon_name = src->anon_name? Isn't that ref
> > counted?
>
> dest->anon_name = src->anon_name is fine here because right after
> vm_area_init_from() we call dup_anon_vma_name() which will bump up the
> refcount. I don't recall why this is done this way but now looking at
> it I wonder if I could call dup_anon_vma_name() directly instead of
> this assignment. Might be just an overlooked legacy from the time we
> memcpy'd the entire structure. I'll need to double-check.
>
> >
> > Pretty sure it's okay, but Suren would know for sure on all of this.
> >
> > Suren, maybe you could send a patch with comments on this stuff?
>
> Yeah, I think I need to add some comments in this code for
> clarification. We do not copy the entire vm_area_struct because we
> have to preserve vma->vm_refcnt field of the dest vma. Since these
> structures are allocated from a cache with SLAB_TYPESAFE_BY_RCU,
> another thread might be concurrently checking the state of the dest
> object by reading dest->vm_refcnt. Therefore it's important here not
> to overwrite the vm_refcnt. Changelog in
> https://lore.kernel.org/all/20250213224655.1680278-18-surenb@google.com/
> touches on it but a comment in the code would be indeed helpful. Will
> add it but will wait for Lorenzo's refactoring to land into linux-mm
s/linux-mm/mm-unstable. I need my morning coffee.
> first to avoid adding merge conflicts.
>
> >
> > Thanks,
> > Liam
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 15:32 ` Suren Baghdasaryan
2025-04-25 15:34 ` Suren Baghdasaryan
@ 2025-04-25 17:12 ` Kees Cook
2025-04-25 17:26 ` Suren Baghdasaryan
1 sibling, 1 reply; 31+ messages in thread
From: Kees Cook @ 2025-04-25 17:12 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Liam R. Howlett, Lorenzo Stoakes, Andrew Morton, Vlastimil Babka,
Jann Horn, Pedro Falcato, David Hildenbrand, Alexander Viro,
Christian Brauner, Jan Kara, linux-mm, linux-fsdevel,
linux-kernel
On Fri, Apr 25, 2025 at 08:32:48AM -0700, Suren Baghdasaryan wrote:
> On Fri, Apr 25, 2025 at 6:55 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250425 06:40]:
> > > On Thu, Apr 24, 2025 at 08:15:26PM -0700, Kees Cook wrote:
> > > >
> > > >
> > > > On April 24, 2025 2:15:27 PM PDT, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > > > >+static void vm_area_init_from(const struct vm_area_struct *src,
> > > > >+ struct vm_area_struct *dest)
> > > > >+{
> > > > >+ dest->vm_mm = src->vm_mm;
> > > > >+ dest->vm_ops = src->vm_ops;
> > > > >+ dest->vm_start = src->vm_start;
> > > > >+ dest->vm_end = src->vm_end;
> > > > >+ dest->anon_vma = src->anon_vma;
> > > > >+ dest->vm_pgoff = src->vm_pgoff;
> > > > >+ dest->vm_file = src->vm_file;
> > > > >+ dest->vm_private_data = src->vm_private_data;
> > > > >+ vm_flags_init(dest, src->vm_flags);
> > > > >+ memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > > > >+ sizeof(dest->vm_page_prot));
> > > > >+ /*
> > > > >+ * src->shared.rb may be modified concurrently when called from
> > > > >+ * dup_mmap(), but the clone will reinitialize it.
> > > > >+ */
> > > > >+ data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> > > > >+ memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > > > >+ sizeof(dest->vm_userfaultfd_ctx));
> > > > >+#ifdef CONFIG_ANON_VMA_NAME
> > > > >+ dest->anon_name = src->anon_name;
> > > > >+#endif
> > > > >+#ifdef CONFIG_SWAP
> > > > >+ memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > > > >+ sizeof(dest->swap_readahead_info));
> > > > >+#endif
> > > > >+#ifdef CONFIG_NUMA
> > > > >+ dest->vm_policy = src->vm_policy;
> > > > >+#endif
> > > > >+}
> > > >
> > > > I know you're doing a big cut/paste here, but why in the world is this function written this way? Why not just:
> > > >
> > > > *dest = *src;
> > > >
> > > > And then do any one-off cleanups?
> > >
> > > Yup I find it odd, and error prone to be honest. We'll end up with uninitialised
> > > state for some fields if we miss them here, seems unwise...
> > >
> > > Presumably for performance?
> > >
> > > This is, as you say, me simply propagating what exists, but I do wonder.
> >
> > Two things come to mind:
> >
> > 1. How ctors are done. (v3 of Suren's RCU safe patch series, willy made
> > a comment.. I think)
> >
> > 2. Some race that Vlastimil came up with the copy and the RCU safeness.
> > IIRC it had to do with the ordering of the setting of things?
> >
> > Also, looking at it again...
> >
> > How is it safe to do dest->anon_name = src->anon_name? Isn't that ref
> > counted?
>
> dest->anon_name = src->anon_name is fine here because right after
> vm_area_init_from() we call dup_anon_vma_name() which will bump up the
> refcount. I don't recall why this is done this way but now looking at
> it I wonder if I could call dup_anon_vma_name() directly instead of
> this assignment. Might be just an overlooked legacy from the time we
> memcpy'd the entire structure. I'll need to double-check.
Oh, is "dest" accessible to other CPU threads? I hadn't looked and was
assuming this was like process creation where everything gets built in
isolation and then attached to the main process tree. I was thinking
this was similar.
--
Kees Cook
* Re: [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm
2025-04-25 17:12 ` Kees Cook
@ 2025-04-25 17:26 ` Suren Baghdasaryan
0 siblings, 0 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-04-25 17:26 UTC (permalink / raw)
To: Kees Cook
Cc: Liam R. Howlett, Lorenzo Stoakes, Andrew Morton, Vlastimil Babka,
Jann Horn, Pedro Falcato, David Hildenbrand, Alexander Viro,
Christian Brauner, Jan Kara, linux-mm, linux-fsdevel,
linux-kernel
On Fri, Apr 25, 2025 at 10:12 AM Kees Cook <kees@kernel.org> wrote:
>
> On Fri, Apr 25, 2025 at 08:32:48AM -0700, Suren Baghdasaryan wrote:
> > On Fri, Apr 25, 2025 at 6:55 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250425 06:40]:
> > > > On Thu, Apr 24, 2025 at 08:15:26PM -0700, Kees Cook wrote:
> > > > >
> > > > >
> > > > > On April 24, 2025 2:15:27 PM PDT, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > > > > >+static void vm_area_init_from(const struct vm_area_struct *src,
> > > > > >+ struct vm_area_struct *dest)
> > > > > >+{
> > > > > >+ dest->vm_mm = src->vm_mm;
> > > > > >+ dest->vm_ops = src->vm_ops;
> > > > > >+ dest->vm_start = src->vm_start;
> > > > > >+ dest->vm_end = src->vm_end;
> > > > > >+ dest->anon_vma = src->anon_vma;
> > > > > >+ dest->vm_pgoff = src->vm_pgoff;
> > > > > >+ dest->vm_file = src->vm_file;
> > > > > >+ dest->vm_private_data = src->vm_private_data;
> > > > > >+ vm_flags_init(dest, src->vm_flags);
> > > > > >+ memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > > > > >+ sizeof(dest->vm_page_prot));
> > > > > >+ /*
> > > > > >+ * src->shared.rb may be modified concurrently when called from
> > > > > >+ * dup_mmap(), but the clone will reinitialize it.
> > > > > >+ */
> > > > > >+ data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> > > > > >+ memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > > > > >+ sizeof(dest->vm_userfaultfd_ctx));
> > > > > >+#ifdef CONFIG_ANON_VMA_NAME
> > > > > >+ dest->anon_name = src->anon_name;
> > > > > >+#endif
> > > > > >+#ifdef CONFIG_SWAP
> > > > > >+ memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > > > > >+ sizeof(dest->swap_readahead_info));
> > > > > >+#endif
> > > > > >+#ifdef CONFIG_NUMA
> > > > > >+ dest->vm_policy = src->vm_policy;
> > > > > >+#endif
> > > > > >+}
> > > > >
> > > > > I know you're doing a big cut/paste here, but why in the world is this function written this way? Why not just:
> > > > >
> > > > > *dest = *src;
> > > > >
> > > > > And then do any one-off cleanups?
> > > >
> > > > Yup I find it odd, and error prone to be honest. We'll end up with uninitialised
> > > > state for some fields if we miss them here, seems unwise...
> > > >
> > > > Presumably for performance?
> > > >
> > > > This is, as you say, me simply propagating what exists, but I do wonder.
> > >
> > > Two things come to mind:
> > >
> > > 1. How ctors are done. (v3 of Suren's RCU safe patch series, willy made
> > > a comment.. I think)
> > >
> > > 2. Some race that Vlastimil came up with the copy and the RCU safeness.
> > > IIRC it had to do with the ordering of the setting of things?
> > >
> > > Also, looking at it again...
> > >
> > > How is it safe to do dest->anon_name = src->anon_name? Isn't that ref
> > > counted?
> >
> > dest->anon_name = src->anon_name is fine here because right after
> > vm_area_init_from() we call dup_anon_vma_name() which will bump up the
> > refcount. I don't recall why this is done this way but now looking at
> > it I wonder if I could call dup_anon_vma_name() directly instead of
> > this assignment. Might be just an overlooked legacy from the time we
> > memcpy'd the entire structure. I'll need to double-check.
>
> Oh, is "dest" accessible to other CPU threads? I hadn't looked and was
> assuming this was like process creation where everything gets built in
> isolation and then attached to the main process tree. I was thinking
> this was similar.
Yeah, it's process creation time, but this structure is created from a
SLAB_TYPESAFE_BY_RCU cache, which adds complexity. A newly allocated
object from this cache might still be accessible from another thread
holding a reference to its earlier incarnation. We need an indication
for that other thread that says "this object has been released, so the
reference you are holding points to a freed or reallocated/wrong
object". vm_refcnt is that indication in this case, and we are careful
not to override it even temporarily during object initialization.
Well, in truth we do override it later with 0, but for the other
thread that will still mean this object is not what it wants.
I suspect you know this already, but just in case,
https://elixir.bootlin.com/linux/v6.14.3/source/include/linux/slab.h#L101
has a more detailed explanation of SLAB_TYPESAFE_BY_RCU.
>
> --
> Kees Cook
end of thread, other threads: [~2025-04-25 17:26 UTC | newest]
Thread overview: 31+ messages -- links below jump to the message on this page --
2025-04-24 21:15 [PATCH 0/4] move all VMA allocation, freeing and duplication logic to mm Lorenzo Stoakes
2025-04-24 21:15 ` [PATCH 1/4] mm: abstract initial stack setup to mm subsystem Lorenzo Stoakes
2025-04-24 21:30 ` David Hildenbrand
2025-04-25 0:55 ` Suren Baghdasaryan
2025-04-25 10:10 ` Lorenzo Stoakes
2025-04-25 10:11 ` Lorenzo Stoakes
2025-04-24 21:15 ` [PATCH 2/4] mm: perform VMA allocation, freeing, duplication in mm Lorenzo Stoakes
2025-04-24 21:22 ` David Hildenbrand
2025-04-25 1:22 ` Suren Baghdasaryan
2025-04-25 1:37 ` Suren Baghdasaryan
2025-04-25 10:10 ` Lorenzo Stoakes
2025-04-25 11:04 ` Lorenzo Stoakes
2025-04-25 10:09 ` Lorenzo Stoakes
2025-04-25 10:26 ` Liam R. Howlett
2025-04-25 10:31 ` Lorenzo Stoakes
2025-04-25 10:45 ` Lorenzo Stoakes
2025-04-25 11:00 ` Liam R. Howlett
2025-04-25 11:03 ` Lorenzo Stoakes
2025-04-25 10:17 ` Lorenzo Stoakes
2025-04-25 3:15 ` Kees Cook
2025-04-25 10:40 ` Lorenzo Stoakes
2025-04-25 10:53 ` Pedro Falcato
2025-04-25 13:54 ` Liam R. Howlett
2025-04-25 15:32 ` Suren Baghdasaryan
2025-04-25 15:34 ` Suren Baghdasaryan
2025-04-25 17:12 ` Kees Cook
2025-04-25 17:26 ` Suren Baghdasaryan
2025-04-24 21:15 ` [PATCH 3/4] mm: move dup_mmap() to mm Lorenzo Stoakes
2025-04-25 9:13 ` Pedro Falcato
2025-04-25 10:18 ` Lorenzo Stoakes
2025-04-24 21:15 ` [PATCH 4/4] mm: move vm_area_alloc,dup,free() functions to vma.c Lorenzo Stoakes