* [PATCH v2 0/5] fork: Page operation cleanups in the fork code
@ 2025-05-07 12:46 Linus Walleij
2025-05-07 12:46 ` [PATCH v2 1/5] fork: Clean-up ifdef logic around stack allocation Linus Walleij
` (5 more replies)
0 siblings, 6 replies; 16+ messages in thread
From: Linus Walleij @ 2025-05-07 12:46 UTC (permalink / raw)
To: Andrew Morton, linux-mm, Pasha Tatashin; +Cc: Linus Walleij
This patch set consists of outtakes from a more than year-old
patch set from Pasha; each patch stands on its own.
See:
https://lore.kernel.org/all/20240311164638.2015063-1-pasha.tatashin@soleen.com/
What the code mainly does is make the fork.c file more
oriented around pages and remove reliance on THREAD_SIZE.
These are good cleanups for readability and in one case
(the last patch, using clear_page()) a performance improvement,
so I split these off, rebased them on v6.15-rc1, addressed review
comments and sent them separately.
All mentions of dynamic stack are removed from the patch
set as we have no idea whether that will go anywhere.
This is mostly MM related, so when the patches are ready
I expect they will land in Andrew's patch stack.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
Changes in v2:
- Fix subject on patch 2/5
- Fix bisect problem in BUG() guard in patch 2/5
- Move back to using a local nr_pages variable in
patch 3/5 for performance concerns.
- Use preferred patch augment format.
- Link to v1: https://lore.kernel.org/r/20250506-fork-fixes-v1-0-bd35b63f0f1b@linaro.org
---
Pasha Tatashin (5):
fork: Clean-up ifdef logic around stack allocation
fork: Clean-up naming of vm_stack/vm_struct variables in vmap stacks code
fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
fork: check charging success before zeroing stack
fork: zero vmap stack using clear_page() instead of memset()
kernel/fork.c | 103 +++++++++++++++++++++++++++++-----------------------------
1 file changed, 52 insertions(+), 51 deletions(-)
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250504-fork-fixes-9378b6c57873
Best regards,
--
Linus Walleij <linus.walleij@linaro.org>
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 1/5] fork: Clean-up ifdef logic around stack allocation
2025-05-07 12:46 [PATCH v2 0/5] fork: Page operation cleanups in the fork code Linus Walleij
@ 2025-05-07 12:46 ` Linus Walleij
2025-05-13 6:29 ` Mike Rapoport
2025-05-07 12:46 ` [PATCH v2 2/5] fork: Clean-up naming of vm_stack/vm_struct variables in vmap stacks code Linus Walleij
` (4 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Linus Walleij @ 2025-05-07 12:46 UTC (permalink / raw)
To: Andrew Morton, linux-mm, Pasha Tatashin; +Cc: Linus Walleij
From: Pasha Tatashin <pasha.tatashin@soleen.com>
There is an unneeded OR in the #ifdef logic used to select how
kernel stacks are allocated and freed: from the direct map or from vmap.
Clean this up by changing the order of the conditionals so the OR is
no longer needed.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/20240311164638.2015063-3-pasha.tatashin@soleen.com
[linus.walleij@linaro.org: Rebased]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
kernel/fork.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd8998b8e7b2b516e0bb0b1d4676ff644dc..7b9e1ad141baaeb158b1807ea9fc3ef246f5f3a7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -185,13 +185,7 @@ static inline void free_task_struct(struct task_struct *tsk)
kmem_cache_free(task_struct_cachep, tsk);
}
-/*
- * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
- * kmemcache based allocator.
- */
-# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
-
-# ifdef CONFIG_VMAP_STACK
+#ifdef CONFIG_VMAP_STACK
/*
* vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
* flush. Try to minimize the number of calls by caching stacks.
@@ -342,7 +336,13 @@ static void free_thread_stack(struct task_struct *tsk)
tsk->stack_vm_area = NULL;
}
-# else /* !CONFIG_VMAP_STACK */
+#else /* !CONFIG_VMAP_STACK */
+
+/*
+ * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
+ * kmemcache based allocator.
+ */
+#if THREAD_SIZE >= PAGE_SIZE
static void thread_stack_free_rcu(struct rcu_head *rh)
{
@@ -374,8 +374,7 @@ static void free_thread_stack(struct task_struct *tsk)
tsk->stack = NULL;
}
-# endif /* CONFIG_VMAP_STACK */
-# else /* !(THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)) */
+#else /* !(THREAD_SIZE >= PAGE_SIZE) */
static struct kmem_cache *thread_stack_cache;
@@ -414,7 +413,8 @@ void thread_stack_cache_init(void)
BUG_ON(thread_stack_cache == NULL);
}
-# endif /* THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK) */
+#endif /* THREAD_SIZE >= PAGE_SIZE */
+#endif /* CONFIG_VMAP_STACK */
/* SLAB cache for signal_struct structures (tsk->signal) */
static struct kmem_cache *signal_cachep;
--
2.49.0
* [PATCH v2 2/5] fork: Clean-up naming of vm_stack/vm_struct variables in vmap stacks code
2025-05-07 12:46 [PATCH v2 0/5] fork: Page operation cleanups in the fork code Linus Walleij
2025-05-07 12:46 ` [PATCH v2 1/5] fork: Clean-up ifdef logic around stack allocation Linus Walleij
@ 2025-05-07 12:46 ` Linus Walleij
2025-05-13 6:33 ` Mike Rapoport
2025-05-07 12:46 ` [PATCH v2 3/5] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE Linus Walleij
` (3 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Linus Walleij @ 2025-05-07 12:46 UTC (permalink / raw)
To: Andrew Morton, linux-mm, Pasha Tatashin; +Cc: Linus Walleij
From: Pasha Tatashin <pasha.tatashin@soleen.com>
There are two data types, "struct vm_struct" and "struct vm_stack",
whose local variables share the names vm_stack, vm, or s, which makes
the code confusing to read.
Change the code so the naming is consistent:
struct vm_struct is always called vm_area
struct vm_stack is always called vm_stack
One change, altering vfree(vm_stack) to vfree(vm_area->addr), may look
like a semantic change but it is not: vm_area->addr points to the
vm_stack. This was done to improve readability.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/20240311164638.2015063-4-pasha.tatashin@soleen.com
[linus.walleij@linaro.org: Rebased and added new users of the variable names, address review comments]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
kernel/fork.c | 60 +++++++++++++++++++++++++++++------------------------------
1 file changed, 29 insertions(+), 31 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 7b9e1ad141baaeb158b1807ea9fc3ef246f5f3a7..8b8457562740c114c640a8cc230876f6a286b246 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -198,14 +198,14 @@ struct vm_stack {
struct vm_struct *stack_vm_area;
};
-static bool try_release_thread_stack_to_cache(struct vm_struct *vm)
+static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
{
unsigned int i;
for (i = 0; i < NR_CACHED_STACKS; i++) {
struct vm_struct *tmp = NULL;
- if (this_cpu_try_cmpxchg(cached_stacks[i], &tmp, vm))
+ if (this_cpu_try_cmpxchg(cached_stacks[i], &tmp, vm_area))
return true;
}
return false;
@@ -214,11 +214,12 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm)
static void thread_stack_free_rcu(struct rcu_head *rh)
{
struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
+ struct vm_struct *vm_area = vm_stack->stack_vm_area;
if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
return;
- vfree(vm_stack);
+ vfree(vm_area->addr);
}
static void thread_stack_delayed_free(struct task_struct *tsk)
@@ -231,32 +232,32 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
static int free_vm_stack_cache(unsigned int cpu)
{
- struct vm_struct **cached_vm_stacks = per_cpu_ptr(cached_stacks, cpu);
+ struct vm_struct **cached_vm_stack_areas = per_cpu_ptr(cached_stacks, cpu);
int i;
for (i = 0; i < NR_CACHED_STACKS; i++) {
- struct vm_struct *vm_stack = cached_vm_stacks[i];
+ struct vm_struct *vm_area = cached_vm_stack_areas[i];
- if (!vm_stack)
+ if (!vm_area)
continue;
- vfree(vm_stack->addr);
- cached_vm_stacks[i] = NULL;
+ vfree(vm_area->addr);
+ cached_vm_stack_areas[i] = NULL;
}
return 0;
}
-static int memcg_charge_kernel_stack(struct vm_struct *vm)
+static int memcg_charge_kernel_stack(struct vm_struct *vm_area)
{
int i;
int ret;
int nr_charged = 0;
- BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
+ BUG_ON(vm_area->nr_pages != THREAD_SIZE / PAGE_SIZE);
for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
- ret = memcg_kmem_charge_page(vm->pages[i], GFP_KERNEL, 0);
+ ret = memcg_kmem_charge_page(vm_area->pages[i], GFP_KERNEL, 0);
if (ret)
goto err;
nr_charged++;
@@ -264,38 +265,35 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm)
return 0;
err:
for (i = 0; i < nr_charged; i++)
- memcg_kmem_uncharge_page(vm->pages[i], 0);
+ memcg_kmem_uncharge_page(vm_area->pages[i], 0);
return ret;
}
static int alloc_thread_stack_node(struct task_struct *tsk, int node)
{
- struct vm_struct *vm;
+ struct vm_struct *vm_area;
void *stack;
int i;
for (i = 0; i < NR_CACHED_STACKS; i++) {
- struct vm_struct *s;
-
- s = this_cpu_xchg(cached_stacks[i], NULL);
-
- if (!s)
+ vm_area = this_cpu_xchg(cached_stacks[i], NULL);
+ if (!vm_area)
continue;
/* Reset stack metadata. */
- kasan_unpoison_range(s->addr, THREAD_SIZE);
+ kasan_unpoison_range(vm_area->addr, THREAD_SIZE);
- stack = kasan_reset_tag(s->addr);
+ stack = kasan_reset_tag(vm_area->addr);
/* Clear stale pointers from reused stack. */
memset(stack, 0, THREAD_SIZE);
- if (memcg_charge_kernel_stack(s)) {
- vfree(s->addr);
+ if (memcg_charge_kernel_stack(vm_area)) {
+ vfree(vm_area->addr);
return -ENOMEM;
}
- tsk->stack_vm_area = s;
+ tsk->stack_vm_area = vm_area;
tsk->stack = stack;
return 0;
}
@@ -311,8 +309,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
if (!stack)
return -ENOMEM;
- vm = find_vm_area(stack);
- if (memcg_charge_kernel_stack(vm)) {
+ vm_area = find_vm_area(stack);
+ if (memcg_charge_kernel_stack(vm_area)) {
vfree(stack);
return -ENOMEM;
}
@@ -321,7 +319,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
* free_thread_stack() can be called in interrupt context,
* so cache the vm_struct.
*/
- tsk->stack_vm_area = vm;
+ tsk->stack_vm_area = vm_area;
stack = kasan_reset_tag(stack);
tsk->stack = stack;
return 0;
@@ -517,11 +515,11 @@ void vm_area_free(struct vm_area_struct *vma)
static void account_kernel_stack(struct task_struct *tsk, int account)
{
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
- struct vm_struct *vm = task_stack_vm_area(tsk);
+ struct vm_struct *vm_area = task_stack_vm_area(tsk);
int i;
for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
- mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
+ mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
account * (PAGE_SIZE / 1024));
} else {
void *stack = task_stack_page(tsk);
@@ -537,12 +535,12 @@ void exit_task_stack_account(struct task_struct *tsk)
account_kernel_stack(tsk, -1);
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
- struct vm_struct *vm;
+ struct vm_struct *vm_area;
int i;
- vm = task_stack_vm_area(tsk);
+ vm_area = task_stack_vm_area(tsk);
for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
- memcg_kmem_uncharge_page(vm->pages[i], 0);
+ memcg_kmem_uncharge_page(vm_area->pages[i], 0);
}
}
--
2.49.0
* [PATCH v2 3/5] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
2025-05-07 12:46 [PATCH v2 0/5] fork: Page operation cleanups in the fork code Linus Walleij
2025-05-07 12:46 ` [PATCH v2 1/5] fork: Clean-up ifdef logic around stack allocation Linus Walleij
2025-05-07 12:46 ` [PATCH v2 2/5] fork: Clean-up naming of vm_stack/vm_struct variables in vmap stacks code Linus Walleij
@ 2025-05-07 12:46 ` Linus Walleij
2025-05-07 16:56 ` Mateusz Guzik
2025-05-07 12:46 ` [PATCH v2 4/5] fork: check charging success before zeroing stack Linus Walleij
` (2 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Linus Walleij @ 2025-05-07 12:46 UTC (permalink / raw)
To: Andrew Morton, linux-mm, Pasha Tatashin; +Cc: Linus Walleij
From: Pasha Tatashin <pasha.tatashin@soleen.com>
In many places the number of pages in the stack is determined via
(THREAD_SIZE / PAGE_SIZE). There is also a BUG_ON() that ensures that
(THREAD_SIZE / PAGE_SIZE) is indeed equal to vm_area->nr_pages.
Consistently use vm_area->nr_pages instead to determine the actual
number of pages allocated for the stack, so it is clear what is
going on here.
The assignment of a local nr_pages variable in
memcg_charge_kernel_stack() takes roughly the same amount of time
as the BUG_ON() assertion did, and the two other sites arguably
should have had a BUG_ON() assertion as well.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/20240311164638.2015063-5-pasha.tatashin@soleen.com
[linus.walleij@linaro.org: Rebased, initialized helper variables in declarations]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
kernel/fork.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 8b8457562740c114c640a8cc230876f6a286b246..c60a0ec61a421324e3733b506d99531c5965cbc6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -253,10 +253,9 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm_area)
int i;
int ret;
int nr_charged = 0;
+ int nr_pages = vm_area->nr_pages;
- BUG_ON(vm_area->nr_pages != THREAD_SIZE / PAGE_SIZE);
-
- for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+ for (i = 0; i < nr_pages; i++) {
ret = memcg_kmem_charge_page(vm_area->pages[i], GFP_KERNEL, 0);
if (ret)
goto err;
@@ -516,9 +515,10 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
{
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
struct vm_struct *vm_area = task_stack_vm_area(tsk);
+ int nr_pages = vm_area->nr_pages;
int i;
- for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+ for (i = 0; i < nr_pages; i++)
mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
account * (PAGE_SIZE / 1024));
} else {
@@ -535,11 +535,11 @@ void exit_task_stack_account(struct task_struct *tsk)
account_kernel_stack(tsk, -1);
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
- struct vm_struct *vm_area;
+ struct vm_struct *vm_area = task_stack_vm_area(tsk);
+ int nr_pages = vm_area->nr_pages;
int i;
- vm_area = task_stack_vm_area(tsk);
- for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+ for (i = 0; i < nr_pages; i++)
memcg_kmem_uncharge_page(vm_area->pages[i], 0);
}
}
--
2.49.0
* [PATCH v2 4/5] fork: check charging success before zeroing stack
2025-05-07 12:46 [PATCH v2 0/5] fork: Page operation cleanups in the fork code Linus Walleij
` (2 preceding siblings ...)
2025-05-07 12:46 ` [PATCH v2 3/5] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE Linus Walleij
@ 2025-05-07 12:46 ` Linus Walleij
2025-05-13 7:38 ` Mike Rapoport
2025-05-07 12:46 ` [PATCH v2 5/5] fork: zero vmap stack using clear_page() instead of memset() Linus Walleij
2025-05-07 17:03 ` [PATCH v2 0/5] fork: Page operation cleanups in the fork code Mateusz Guzik
5 siblings, 1 reply; 16+ messages in thread
From: Linus Walleij @ 2025-05-07 12:46 UTC (permalink / raw)
To: Andrew Morton, linux-mm, Pasha Tatashin; +Cc: Linus Walleij
From: Pasha Tatashin <pasha.tatashin@soleen.com>
There is no need to zero a cached stack if the memcg charge fails, so
move the charging attempt before the memset() operation.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/20240311164638.2015063-6-pasha.tatashin@soleen.com
[linus.walleij@linaro.org: Rebased]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
kernel/fork.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index c60a0ec61a421324e3733b506d99531c5965cbc6..6ac4674fdf04081fbcbf1eb99167a4c990a58506 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -279,6 +279,11 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
if (!vm_area)
continue;
+ if (memcg_charge_kernel_stack(vm_area)) {
+ vfree(vm_area->addr);
+ return -ENOMEM;
+ }
+
/* Reset stack metadata. */
kasan_unpoison_range(vm_area->addr, THREAD_SIZE);
@@ -287,11 +292,6 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
/* Clear stale pointers from reused stack. */
memset(stack, 0, THREAD_SIZE);
- if (memcg_charge_kernel_stack(vm_area)) {
- vfree(vm_area->addr);
- return -ENOMEM;
- }
-
tsk->stack_vm_area = vm_area;
tsk->stack = stack;
return 0;
--
2.49.0
* [PATCH v2 5/5] fork: zero vmap stack using clear_page() instead of memset()
2025-05-07 12:46 [PATCH v2 0/5] fork: Page operation cleanups in the fork code Linus Walleij
` (3 preceding siblings ...)
2025-05-07 12:46 ` [PATCH v2 4/5] fork: check charging success before zeroing stack Linus Walleij
@ 2025-05-07 12:46 ` Linus Walleij
2025-05-07 16:51 ` Mateusz Guzik
2025-05-07 17:03 ` [PATCH v2 0/5] fork: Page operation cleanups in the fork code Mateusz Guzik
5 siblings, 1 reply; 16+ messages in thread
From: Linus Walleij @ 2025-05-07 12:46 UTC (permalink / raw)
To: Andrew Morton, linux-mm, Pasha Tatashin; +Cc: Linus Walleij
From: Pasha Tatashin <pasha.tatashin@soleen.com>
Do not zero the whole span of the stack with one memset(); instead
clear only the pages that are part of the vm_area.
As several architectures have optimized implementations of
clear_page(), this gives the architecture a clearer idea
of what is going on and speeds up the clearing operation.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/20240311164638.2015063-7-pasha.tatashin@soleen.com
[linus.walleij@linaro.org: Rebased]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
kernel/fork.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 6ac4674fdf04081fbcbf1eb99167a4c990a58506..d3f000b846c634221c7dafe6e64185a276c5a08b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -271,13 +271,15 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm_area)
static int alloc_thread_stack_node(struct task_struct *tsk, int node)
{
struct vm_struct *vm_area;
+ int nr_pages;
void *stack;
- int i;
+ int i, j;
for (i = 0; i < NR_CACHED_STACKS; i++) {
vm_area = this_cpu_xchg(cached_stacks[i], NULL);
if (!vm_area)
continue;
+ nr_pages = vm_area->nr_pages;
if (memcg_charge_kernel_stack(vm_area)) {
vfree(vm_area->addr);
@@ -290,7 +292,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
stack = kasan_reset_tag(vm_area->addr);
/* Clear stale pointers from reused stack. */
- memset(stack, 0, THREAD_SIZE);
+ for (j = 0; j < nr_pages; j++)
+ clear_page(page_address(vm_area->pages[j]));
tsk->stack_vm_area = vm_area;
tsk->stack = stack;
--
2.49.0
* Re: [PATCH v2 5/5] fork: zero vmap stack using clear_page() instead of memset()
2025-05-07 12:46 ` [PATCH v2 5/5] fork: zero vmap stack using clear_page() instead of memset() Linus Walleij
@ 2025-05-07 16:51 ` Mateusz Guzik
2025-05-09 5:49 ` Linus Walleij
0 siblings, 1 reply; 16+ messages in thread
From: Mateusz Guzik @ 2025-05-07 16:51 UTC (permalink / raw)
To: Linus Walleij; +Cc: Andrew Morton, linux-mm, Pasha Tatashin
On Wed, May 07, 2025 at 02:46:31PM +0200, Linus Walleij wrote:
> From: Pasha Tatashin <pasha.tatashin@soleen.com>
>
> Do not zero the whole span of the stack, but instead only the
> pages that are part of the vm_area.
>
> As several architectures have optimized implementations of
> clear_page(), this will give the architecture a clear idea
> of what is going on and will speed up the clearing operation.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> Link: https://lore.kernel.org/20240311164638.2015063-7-pasha.tatashin@soleen.com
> [linus.walleij@linaro.org: Rebased]
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---
> kernel/fork.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 6ac4674fdf04081fbcbf1eb99167a4c990a58506..d3f000b846c634221c7dafe6e64185a276c5a08b 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -271,13 +271,15 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm_area)
> static int alloc_thread_stack_node(struct task_struct *tsk, int node)
> {
> struct vm_struct *vm_area;
> + int nr_pages;
> void *stack;
> - int i;
> + int i, j;
>
> for (i = 0; i < NR_CACHED_STACKS; i++) {
> vm_area = this_cpu_xchg(cached_stacks[i], NULL);
> if (!vm_area)
> continue;
> + nr_pages = vm_area->nr_pages;
>
> if (memcg_charge_kernel_stack(vm_area)) {
> vfree(vm_area->addr);
> @@ -290,7 +292,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
> stack = kasan_reset_tag(vm_area->addr);
>
> /* Clear stale pointers from reused stack. */
> - memset(stack, 0, THREAD_SIZE);
> + for (j = 0; j < nr_pages; j++)
> + clear_page(page_address(vm_area->pages[j]));
>
> tsk->stack_vm_area = vm_area;
> tsk->stack = stack;
>
> --
> 2.49.0
>
>
I don't know if the logic works here as far as the speed-up goes -- you
are in fact *taking away* information about what the caller is doing.
In order to actually allow archs to optimize this you would need a
clearing function which takes the page count.
For example, suppose 'rep stosb' is the optimal way to handle 16K
stacks. With a clear_pages(addr, n) routine it gets issued once with the
appropriate size the CPU knows about. In the code as provided here you
get 4 invocations instead.
There was a patchset to support multi-page clearing, but it only covers
x86-64: https://lore.kernel.org/all/20250414034607.762653-1-ankur.a.arora@oracle.com/
Side note: funnily enough, memset on x86-64 happens to be pretty bad at
the moment, never using rep stos regardless of size. But that's an
argument for adding rep, after which perf will be about the same.
tl;dr: I think this patch should be dropped. If multi-page clearing
shows up for more archs, then this is a thing to reconsider.
* Re: [PATCH v2 3/5] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
2025-05-07 12:46 ` [PATCH v2 3/5] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE Linus Walleij
@ 2025-05-07 16:56 ` Mateusz Guzik
2025-05-09 5:44 ` Linus Walleij
0 siblings, 1 reply; 16+ messages in thread
From: Mateusz Guzik @ 2025-05-07 16:56 UTC (permalink / raw)
To: Linus Walleij; +Cc: Andrew Morton, linux-mm, Pasha Tatashin
On Wed, May 07, 2025 at 02:46:29PM +0200, Linus Walleij wrote:
> From: Pasha Tatashin <pasha.tatashin@soleen.com>
>
> In many places number of pages in the stack is detremined via
> (THREAD_SIZE / PAGE_SIZE). There is also a BUG_ON() that ensures that
> (THREAD_SIZE / PAGE_SIZE) is indeed equal to vm_area->nr_pages.
>
> Consistently use vm_area->nr_pages to determine the actual number
> of pages allocated in the stack instead, so it is clear what is
> going on here.
>
> The assignment of a local variable nr_pages in
> memcg_charge_kernel_stack() takes roughly the same amount of time
> as the BUG() assertion, and the two other sites arguably should
> have had a BUG() assertion as well.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> Link: https://lore.kernel.org/20240311164638.2015063-5-pasha.tatashin@soleen.com
> [linus.walleij@linaro.orh: Rebased, initialized helpers variables in declaration]
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---
> kernel/fork.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 8b8457562740c114c640a8cc230876f6a286b246..c60a0ec61a421324e3733b506d99531c5965cbc6 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -253,10 +253,9 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm_area)
> int i;
> int ret;
> int nr_charged = 0;
> + int nr_pages = vm_area->nr_pages;
>
> - BUG_ON(vm_area->nr_pages != THREAD_SIZE / PAGE_SIZE);
> -
> - for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
> + for (i = 0; i < nr_pages; i++) {
> ret = memcg_kmem_charge_page(vm_area->pages[i], GFP_KERNEL, 0);
> if (ret)
> goto err;
> @@ -516,9 +515,10 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
> {
> if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> struct vm_struct *vm_area = task_stack_vm_area(tsk);
> + int nr_pages = vm_area->nr_pages;
> int i;
>
> - for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
> + for (i = 0; i < nr_pages; i++)
> mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
> account * (PAGE_SIZE / 1024));
> } else {
> @@ -535,11 +535,11 @@ void exit_task_stack_account(struct task_struct *tsk)
> account_kernel_stack(tsk, -1);
>
> if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> - struct vm_struct *vm_area;
> + struct vm_struct *vm_area = task_stack_vm_area(tsk);
> + int nr_pages = vm_area->nr_pages;
> int i;
>
> - vm_area = task_stack_vm_area(tsk);
> - for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
> + for (i = 0; i < nr_pages; i++)
> memcg_kmem_uncharge_page(vm_area->pages[i], 0);
> }
> }
>
> --
> 2.49.0
>
>
I think the change makes it much less clear what's going on here, as it
suggests the page count for stacks is variable instead of fixed.
I can agree that the open-coded "THREAD_SIZE / PAGE_SIZE" is not the
best way to express it, though.
How about THREAD_STACK_PAGE_COUNT or similar to hide the above division?
* Re: [PATCH v2 0/5] fork: Page operation cleanups in the fork code
2025-05-07 12:46 [PATCH v2 0/5] fork: Page operation cleanups in the fork code Linus Walleij
` (4 preceding siblings ...)
2025-05-07 12:46 ` [PATCH v2 5/5] fork: zero vmap stack using clear_page() instead of memset() Linus Walleij
@ 2025-05-07 17:03 ` Mateusz Guzik
2025-05-09 6:57 ` Linus Walleij
5 siblings, 1 reply; 16+ messages in thread
From: Mateusz Guzik @ 2025-05-07 17:03 UTC (permalink / raw)
To: Linus Walleij; +Cc: Andrew Morton, linux-mm, Pasha Tatashin
On Wed, May 07, 2025 at 02:46:26PM +0200, Linus Walleij wrote:
> This patch set consists of outtakes from a 1 year+ old
> patch set from Pasha, which all stand on their own.
> See:
> https://lore.kernel.org/all/20240311164638.2015063-1-pasha.tatashin@soleen.com/
>
> What the code mainly does is make the fork.c file more
> oriented around pages and remove reliance on THREAD_SIZE.
>
> These are good cleanups for readability and in one case
> (last patch using clear_page()) a performance improvement,
> so I split these off, rebased on v6.15-rc1, addressed review
> comments and send them separately.
>
> All mentions of dynamic stack are removed from the patch
> set as we have no idea whether that will go anywhere.
>
> This is mostly MM related so when the patches are ready
> I expect they would land in Andrew's patch stack.
>
The ifdef clean-up looks fine.
As far as readability is concerned, I think a big problem is the code
duplication in alloc_thread_stack_node().
The code should instead obtain the stack space and only then do the
memcg charge and tsk variable population work.
The real problem here is that the cache is incredibly primitive and in
fact has a performance bug: it ignores NUMA affinity -- pages get added
to the cache and removed from it without any regard for that aspect or
the node parameter!
* Re: [PATCH v2 3/5] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
2025-05-07 16:56 ` Mateusz Guzik
@ 2025-05-09 5:44 ` Linus Walleij
0 siblings, 0 replies; 16+ messages in thread
From: Linus Walleij @ 2025-05-09 5:44 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: Andrew Morton, linux-mm, Pasha Tatashin
On Wed, May 7, 2025 at 6:56 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> I think the change makes it very much less clear what's going on here as it
> suggests the page count for stacks is variable instead of fixed.
You're right, it belongs better in the original patch series where variable
page count was the intention. I'll just try to drop it for now, I think it
will be fine anyway.
Yours,
Linus Walleij
* Re: [PATCH v2 5/5] fork: zero vmap stack using clear_page() instead of memset()
2025-05-07 16:51 ` Mateusz Guzik
@ 2025-05-09 5:49 ` Linus Walleij
0 siblings, 0 replies; 16+ messages in thread
From: Linus Walleij @ 2025-05-09 5:49 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: Andrew Morton, linux-mm, Pasha Tatashin
On Wed, May 7, 2025 at 6:51 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> I don't know if the logic works here as far as the speed up goes -- you
> are in fact *taking away* information what the caller is doing.
>
> In order to actually allow archs to optimize this you would need a
> clearing func which grabs the page count.
Aha, so the information about how many pages get cleared is taken away
from the call, assuming that information could be used by the arch.
> There was a patchset to support multi-page clearing, but it only covers
> x86-64: https://lore.kernel.org/all/20250414034607.762653-1-ankur.a.arora@oracle.com/
(...)
> tl;dr I think this patch should be dropped. if multi-page clearing shows
> up for more archs then this is a thing to consider.
Yeah, I see the point, but I'm pretty sure I can make multi-page
clearing work on aarch64 if that patch set goes in, and then
we can do something like this patch but for multiple pages.
I'll put it on the back burner and resend.
Thanks Mateusz,
Linus Walleij
* Re: [PATCH v2 0/5] fork: Page operation cleanups in the fork code
2025-05-07 17:03 ` [PATCH v2 0/5] fork: Page operation cleanups in the fork code Mateusz Guzik
@ 2025-05-09 6:57 ` Linus Walleij
2025-05-09 11:16 ` Mateusz Guzik
0 siblings, 1 reply; 16+ messages in thread
From: Linus Walleij @ 2025-05-09 6:57 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: Andrew Morton, linux-mm, Pasha Tatashin
On Wed, May 7, 2025 at 7:03 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> If readability is concerned I think a big problem is code duplication in
> alloc_thread_stack_node().
>
> The code instead obtain the stack space and only then do the memcg
> charge and tsk var popoulation work.
>
> The real problem here is that the cache is incredibly primitive and in
> fact with a performance bug (ignoring NUMA affinity -- pages get added
> to the cache and removed from it without any regard for that aspect or
> the node parameter!)
Yeah, you're right.
We allocate with:
THREADINFO_GFP & ~__GFP_ACCOUNT
which is a complicated way of writing:
GFP_KERNEL | __GFP_ZERO
... i.e. not really threadinfo flags at all, rather GFP_KERNEL
in disguise.
This is a result of the define originally allowing __GFP_HIGHMEM,
and of the independent changes:
19809c2da28aee5860ad9a2eff760730a0710df0
9b6f7e163cd0f468d1b9696b785659d3c27c8667
I'll try to think of some patch to make this look less confusing.
And when the allocated stacks are reused, as you say the NUMA node
is completely ignored.
The rationale is in commit ac496bf48d97f2503eaa353996a4dd5e4383eaf0.
I wonder what would be the best way to make this better? Improve
the cache, or assume that the MM/VMM itself should be able to sort this
out in an optimized way by now and put that to the test by simply
deleting the cache and seeing what happens?
Yours,
Linus Walleij
* Re: [PATCH v2 0/5] fork: Page operation cleanups in the fork code
2025-05-09 6:57 ` Linus Walleij
@ 2025-05-09 11:16 ` Mateusz Guzik
0 siblings, 0 replies; 16+ messages in thread
From: Mateusz Guzik @ 2025-05-09 11:16 UTC (permalink / raw)
To: Linus Walleij; +Cc: Andrew Morton, linux-mm, Pasha Tatashin
On Fri, May 9, 2025 at 8:58 AM Linus Walleij <linus.walleij@linaro.org> wrote:
> And when the allocated stacks are reused, as you say the NUMA node
> is completely ignored.
>
> The rationale is in commit ac496bf48d97f2503eaa353996a4dd5e4383eaf0.
>
> I wonder what would be the best way to make this better? Improve
> the cache, or assume that the MM/VMM itself should by now be able to
> sort this out in an optimized way, and put that to the test by simply
> deleting the cache and seeing what happens?
>
The cache can be trivially touched up to reduce the problem:
scope_guard(preempt) around allocation and free from the cache and
check that the requested domain (including no domain) matches. There
is some already existing code to check which domains back pages on
free.
The real problem, which extends beyond just stack allocation, is that at
some point you are going to get a process initialized on one domain
but only ever running on another.
I was speculating this could be mostly fixed by asking the scheduler
in which domain it thinks the process will run, and then doing all the
allocations with that domain as the requested target. Sorting this out
is way beyond the scope of the immediate problem, though.
That aside, the current caching mechanism is quite toy-ish, but I don't
have time right now to implement an RFC with something better.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/5] fork: Clean-up ifdef logic around stack allocation
2025-05-07 12:46 ` [PATCH v2 1/5] fork: Clean-up ifdef logic around stack allocation Linus Walleij
@ 2025-05-13 6:29 ` Mike Rapoport
0 siblings, 0 replies; 16+ messages in thread
From: Mike Rapoport @ 2025-05-13 6:29 UTC (permalink / raw)
To: Linus Walleij; +Cc: Andrew Morton, linux-mm, Pasha Tatashin
On Wed, May 07, 2025 at 02:46:27PM +0200, Linus Walleij wrote:
> From: Pasha Tatashin <pasha.tatashin@soleen.com>
>
> There is an unneeded OR in the ifdef logic around the functions that are
> used to allocate and free kernel stacks based on direct map or vmap.
>
> Therefore, clean up by Changing the order so OR is no longer needed.
nit: s/Changing/changing/
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> Link: https://lore.kernel.org/20240311164638.2015063-3-pasha.tatashin@soleen.com
> [linus.walleij@linaro.org: Rebased]
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> kernel/fork.c | 22 +++++++++++-----------
> 1 file changed, 11 insertions(+), 11 deletions(-)
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/5] fork: Clean-up naming of vm_stack/vm_struct variables in vmap stacks code
2025-05-07 12:46 ` [PATCH v2 2/5] fork: Clean-up naming of vm_stack/vm_struct variables in vmap stacks code Linus Walleij
@ 2025-05-13 6:33 ` Mike Rapoport
0 siblings, 0 replies; 16+ messages in thread
From: Mike Rapoport @ 2025-05-13 6:33 UTC (permalink / raw)
To: Linus Walleij; +Cc: Andrew Morton, linux-mm, Pasha Tatashin
On Wed, May 07, 2025 at 02:46:28PM +0200, Linus Walleij wrote:
> From: Pasha Tatashin <pasha.tatashin@soleen.com>
>
> There are two data types, "struct vm_struct" and "struct vm_stack", that
> share the same local variable names -- vm_stack, vm, or s -- which makes
> the code confusing to read.
>
> Change the code so the naming is consistent:
>
> struct vm_struct is always called vm_area
> struct vm_stack is always called vm_stack
>
> One change altering vfree(vm_stack) to vfree(vm_area->addr) may look
> like a semantic change but it is not: vm_area->addr points to the
> vm_stack. This was done to improve readability.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> Link: https://lore.kernel.org/20240311164638.2015063-4-pasha.tatashin@soleen.com
> [linus.walleij@linaro.org: Rebased and added new users of the variable names, address review comments]
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> kernel/fork.c | 60 +++++++++++++++++++++++++++++------------------------------
> 1 file changed, 29 insertions(+), 31 deletions(-)
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 4/5] fork: check charging success before zeroing stack
2025-05-07 12:46 ` [PATCH v2 4/5] fork: check charging success before zeroing stack Linus Walleij
@ 2025-05-13 7:38 ` Mike Rapoport
0 siblings, 0 replies; 16+ messages in thread
From: Mike Rapoport @ 2025-05-13 7:38 UTC (permalink / raw)
To: Linus Walleij; +Cc: Andrew Morton, linux-mm, Pasha Tatashin
On Wed, May 07, 2025 at 02:46:30PM +0200, Linus Walleij wrote:
> From: Pasha Tatashin <pasha.tatashin@soleen.com>
>
> There is no need to zero a cached stack if the memcg charge fails, so
> move the charging attempt before the memset() operation.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> Link: https://lore.kernel.org/20240311164638.2015063-6-pasha.tatashin@soleen.com
> [linus.walleij@linaro.org: Rebased]
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> kernel/fork.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2025-05-13 7:38 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-07 12:46 [PATCH v2 0/5] fork: Page operation cleanups in the fork code Linus Walleij
2025-05-07 12:46 ` [PATCH v2 1/5] fork: Clean-up ifdef logic around stack allocation Linus Walleij
2025-05-13 6:29 ` Mike Rapoport
2025-05-07 12:46 ` [PATCH v2 2/5] fork: Clean-up naming of vm_stack/vm_struct variables in vmap stacks code Linus Walleij
2025-05-13 6:33 ` Mike Rapoport
2025-05-07 12:46 ` [PATCH v2 3/5] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE Linus Walleij
2025-05-07 16:56 ` Mateusz Guzik
2025-05-09 5:44 ` Linus Walleij
2025-05-07 12:46 ` [PATCH v2 4/5] fork: check charging success before zeroing stack Linus Walleij
2025-05-13 7:38 ` Mike Rapoport
2025-05-07 12:46 ` [PATCH v2 5/5] fork: zero vmap stack using clear_page() instead of memset() Linus Walleij
2025-05-07 16:51 ` Mateusz Guzik
2025-05-09 5:49 ` Linus Walleij
2025-05-07 17:03 ` [PATCH v2 0/5] fork: Page operation cleanups in the fork code Mateusz Guzik
2025-05-09 6:57 ` Linus Walleij
2025-05-09 11:16 ` Mateusz Guzik