From: David Stevens <stevensd@google.com>
To: Pasha Tatashin <pasha.tatashin@soleen.com>,
Linus Walleij <linus.walleij@linaro.org>,
Will Deacon <willdeacon@google.com>,
Quentin Perret <qperret@google.com>,
Thomas Gleixner <tglx@kernel.org>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
Andy Lutomirski <luto@kernel.org>, Xin Li <xin@zytor.com>,
Peter Zijlstra <peterz@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Uladzislau Rezki <urezki@gmail.com>, Kees Cook <kees@kernel.org>
Cc: David Stevens <stevensd@google.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v2 07/13] fork: Dynamic Kernel Stacks
Date: Fri, 24 Apr 2026 12:14:50 -0700
Message-ID: <20260424191456.2679717-8-stevensd@google.com>
In-Reply-To: <20260424191456.2679717-1-stevensd@google.com>
From: Pasha Tatashin <pasha.tatashin@soleen.com>
The core implementation of dynamic kernel stacks.
Unlike traditional kernel stacks, these stacks auto-grow as they are
used. This makes it possible to save a significant amount of memory in
fleet environments. It also means the default kernel stack size could
potentially be increased to prevent stack overflows, without a
corresponding increase in overall memory overhead.
The dynamic kernel stack interface provides two global functions (a
usage sketch follows this list):
1. dynamic_stack_fault()
Architectures that support dynamic kernel stacks must call this
function to handle a fault in the stack area. It allocates and maps
new pages into the stack; the pages are maintained in a per-cpu data
structure.
2. dynamic_stack()
Must be called when a thread is leaving the CPU, to check whether the
thread has allocated dynamic stack pages (i.e. tsk->flags &
PF_DYNAMIC_STACK is set). If so, two things need to be done:
a. Charge the thread for the allocated stack pages.
b. Refill the per-cpu array so the next thread can also fault.
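As an illustration of how an architecture would wire this up (this
sketch is not part of the patch; the hook name and the surrounding
entry conditions are hypothetical, only the dynamic_stack_fault() call
and its out-parameter match this series), a kernel-mode page fault path
could use the interface roughly like this:

    /*
     * Hypothetical arch hook, called early in the kernel page fault
     * path. Names other than dynamic_stack_fault() are illustrative.
     */
    static bool arch_try_dynamic_stack_fault(struct pt_regs *regs,
                                             unsigned long address)
    {
            bool on_stack;

            /* Only kernel-mode faults can be stack growth faults. */
            if (user_mode(regs))
                    return false;

            /*
             * Maps pages from the per-cpu pool if @address falls into
             * the unpopulated part of current's stack, and sets
             * PF_DYNAMIC_STACK so the next __schedule() refills the
             * pool and charges the pages.
             */
            if (dynamic_stack_fault(current, address, &on_stack))
                    return true;    /* handled, retry the access */

            /*
             * The address was on the stack but could not be mapped
             * (e.g. the per-cpu pool was empty): treat it like a
             * stack overflow.
             */
            if (on_stack)
                    panic("unhandled kernel stack fault");

            return false;           /* not a stack fault */
    }

The scheduler side is covered by the dynamic_stack(prev) call that this
patch adds to __schedule().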
Dynamic kernel stacks do not support "STACK_END_MAGIC", as the last
page does not have to be faulted in. However, since they are based on
vmap stacks, guard pages still protect dynamic kernel stacks from
overflow.
The average stack depth of a kernel thread depends on the workload,
profiling, virtualization, compiler optimizations, and driver
implementations, so the numbers should be measured for a specific
workload. In my tests I found the following values on freshly booted,
idling machines:
CPU #Cores #Stacks Regular(kb) Dynamic(kb)
AMD Genoa 384 5786 92576 23388
Intel Skylake 112 3182 50912 12860
AMD Rome 128 3401 54416 14784
AMD Rome 256 4908 78528 20876
Intel Haswell 72 2644 42304 10624
On all machines dynamic kernel stacks take about 25% of the original
stack memory: the regular column corresponds to 16 KB (THREAD_SIZE on
x86-64) per stack, while dynamic stacks average roughly 4 KB. Only 5%
of active tasks performed a stack page fault during their lifetime.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Rebased, used vm_area->nr_pages directly in one instance]
[Depends on !PREEMPT_RT]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[Fix races around accounting]
[Use GFP_ATOMIC when executing in the scheduler]
[Depend on INIT_STACK_ALL_* config]
[Fix bugs in some error paths and edge cases]
[Don't cache partially faulted stacks]
[Added out-var to tell if address is on target stack]
Signed-off-by: David Stevens <stevensd@google.com>
---
arch/Kconfig | 39 ++++
include/linux/sched.h | 11 +-
include/linux/sched/task_stack.h | 47 +++-
init/init_task.c | 4 +
kernel/fork.c | 357 +++++++++++++++++++++++++++++--
kernel/sched/core.c | 1 +
6 files changed, 439 insertions(+), 20 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 102ddbd4298e..95ded79f0825 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1515,6 +1515,45 @@ config VMAP_STACK
backing virtual mappings with real shadow memory, and KASAN_VMALLOC
must be enabled.
+config HAVE_ARCH_DYNAMIC_STACK
+ def_bool n
+ help
+ An arch should select this symbol if it can support kernel stacks
+ that grow dynamically.
+
+ - Arch must have support for HAVE_ARCH_VMAP_STACK, in order to handle
+ stack related page faults.
+
+ - Arch must be able to fault from interrupt context.
+
+ - Arch must allow the kernel to handle stack faults gracefully, even
+ during interrupt handling.
+
+ - Exceptions such as failing to allocate a page should be handled
+ in a consistent and predictable way, i.e. the same way as a stack
+ overflow that touches the guard pages, with extra information about
+ the allocation error.
+
+config DYNAMIC_STACK
+ default y
+ bool "Dynamically grow kernel stacks"
+ depends on THREAD_INFO_IN_TASK
+ depends on HAVE_ARCH_DYNAMIC_STACK
+ depends on VMAP_STACK
+ depends on INIT_STACK_ALL_ZERO || INIT_STACK_ALL_PATTERN
+ depends on !KASAN
+ depends on !DEBUG_STACK_USAGE
+ depends on !STACK_GROWSUP
+ depends on !PREEMPT_RT
+ help
+ Dynamic kernel stacks allow saving memory on machines with a lot of
+ threads by starting with small stacks and growing them only when
+ needed. On workloads where most stacks never grow beyond one page,
+ the memory saving can be substantial. The feature requires virtually
+ mapped kernel stacks in order to handle page faults, and it requires
+ stack initialization to preclude one thread from faulting on another
+ thread's stack.
+
config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
def_bool n
help
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5a5d3dbc9cdf..7aa06233afd5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -836,7 +836,11 @@ struct task_struct {
*/
randomized_struct_fields_start
+#ifdef CONFIG_DYNAMIC_STACK
+ unsigned long packed_stack;
+#else
void *stack;
+#endif
refcount_t usage;
/* Per task flags (PF_*), defined further below: */
unsigned int flags;
@@ -1563,6 +1567,11 @@ struct task_struct {
struct timer_list oom_reaper_timer;
#endif
#ifdef CONFIG_VMAP_STACK
+ /*
+ * We can't call find_vm_area() in interrupt context, and
+ * free_thread_stack() can be called in interrupt context,
+ * so cache the vm_struct.
+ */
struct vm_struct *stack_vm_area;
#endif
#ifdef CONFIG_THREAD_INFO_IN_TASK
@@ -1773,7 +1782,7 @@ extern struct pid *cad_pid;
* I am cleaning dirty pages from some other bdi. */
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
-#define PF__HOLE__00800000 0x00800000
+#define PF_DYNAMIC_STACK 0x00800000 /* This thread allocated dynamic stack pages */
#define PF__HOLE__01000000 0x01000000
#define PF__HOLE__02000000 0x02000000
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 1fab7e9043a3..7dcff2836d7e 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -13,6 +13,10 @@
#ifdef CONFIG_THREAD_INFO_IN_TASK
+#ifdef CONFIG_DYNAMIC_STACK
+#define DYNAMIC_STACK_MAX_ACCOUNT_MASK ((1 << (THREAD_SIZE_ORDER + 1)) - 1)
+#endif
+
/*
* When accessing the stack of a non-current task that might exit, use
* try_get_task_stack() instead. task_stack_page will return a pointer
@@ -20,7 +24,11 @@
*/
static __always_inline void *task_stack_page(const struct task_struct *task)
{
+#ifdef CONFIG_DYNAMIC_STACK
+ return (void *)(task->packed_stack & ~DYNAMIC_STACK_MAX_ACCOUNT_MASK);
+#else
return task->stack;
+#endif
}
#define setup_thread_stack(new,old) do { } while(0)
@@ -30,7 +38,7 @@ static __always_inline unsigned long *end_of_stack(const struct task_struct *tas
#ifdef CONFIG_STACK_GROWSUP
return (unsigned long *)((unsigned long)task->stack + THREAD_SIZE) - 1;
#else
- return task->stack;
+ return task_stack_page(task);
#endif
}
@@ -83,9 +91,45 @@ static inline void put_task_stack(struct task_struct *tsk) {}
void exit_task_stack_account(struct task_struct *tsk);
+#ifdef CONFIG_DYNAMIC_STACK
+
+#define task_stack_end_corrupted(task) 0
+
+#ifndef THREAD_PREALLOC_PAGES
+#define THREAD_PREALLOC_PAGES 1
+#endif
+
+#define THREAD_DYNAMIC_PAGES \
+ ((THREAD_SIZE >> PAGE_SHIFT) - THREAD_PREALLOC_PAGES)
+
+void dynamic_stack_refill_pages(void);
+unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize);
+bool dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack);
+
+/*
+ * Refill and charge for the used pages.
+ */
+static inline void dynamic_stack(struct task_struct *tsk)
+{
+ if (unlikely(tsk->flags & PF_DYNAMIC_STACK)) {
+ dynamic_stack_refill_pages();
+ dynamic_stack_accounting(tsk, false);
+ tsk->flags &= ~PF_DYNAMIC_STACK;
+ }
+}
+
+static inline void set_task_stack_end_magic(struct task_struct *tsk) {}
+
+#else /* !CONFIG_DYNAMIC_STACK */
+
#define task_stack_end_corrupted(task) \
(*(end_of_stack(task)) != STACK_END_MAGIC)
+void set_task_stack_end_magic(struct task_struct *tsk);
+static inline void dynamic_stack(struct task_struct *tsk) {}
+
+#endif /* CONFIG_DYNAMIC_STACK */
+
static inline int object_is_on_stack(const void *obj)
{
void *stack = task_stack_page(current);
@@ -104,7 +148,6 @@ static inline unsigned long stack_not_used(struct task_struct *p)
return 0;
}
#endif
-extern void set_task_stack_end_magic(struct task_struct *tsk);
static inline int kstack_end(void *addr)
{
diff --git a/init/init_task.c b/init/init_task.c
index 5c838757fc10..e3645ec4ab02 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -99,7 +99,11 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.stack_refcount = REFCOUNT_INIT(1),
#endif
.__state = 0,
+#ifdef CONFIG_DYNAMIC_STACK
+ .packed_stack = (unsigned long)init_stack,
+#else
.stack = init_stack,
+#endif
.usage = REFCOUNT_INIT(2),
.flags = PF_KTHREAD,
.prio = MAX_PRIO - 20,
diff --git a/kernel/fork.c b/kernel/fork.c
index 01e0bf4f4b02..e615ef736dc0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -202,7 +202,10 @@ static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
* accounting is performed by the code assigning/releasing stacks to tasks.
* We need a zeroed memory without __GFP_ACCOUNT.
*/
-#define GFP_VMAP_STACK (GFP_KERNEL | __GFP_ZERO)
+static gfp_t vmap_stack_gfp(bool is_atomic)
+{
+ return (is_atomic ? GFP_ATOMIC : GFP_KERNEL) | __GFP_ZERO;
+}
struct vm_stack {
struct rcu_work work;
@@ -241,6 +244,18 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
unsigned int i;
int nid;
+#ifdef CONFIG_DYNAMIC_STACK
+ /*
+ * Skip the cache for populated dynamic stacks to avoid punishing a
+ * memcg with a larger charge just because it happened to pick up a
+ * dynamic stack that's been partially faulted in. We may get a lower
+ * number of cache hits, but stacks with dynamically faulted pages
+ * should be fairly uncommon.
+ */
+ if (vm_area->nr_pages != THREAD_PREALLOC_PAGES)
+ return false;
+#endif /* CONFIG_DYNAMIC_STACK */
+
/*
* Don't cache stacks if any of the pages don't match the local domain, unless
* there is no local memory to begin with.
@@ -269,11 +284,285 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
return false;
}
+#ifdef CONFIG_DYNAMIC_STACK
+
+/*
+ * There is a window between when a thread refills the page pool and when it
+ * actually gets scheduled out where it can still consume pages from the pool.
+ * To guarantee the next thread has enough pages to fully populate its stack,
+ * double the size of the page pool.
+ */
+#define DYNSTK_PAGE_POOL_NR (THREAD_DYNAMIC_PAGES * 2)
+
+static DEFINE_PER_CPU(struct page *, dynamic_stack_pages[DYNSTK_PAGE_POOL_NR]);
+
+static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_struct *vm_area)
+{
+ tsk->stack_vm_area = vm_area;
+ tsk->packed_stack = (unsigned long)kasan_reset_tag(vm_area->addr);
+}
+
+static void free_vmap_stack(struct vm_struct *vm_area)
+{
+ int i;
+
+ remove_vm_area(vm_area->addr);
+
+ for (i = 0; i < vm_area->nr_pages; i++)
+ __free_page(vm_area->pages[i]);
+
+ kfree(vm_area->pages);
+ kfree(vm_area);
+}
+
+static struct vm_struct *alloc_vmap_stack(int node)
+{
+ gfp_t gfp = vmap_stack_gfp(false);
+ unsigned long addr, end;
+ struct vm_struct *vm_area;
+ int err, i;
+
+ /*
+ * Paranoid check to guarantee we never straddle a page table, so
+ * that virt_to_kpte() is always valid in dynamic_stack_fault().
+ */
+ BUILD_BUG_ON((PMD_SIZE % THREAD_SIZE) || (THREAD_ALIGN % THREAD_SIZE));
+
+ vm_area = get_vm_area_node(THREAD_SIZE, THREAD_ALIGN, VM_MAP, node,
+ gfp, __builtin_return_address(0));
+ if (!vm_area)
+ return NULL;
+
+ vm_area->pages = kmalloc_node(sizeof(void *) *
+ (THREAD_SIZE >> PAGE_SHIFT), gfp, node);
+ if (!vm_area->pages)
+ goto cleanup_err;
+
+ for (i = 0; i < THREAD_PREALLOC_PAGES; i++) {
+ vm_area->pages[i] = alloc_pages(gfp, 0);
+ if (!vm_area->pages[i])
+ goto cleanup_err;
+ vm_area->nr_pages++;
+ }
+
+ addr = (unsigned long)vm_area->addr +
+ (THREAD_DYNAMIC_PAGES << PAGE_SHIFT);
+ end = (unsigned long)vm_area->addr + THREAD_SIZE;
+ err = vmap_pages_range(addr, end, PAGE_KERNEL, vm_area->pages, PAGE_SHIFT);
+ if (err)
+ goto cleanup_err;
+
+ return vm_area;
+cleanup_err:
+ free_vmap_stack(vm_area);
+ return NULL;
+}
+
+static struct page *noinstr dynamic_stack_get_page(void)
+{
+ struct page **pages = this_cpu_ptr(dynamic_stack_pages);
+ int i;
+
+ for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+ struct page *page = pages[i];
+
+ if (!page)
+ continue;
+ pages[i] = NULL;
+ return page;
+ }
+
+ return NULL;
+}
+
+static int dynamic_stack_refill_pages_cpu(unsigned int cpu)
+{
+ struct page **pages = per_cpu_ptr(dynamic_stack_pages, cpu);
+ int i;
+
+ for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+ if (pages[i])
+ continue;
+ pages[i] = alloc_pages(vmap_stack_gfp(false), 0);
+ if (unlikely(!pages[i])) {
+ pr_err("failed to allocate dynamic stack page for cpu[%d]\n",
+ cpu);
+ break;
+ }
+ }
+
+ return 0;
+}
+
+static int dynamic_stack_free_pages_cpu(unsigned int cpu)
+{
+ struct page **pages = per_cpu_ptr(dynamic_stack_pages, cpu);
+ int i;
+
+ for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+ if (!pages[i])
+ continue;
+ __free_page(pages[i]);
+ pages[i] = NULL;
+ }
+
+ return 0;
+}
+
+void dynamic_stack_refill_pages(void)
+{
+ struct page **pages = this_cpu_ptr(dynamic_stack_pages);
+ int i;
+
+ for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+ struct page *page = pages[i];
+
+ if (page)
+ continue;
+
+ /*
+ * This is called during context switch, so we can't take any
+ * sleeping locks. As such, we need to use GFP_ATOMIC.
+ */
+ page = alloc_pages(vmap_stack_gfp(true), 0);
+ if (unlikely(!page))
+ pr_err_ratelimited("failed to refill per-cpu dynamic stack\n");
+ pages[i] = page;
+ }
+}
+
+unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize)
+{
+ struct vm_struct *vm_area = tsk->stack_vm_area;
+ unsigned long nr_accounted, i;
+
+ cant_sleep();
+
+ /* Verify enough low order bits in the page-aligned stack pointer. */
+ BUILD_BUG_ON(THREAD_PREALLOC_PAGES == 0 ||
+ PAGE_SIZE - 1 <= DYNAMIC_STACK_MAX_ACCOUNT_MASK);
+
+ nr_accounted = tsk->packed_stack & DYNAMIC_STACK_MAX_ACCOUNT_MASK;
+
+ if (nr_accounted == DYNAMIC_STACK_MAX_ACCOUNT_MASK) {
+ WARN_ON_ONCE(finalize);
+ return 0;
+ }
+
+ for (i = THREAD_PREALLOC_PAGES + nr_accounted; i < vm_area->nr_pages; i++) {
+ struct page *page = vm_area->pages[i];
+
+ int ret = memcg_kmem_charge_page(page, GFP_ATOMIC, 0);
+ /*
+ * XXX Since stack pages were already allocated, we should never
+ * fail charging. Therefore, we should probably induce force
+ * charge and oom killing if charge fails.
+ */
+ if (unlikely(ret))
+ pr_warn_ratelimited("dynamic stack: charge for allocated page failed\n");
+
+ mod_lruvec_page_state(page, NR_KERNEL_STACK_KB,
+ PAGE_SIZE / 1024);
+ }
+
+ if (finalize) {
+ tsk->packed_stack |= DYNAMIC_STACK_MAX_ACCOUNT_MASK;
+ } else {
+ tsk->packed_stack &= ~DYNAMIC_STACK_MAX_ACCOUNT_MASK;
+ tsk->packed_stack |= (i - THREAD_PREALLOC_PAGES);
+ }
+
+ return i;
+}
+
+bool noinstr dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack)
+{
+ unsigned long stack, hole_end, addr;
+ struct vm_struct *vm_area;
+ struct page *page;
+ int nr_pages;
+ pte_t *pte;
+
+ cant_sleep();
+
+ if (WARN_ON(in_nmi())) {
+ *on_stack = false;
+ return false;
+ }
+
+ /* check if address is inside the kernel stack area */
+ stack = (unsigned long)task_stack_page(tsk);
+ if (address < stack || address >= stack + THREAD_SIZE) {
+ *on_stack = false;
+ return false;
+ }
+ *on_stack = true;
+
+ vm_area = tsk->stack_vm_area;
+ if (WARN_ON_ONCE(!vm_area))
+ return false;
+
+ nr_pages = vm_area->nr_pages;
+
+ /* Check if fault address is within the stack hole */
+ hole_end = stack + THREAD_SIZE - (nr_pages << PAGE_SHIFT);
+ if (address >= hole_end)
+ return false;
+
+ /*
+ * Most likely we faulted in the page right next to the last mapped
+ * page in the stack; however, it is possible (but very unlikely)
+ * that the faulted address actually skips some pages in the stack.
+ * Make sure we do not create more than one hole in the stack, and
+ * map every page between the current fault address and the last
+ * page that is mapped in the stack.
+ */
+ address = PAGE_ALIGN_DOWN(address);
+ for (addr = hole_end - PAGE_SIZE; addr >= address; addr -= PAGE_SIZE) {
+ /* Take the next page from the per-cpu list */
+ page = dynamic_stack_get_page();
+ if (!page) {
+ instrumentation_begin();
+ pr_emerg("Failed to allocate a page during kernel_stack_fault\n");
+ instrumentation_end();
+ return false;
+ }
+
+ /* Add the new page entry to the page table */
+ pte = virt_to_kpte(addr);
+ if (!pte) {
+ instrumentation_begin();
+ pr_emerg("The PTE page table for a kernel stack is not found\n");
+ instrumentation_end();
+ return false;
+ }
+
+ /* Make sure there are no existing mappings at this address */
+ if (pte_present(*pte)) {
+ instrumentation_begin();
+ pr_emerg("The PTE contains a mapping\n");
+ instrumentation_end();
+ return false;
+ }
+ set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+
+ /* Store the new page in the stack's vm_area */
+ vm_area->pages[nr_pages] = page;
+ vm_area->nr_pages = ++nr_pages;
+ }
+
+ /* Refill the pcp stack pages during context switch */
+ tsk->flags |= PF_DYNAMIC_STACK;
+
+ return true;
+}
+
+#else /* !CONFIG_DYNAMIC_STACK */
static inline struct vm_struct *alloc_vmap_stack(int node)
{
void *stack;
- stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, GFP_VMAP_STACK,
+ stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, vmap_stack_gfp(false),
node, __builtin_return_address(0));
return stack ? find_vm_area(stack) : NULL;
@@ -284,6 +573,13 @@ static inline void free_vmap_stack(struct vm_struct *vm_area)
vfree(vm_area->addr);
}
+static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_struct *vm_area)
+{
+ tsk->stack_vm_area = vm_area;
+ tsk->stack = kasan_reset_tag(vm_area->addr);
+}
+#endif /* CONFIG_DYNAMIC_STACK */
+
static void thread_stack_free_work(struct work_struct *work)
{
struct vm_stack *vm_stack = container_of(to_rcu_work(work), struct vm_stack, work);
@@ -300,9 +596,9 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
struct vm_stack *vm_stack;
if (IS_ENABLED(CONFIG_STACK_GROWSUP))
- vm_stack = tsk->stack;
+ vm_stack = task_stack_page(tsk);
else
- vm_stack = tsk->stack + THREAD_SIZE - sizeof(*vm_stack);
+ vm_stack = task_stack_page(tsk) + THREAD_SIZE - sizeof(*vm_stack);
vm_stack->stack_vm_area = tsk->stack_vm_area;
INIT_RCU_WORK(&vm_stack->work, thread_stack_free_work);
@@ -361,14 +657,13 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
/* Reset stack metadata. */
kasan_unpoison_range(vm_area->addr, THREAD_SIZE);
- tsk->stack = kasan_reset_tag(vm_area->addr);
+ link_vmap_stack_to_task(tsk, vm_area);
/* Clear stale pointers from reused stack. */
if (!IS_ENABLED(CONFIG_STACK_GROWSUP))
memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;
- memset(tsk->stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
+ memset(task_stack_page(tsk) + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
- tsk->stack_vm_area = vm_area;
return 0;
}
@@ -380,22 +675,20 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
free_vmap_stack(vm_area);
return -ENOMEM;
}
- /*
- * We can't call find_vm_area() in interrupt context, and
- * free_thread_stack() can be called in interrupt context,
- * so cache the vm_struct.
- */
- tsk->stack_vm_area = vm_area;
- tsk->stack = kasan_reset_tag(vm_area->addr);
+ link_vmap_stack_to_task(tsk, vm_area);
return 0;
}
static void free_thread_stack(struct task_struct *tsk)
{
- if (!try_release_thread_stack_to_cache(tsk->stack_vm_area))
+ if (!try_release_thread_stack_to_cache(task_stack_vm_area(tsk)))
thread_stack_delayed_free(tsk);
+#ifdef CONFIG_DYNAMIC_STACK
+ tsk->packed_stack = 0;
+#else
tsk->stack = NULL;
+#endif
tsk->stack_vm_area = NULL;
}
@@ -498,9 +791,27 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
{
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
struct vm_struct *vm_area = task_stack_vm_area(tsk);
- int i;
+ int i, nr_accounted;
- for (i = 0; i < vm_area->nr_pages; i++)
+#ifdef CONFIG_DYNAMIC_STACK
+ /*
+ * For the exit path, resolve any pending accounting to avoid
+ * underflow. Finalize to skip accounting for any faults that
+ * happen between here and this thread's final __schedule()
+ * call in do_task_dead().
+ */
+ if (account < 0) {
+ preempt_disable();
+ nr_accounted = dynamic_stack_accounting(tsk, true);
+ preempt_enable();
+ } else {
+ nr_accounted = THREAD_PREALLOC_PAGES;
+ }
+#else
+ nr_accounted = vm_area->nr_pages;
+#endif
+
+ for (i = 0; i < nr_accounted; i++)
mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
account * (PAGE_SIZE / 1024));
} else {
@@ -901,6 +1212,16 @@ void __init fork_init(void)
NULL, free_vm_stack_cache);
#endif
+#ifdef CONFIG_DYNAMIC_STACK
+ cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:dynamic_stack",
+ dynamic_stack_refill_pages_cpu,
+ dynamic_stack_free_pages_cpu);
+ /*
+ * Fill the dynamic stack pages for the boot CPU, others will be filled
+ * as CPUs are onlined.
+ */
+ dynamic_stack_refill_pages_cpu(smp_processor_id());
+#endif
scs_init();
lockdep_init_task(&init_task);
@@ -914,6 +1235,7 @@ int __weak arch_dup_task_struct(struct task_struct *dst,
return 0;
}
+#ifndef CONFIG_DYNAMIC_STACK
void set_task_stack_end_magic(struct task_struct *tsk)
{
unsigned long *stackend;
@@ -921,6 +1243,7 @@ void set_task_stack_end_magic(struct task_struct *tsk)
stackend = end_of_stack(tsk);
*stackend = STACK_END_MAGIC; /* for overflow detection */
}
+#endif
static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
{
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 496dff740dca..417269a86973 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6783,6 +6783,7 @@ static void __sched notrace __schedule(int sched_mode)
rq = cpu_rq(cpu);
prev = rq->curr;
+ dynamic_stack(prev);
schedule_debug(prev, preempt);
if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
--
2.54.0.rc2.544.gc7ae2d5bb8-goog