* [PATCH v2 01/13] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 02/13] fork: Don't assume fully populated stack during reuse David Stevens
` (12 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
From: Pasha Tatashin <pasha.tatashin@soleen.com>
In many places, the number of pages in the stack is determined via
(THREAD_SIZE / PAGE_SIZE). There is also a BUG_ON() that ensures that
(THREAD_SIZE / PAGE_SIZE) indeed equals vm_area->nr_pages.
However, with dynamic stacks the number of pages in the vm_area will grow
with the stack, so use vm_area->nr_pages to determine the actual number of
pages allocated for the stack.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Rebased, also skipped intermediary helper variable nr_pages]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David Stevens <stevensd@google.com>
---
kernel/fork.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index bc2bf58b93b6..8961b895bf05 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -312,9 +312,7 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm_area)
int ret;
int nr_charged = 0;
- BUG_ON(vm_area->nr_pages != THREAD_SIZE / PAGE_SIZE);
-
- for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+ for (i = 0; i < vm_area->nr_pages; i++) {
ret = memcg_kmem_charge_page(vm_area->pages[i], GFP_KERNEL, 0);
if (ret)
goto err;
@@ -484,7 +482,7 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
struct vm_struct *vm_area = task_stack_vm_area(tsk);
int i;
- for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+ for (i = 0; i < vm_area->nr_pages; i++)
mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
account * (PAGE_SIZE / 1024));
} else {
@@ -505,7 +503,7 @@ void exit_task_stack_account(struct task_struct *tsk)
int i;
vm_area = task_stack_vm_area(tsk);
- for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+ for (i = 0; i < vm_area->nr_pages; i++)
memcg_kmem_uncharge_page(vm_area->pages[i], 0);
}
}
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 02/13] fork: Don't assume fully populated stack during reuse
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
2026-04-24 19:14 ` [PATCH v2 01/13] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 03/13] fork: Move vm_stack to the beginning of the stack David Stevens
` (11 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
In preparation for dynamic kernel stacks, don't assume that the stack is
fully populated (i.e. that vm_area->nr_pages * PAGE_SIZE equals THREAD_SIZE)
when clearing a stack for reuse.
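For illustration, with a 16K THREAD_SIZE, 4K pages, and a grows-down stack
that has a single populated page at the top, the clearing below works out to
(sketch, not part of the diff):
    memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;  /* 16K - 4K = 12K */
    memset(stack + memset_offset, 0,
           vm_area->nr_pages * PAGE_SIZE);  /* clears only [stack + 12K, stack + 16K) */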
Signed-off-by: David Stevens <stevensd@google.com>
---
kernel/fork.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 8961b895bf05..50772c0cc5da 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -332,6 +332,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
vm_area = alloc_thread_stack_node_from_cache(tsk, node);
if (vm_area) {
+ unsigned long memset_offset = 0;
+
if (memcg_charge_kernel_stack(vm_area)) {
vfree(vm_area->addr);
return -ENOMEM;
@@ -343,7 +345,9 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
stack = kasan_reset_tag(vm_area->addr);
/* Clear stale pointers from reused stack. */
- memset(stack, 0, THREAD_SIZE);
+ if (!IS_ENABLED(CONFIG_STACK_GROWSUP))
+ memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;
+ memset(stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
tsk->stack_vm_area = vm_area;
tsk->stack = stack;
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 03/13] fork: Move vm_stack to the beginning of the stack
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
2026-04-24 19:14 ` [PATCH v2 01/13] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE David Stevens
2026-04-24 19:14 ` [PATCH v2 02/13] fork: Don't assume fully populated stack during reuse David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 04/13] fork: separate vmap stack allocation and free calls David Stevens
` (10 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
The vm_stack struct used to free stacks via an RCU callback is stored
directly in the stack being freed. Make sure it is stored at the in-use
end of the stack regardless of the stack growth direction (i.e. at the
highest address for a grows-down stack), since the unused end of a
partially allocated dynamic stack may not be populated and writing there
would fault.
Signed-off-by: David Stevens <stevensd@google.com>
---
kernel/fork.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 50772c0cc5da..72c081db492c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -282,7 +282,12 @@ static void thread_stack_free_rcu(struct rcu_head *rh)
static void thread_stack_delayed_free(struct task_struct *tsk)
{
- struct vm_stack *vm_stack = tsk->stack;
+ struct vm_stack *vm_stack;
+
+ if (IS_ENABLED(CONFIG_STACK_GROWSUP))
+ vm_stack = tsk->stack;
+ else
+ vm_stack = tsk->stack + THREAD_SIZE - sizeof(*vm_stack);
vm_stack->stack_vm_area = tsk->stack_vm_area;
call_rcu(&vm_stack->rcu, thread_stack_free_rcu);
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 04/13] fork: separate vmap stack allocation and free calls
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (2 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 03/13] fork: Move vm_stack to the beginning of the stack David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 05/13] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public functions David Stevens
` (9 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
From: Pasha Tatashin <pasha.tatashin@soleen.com>
In preparation for dynamic stacks, separate out the __vmalloc_node()
and vfree() calls from the vmap-based stack allocations. The dynamic
stacks will use their own variants of these functions.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Fix a bug in original patch: free_vmap_stack(vm_area->addr)]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[Add missing free_vmap_stack conversion, fix typos, rebase]
Signed-off-by: David Stevens <stevensd@google.com>
---
kernel/fork.c | 40 ++++++++++++++++++++++++----------------
1 file changed, 24 insertions(+), 16 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 72c081db492c..8bf32815f422 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -269,6 +269,21 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
return false;
}
+static inline struct vm_struct *alloc_vmap_stack(int node)
+{
+ void *stack;
+
+ stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, GFP_VMAP_STACK,
+ node, __builtin_return_address(0));
+
+ return stack ? find_vm_area(stack) : NULL;
+}
+
+static inline void free_vmap_stack(struct vm_struct *vm_area)
+{
+ vfree(vm_area->addr);
+}
+
static void thread_stack_free_rcu(struct rcu_head *rh)
{
struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
@@ -277,7 +292,7 @@ static void thread_stack_free_rcu(struct rcu_head *rh)
if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
return;
- vfree(vm_area->addr);
+ free_vmap_stack(vm_area);
}
static void thread_stack_delayed_free(struct task_struct *tsk)
@@ -304,7 +319,7 @@ static int free_vm_stack_cache(unsigned int cpu)
if (!vm_area)
continue;
- vfree(vm_area->addr);
+ free_vmap_stack(vm_area);
cached_vm_stack_areas[i] = NULL;
}
@@ -333,41 +348,35 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm_area)
static int alloc_thread_stack_node(struct task_struct *tsk, int node)
{
struct vm_struct *vm_area;
- void *stack;
vm_area = alloc_thread_stack_node_from_cache(tsk, node);
if (vm_area) {
unsigned long memset_offset = 0;
if (memcg_charge_kernel_stack(vm_area)) {
- vfree(vm_area->addr);
+ free_vmap_stack(vm_area);
return -ENOMEM;
}
/* Reset stack metadata. */
kasan_unpoison_range(vm_area->addr, THREAD_SIZE);
-
- stack = kasan_reset_tag(vm_area->addr);
+ tsk->stack = kasan_reset_tag(vm_area->addr);
/* Clear stale pointers from reused stack. */
if (!IS_ENABLED(CONFIG_STACK_GROWSUP))
memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;
- memset(stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
+ memset(tsk->stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
tsk->stack_vm_area = vm_area;
- tsk->stack = stack;
return 0;
}
- stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN,
- GFP_VMAP_STACK,
- node, __builtin_return_address(0));
- if (!stack)
+ vm_area = alloc_vmap_stack(node);
+ if (!vm_area)
return -ENOMEM;
- vm_area = find_vm_area(stack);
if (memcg_charge_kernel_stack(vm_area)) {
- vfree(stack);
+ free_vmap_stack(vm_area);
return -ENOMEM;
}
/*
@@ -376,8 +385,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
* so cache the vm_struct.
*/
tsk->stack_vm_area = vm_area;
- stack = kasan_reset_tag(stack);
- tsk->stack = stack;
+ tsk->stack = kasan_reset_tag(vm_area->addr);
return 0;
}
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 05/13] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public functions
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (3 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 04/13] fork: separate vmap stack allocation and free calls David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 06/13] fork: Move vmap stack freeing to work queue David Stevens
` (8 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
From: Pasha Tatashin <pasha.tatashin@soleen.com>
get_vm_area_node()
Unlike the other public get_vm_area_* variants, this one accepts the node
from which to allocate the data structure, as well as the alignment, which
allows creating a vm area with a specific alignment.
This call is going to be used by dynamic stacks in order to ensure that
the stack VM area has a specific alignment, so that even if only one page
is mapped, no page table allocations are needed to map the remaining
stack pages.
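For reference, this is roughly how a later patch in this series reserves a
dynamic stack area with it (sketch):
    vm_area = get_vm_area_node(THREAD_SIZE, THREAD_ALIGN, VM_MAP, node,
                               GFP_KERNEL | __GFP_ZERO,
                               __builtin_return_address(0));
    if (!vm_area)
            return NULL;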
vmap_pages_range()
We will need it from kernel/fork.c in order to map the initial stack
pages, so export the function and add a forward declaration of this
function to the linux/vmalloc.h header.
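A sketch of the intended use from fork.c, mapping the pre-allocated pages at
the top of a grows-down stack (THREAD_DYNAMIC_PAGES is defined in a later
patch of this series):
    addr = (unsigned long)vm_area->addr + (THREAD_DYNAMIC_PAGES << PAGE_SHIFT);
    end  = (unsigned long)vm_area->addr + THREAD_SIZE;
    err  = vmap_pages_range(addr, end, PAGE_KERNEL, vm_area->pages, PAGE_SHIFT);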
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[Switched to vmap_pages_range instead of noflush variant, fix typos]
Signed-off-by: David Stevens <stevensd@google.com>
---
include/linux/vmalloc.h | 14 ++++++++++++++
mm/vmalloc.c | 25 +++++++++++++++++++++++++
2 files changed, 39 insertions(+)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index e8e94f90d686..7b56a0b998ab 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -250,6 +250,9 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size,
unsigned long flags,
unsigned long start, unsigned long end,
const void *caller);
+struct vm_struct *get_vm_area_node(unsigned long size, unsigned long align,
+ unsigned long flags, int node, gfp_t gfp,
+ const void *caller);
void free_vm_area(struct vm_struct *area);
extern struct vm_struct *remove_vm_area(const void *addr);
extern struct vm_struct *find_vm_area(const void *addr);
@@ -301,11 +304,22 @@ static inline void set_vm_flush_reset_perms(void *addr)
if (vm)
vm->flags |= VM_FLUSH_RESET_PERMS;
}
+
+int __must_check vmap_pages_range(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, unsigned int page_shift);
+
#else /* !CONFIG_MMU */
#define VMALLOC_TOTAL 0UL
static inline unsigned long vmalloc_nr_pages(void) { return 0; }
static inline void set_vm_flush_reset_perms(void *addr) {}
+static inline
+int __must_check vmap_pages_range(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+ return -EINVAL;
+}
+
#endif /* CONFIG_MMU */
#if defined(CONFIG_MMU) && defined(CONFIG_SMP)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 61caa55a4402..39b7e118cbce 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -722,6 +722,7 @@ int vmap_pages_range(unsigned long addr, unsigned long end,
{
return __vmap_pages_range(addr, end, prot, pages, page_shift, GFP_KERNEL);
}
+EXPORT_SYMBOL_GPL(vmap_pages_range);
static int check_sparse_vm_area(struct vm_struct *area, unsigned long start,
unsigned long end)
@@ -3285,6 +3286,30 @@ struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags,
NUMA_NO_NODE, GFP_KERNEL, caller);
}
+/**
+ * get_vm_area_node - reserve a contiguous and aligned kernel virtual area
+ * @size: size of the area
+ * @align: alignment of the start address of the area
+ * @flags: %VM_IOREMAP for I/O mappings
+ * @node: NUMA node from which to allocate the area data structure
+ * @gfp: Flags to pass to the allocator
+ * @caller: Caller to be stored in the vm area data structure
+ *
+ * Search for an area of @size/@align in the kernel virtual mapping area,
+ * allocating the area descriptor on @node, and reserve it for our
+ * purposes.
+ *
+ * Return: the area descriptor on success or %NULL on failure.
+ */
+struct vm_struct *get_vm_area_node(unsigned long size, unsigned long align,
+ unsigned long flags, int node, gfp_t gfp,
+ const void *caller)
+{
+ return __get_vm_area_node(size, align, PAGE_SHIFT, flags,
+ VMALLOC_START, VMALLOC_END,
+ node, gfp, caller);
+}
+
/**
* find_vm_area - find a continuous kernel virtual area
* @addr: base address
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 06/13] fork: Move vmap stack freeing to work queue
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (4 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 05/13] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public functions David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 07/13] fork: Dynamic Kernel Stacks David Stevens
` (7 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
For vmap stacks that are not immediately released into the stack cache,
free them from a workqueue instead of via call_rcu(). In an RCU callback
context, vfree() already schedules the actual freeing on the per-cpu system
workqueue, so this change only affects exactly when the second attempt to
put the stack into the stack cache occurs.
Moving the freeing to a workqueue will allow dynamic stacks to be freed in
a sleepable context (needed for remove_vm_area()), rather than relying on
vfree() dispatching to a workqueue via vfree_atomic().
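For reference, the before/after pattern (sketch; names as in the hunk below):
    /* Before: the callback runs from RCU (softirq) context after a grace period. */
    call_rcu(&vm_stack->rcu, thread_stack_free_rcu);

    /* After: the work item still runs only after an RCU grace period has
     * elapsed, but in process context on the system workqueue, where it
     * may sleep. */
    INIT_RCU_WORK(&vm_stack->work, thread_stack_free_work);
    queue_rcu_work(system_wq, &vm_stack->work);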
Signed-off-by: David Stevens <stevensd@google.com>
---
kernel/fork.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 8bf32815f422..01e0bf4f4b02 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -205,7 +205,7 @@ static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
#define GFP_VMAP_STACK (GFP_KERNEL | __GFP_ZERO)
struct vm_stack {
- struct rcu_head rcu;
+ struct rcu_work work;
struct vm_struct *stack_vm_area;
};
@@ -284,9 +284,9 @@ static inline void free_vmap_stack(struct vm_struct *vm_area)
vfree(vm_area->addr);
}
-static void thread_stack_free_rcu(struct rcu_head *rh)
+static void thread_stack_free_work(struct work_struct *work)
{
- struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
+ struct vm_stack *vm_stack = container_of(to_rcu_work(work), struct vm_stack, work);
struct vm_struct *vm_area = vm_stack->stack_vm_area;
if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
@@ -305,7 +305,8 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
vm_stack = tsk->stack + THREAD_SIZE - sizeof(*vm_stack);
vm_stack->stack_vm_area = tsk->stack_vm_area;
- call_rcu(&vm_stack->rcu, thread_stack_free_rcu);
+ INIT_RCU_WORK(&vm_stack->work, thread_stack_free_work);
+ queue_rcu_work(system_wq, &vm_stack->work);
}
static int free_vm_stack_cache(unsigned int cpu)
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 07/13] fork: Dynamic Kernel Stacks
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (5 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 06/13] fork: Move vmap stack freeing to work queue David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 08/13] task_stack.h: Add stack_not_used() support for dynamic stack David Stevens
` (6 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
From: Pasha Tatashin <pasha.tatashin@soleen.com>
The core implementation of dynamic kernel stacks.
Unlike traditional kernel stacks, these stacks grow automatically as they
are used. This saves a significant amount of memory in fleet environments.
It also potentially allows the default kernel stack size to be increased
to prevent stack overflows, without increasing the overall memory
overhead.
The dynamic kernel stack interface provides two global functions:
1. dynamic_stack_fault()
Architectures that support dynamic kernel stacks must call this function
to handle a fault on the stack. It allocates and maps new pages into the
stack. The pages are maintained in a per-cpu data structure.
2. dynamic_stack()
Must be called as a thread leaves the CPU, to check whether the thread has
allocated dynamic stack pages (i.e. PF_DYNAMIC_STACK is set in tsk->flags).
If so, two things need to be done (a rough sketch of both call sites
follows below):
a. Charge the thread for the allocated stack pages.
b. Refill the per-cpu array so the next thread can also fault.
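A rough sketch of the two call sites (illustrative only;
handle_kernel_stack_fault() is a hypothetical name, the actual x86 wiring
is added later in this series, and the scheduler hook matches the
kernel/sched/core.c hunk below):
    /* Arch fault-handler side (hypothetical helper name): */
    static bool handle_kernel_stack_fault(unsigned long address)
    {
            bool on_stack;

            /* Maps the missing stack page(s) from the per-cpu pool. */
            if (dynamic_stack_fault(current, address, &on_stack))
                    return true;    /* handled, retry the faulting access */

            /* on_stack reports whether the address was on this task's stack. */
            return false;
    }

    /* Scheduler side, as the previous task leaves the CPU: */
    prev = rq->curr;
    dynamic_stack(prev);    /* charge faulted pages and refill the pool */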
Dynamic kernel stacks do not support STACK_END_MAGIC, as the last page
does not have to be faulted in. However, since they are based on vmap
stacks, the guard pages still protect the dynamic kernel stacks from
overflow.
The average stack depth of a kernel thread depends on the workload,
profiling, virtualization, compiler optimizations, and driver
implementations. Therefore, the numbers should be measured for each
specific workload. From my tests, I found the following values on freshly
booted, idle machines:
CPU            #Cores  #Stacks  Regular(kb)  Dynamic(kb)
AMD Genoa         384     5786        92576        23388
Intel Skylake     112     3182        50912        12860
AMD Rome          128     3401        54416        14784
AMD Rome          256     4908        78528        20876
Intel Haswell      72     2644        42304        10624
On all machines, dynamic kernel stacks take about 25% of the original
stack memory. Only 5% of active tasks performed a stack page fault during
their lifetime.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Rebased, used vm_area->nr_pages directly in one instance]
[Depends on !PREEMPT_RT]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[Fix races around accounting]
[Use GFP_ATOMIC when executing in the scheduler]
[Depend on INIT_STACK_ALL_* config]
[Fix bugs in some error paths and edge cases]
[Don't cache partially faulted stacks]
[Added out-var to tell if address is on target stack]
Signed-off-by: David Stevens <stevensd@google.com>
---
arch/Kconfig | 39 ++++
include/linux/sched.h | 11 +-
include/linux/sched/task_stack.h | 47 +++-
init/init_task.c | 4 +
kernel/fork.c | 357 +++++++++++++++++++++++++++++--
kernel/sched/core.c | 1 +
6 files changed, 439 insertions(+), 20 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 102ddbd4298e..95ded79f0825 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1515,6 +1515,45 @@ config VMAP_STACK
backing virtual mappings with real shadow memory, and KASAN_VMALLOC
must be enabled.
+config HAVE_ARCH_DYNAMIC_STACK
+ def_bool n
+ help
+ An arch should select this symbol if it can support kernel stacks
+ that grow dynamically.
+
+ - Arch must have support for HAVE_ARCH_VMAP_STACK, in order to handle
+ stack related page faults.
+
+ - Arch must be able to fault from interrupt context.
+
+ - Arch must allow the kernel to handle stack faults gracefully, even
+ during interrupt handling.
+
+ - Exceptions such as no pages being available must be handled in a
+ consistent and predictable way, i.e. the same way as a stack
+ overflow into the guard pages, but with extra information about
+ the allocation error.
+
+config DYNAMIC_STACK
+ default y
+ bool "Dynamically grow kernel stacks"
+ depends on THREAD_INFO_IN_TASK
+ depends on HAVE_ARCH_DYNAMIC_STACK
+ depends on VMAP_STACK
+ depends on INIT_STACK_ALL_ZERO || INIT_STACK_ALL_PATTERN
+ depends on !KASAN
+ depends on !DEBUG_STACK_USAGE
+ depends on !STACK_GROWSUP
+ depends on !PREEMPT_RT
+ help
+ Dynamic kernel stacks save memory on machines with many threads by
+ starting with small stacks and growing them only when needed. On
+ workloads where most stacks never grow beyond one page, the memory
+ savings can be substantial. The feature requires virtually mapped
+ kernel stacks in order to handle page faults. It also requires stack
+ initialization to preclude one thread from faulting on another
+ thread's stack.
+
config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
def_bool n
help
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5a5d3dbc9cdf..7aa06233afd5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -836,7 +836,11 @@ struct task_struct {
*/
randomized_struct_fields_start
+#ifdef CONFIG_DYNAMIC_STACK
+ unsigned long packed_stack;
+#else
void *stack;
+#endif
refcount_t usage;
/* Per task flags (PF_*), defined further below: */
unsigned int flags;
@@ -1563,6 +1567,11 @@ struct task_struct {
struct timer_list oom_reaper_timer;
#endif
#ifdef CONFIG_VMAP_STACK
+ /*
+ * We can't call find_vm_area() in interrupt context, and
+ * free_thread_stack() can be called in interrupt context,
+ * so cache the vm_struct.
+ */
struct vm_struct *stack_vm_area;
#endif
#ifdef CONFIG_THREAD_INFO_IN_TASK
@@ -1773,7 +1782,7 @@ extern struct pid *cad_pid;
* I am cleaning dirty pages from some other bdi. */
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
-#define PF__HOLE__00800000 0x00800000
+#define PF_DYNAMIC_STACK 0x00800000 /* This thread allocated dynamic stack pages */
#define PF__HOLE__01000000 0x01000000
#define PF__HOLE__02000000 0x02000000
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 1fab7e9043a3..7dcff2836d7e 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -13,6 +13,10 @@
#ifdef CONFIG_THREAD_INFO_IN_TASK
+#ifdef CONFIG_DYNAMIC_STACK
+#define DYNAMIC_STACK_MAX_ACCOUNT_MASK ((1 << (THREAD_SIZE_ORDER + 1)) - 1)
+#endif
+
/*
* When accessing the stack of a non-current task that might exit, use
* try_get_task_stack() instead. task_stack_page will return a pointer
@@ -20,7 +24,11 @@
*/
static __always_inline void *task_stack_page(const struct task_struct *task)
{
+#ifdef CONFIG_DYNAMIC_STACK
+ return (void *)(task->packed_stack & ~DYNAMIC_STACK_MAX_ACCOUNT_MASK);
+#else
return task->stack;
+#endif
}
#define setup_thread_stack(new,old) do { } while(0)
@@ -30,7 +38,7 @@ static __always_inline unsigned long *end_of_stack(const struct task_struct *tas
#ifdef CONFIG_STACK_GROWSUP
return (unsigned long *)((unsigned long)task->stack + THREAD_SIZE) - 1;
#else
- return task->stack;
+ return task_stack_page(task);
#endif
}
@@ -83,9 +91,45 @@ static inline void put_task_stack(struct task_struct *tsk) {}
void exit_task_stack_account(struct task_struct *tsk);
+#ifdef CONFIG_DYNAMIC_STACK
+
+#define task_stack_end_corrupted(task) 0
+
+#ifndef THREAD_PREALLOC_PAGES
+#define THREAD_PREALLOC_PAGES 1
+#endif
+
+#define THREAD_DYNAMIC_PAGES \
+ ((THREAD_SIZE >> PAGE_SHIFT) - THREAD_PREALLOC_PAGES)
+
+void dynamic_stack_refill_pages(void);
+unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize);
+bool dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack);
+
+/*
+ * Refill and charge for the used pages.
+ */
+static inline void dynamic_stack(struct task_struct *tsk)
+{
+ if (unlikely(tsk->flags & PF_DYNAMIC_STACK)) {
+ dynamic_stack_refill_pages();
+ dynamic_stack_accounting(tsk, false);
+ tsk->flags &= ~PF_DYNAMIC_STACK;
+ }
+}
+
+static inline void set_task_stack_end_magic(struct task_struct *tsk) {}
+
+#else /* !CONFIG_DYNAMIC_STACK */
+
#define task_stack_end_corrupted(task) \
(*(end_of_stack(task)) != STACK_END_MAGIC)
+void set_task_stack_end_magic(struct task_struct *tsk);
+static inline void dynamic_stack(struct task_struct *tsk) {}
+
+#endif /* CONFIG_DYNAMIC_STACK */
+
static inline int object_is_on_stack(const void *obj)
{
void *stack = task_stack_page(current);
@@ -104,7 +148,6 @@ static inline unsigned long stack_not_used(struct task_struct *p)
return 0;
}
#endif
-extern void set_task_stack_end_magic(struct task_struct *tsk);
static inline int kstack_end(void *addr)
{
diff --git a/init/init_task.c b/init/init_task.c
index 5c838757fc10..e3645ec4ab02 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -99,7 +99,11 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.stack_refcount = REFCOUNT_INIT(1),
#endif
.__state = 0,
+#ifdef CONFIG_DYNAMIC_STACK
+ .packed_stack = (unsigned long)init_stack,
+#else
.stack = init_stack,
+#endif
.usage = REFCOUNT_INIT(2),
.flags = PF_KTHREAD,
.prio = MAX_PRIO - 20,
diff --git a/kernel/fork.c b/kernel/fork.c
index 01e0bf4f4b02..e615ef736dc0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -202,7 +202,10 @@ static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
* accounting is performed by the code assigning/releasing stacks to tasks.
* We need a zeroed memory without __GFP_ACCOUNT.
*/
-#define GFP_VMAP_STACK (GFP_KERNEL | __GFP_ZERO)
+static gfp_t vmap_stack_gfp(bool is_atomic)
+{
+ return (is_atomic ? GFP_ATOMIC : GFP_KERNEL) | __GFP_ZERO;
+}
struct vm_stack {
struct rcu_work work;
@@ -241,6 +244,18 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
unsigned int i;
int nid;
+#ifdef CONFIG_DYNAMIC_STACK
+ /*
+ * Skip the cache for populated dynamic stacks to avoid punishing a
+ * memcg with a larger charge just because it happened to pick up a
+ * dynamic stack that's been partially faulted in. We may get a lower
+ * number of cache hits, but stacks with dynamically faulted pages
+ * should be fairly uncommon.
+ */
+ if (vm_area->nr_pages != THREAD_PREALLOC_PAGES)
+ return false;
+#endif /* CONFIG_DYNAMIC_STACK */
+
/*
* Don't cache stacks if any of the pages don't match the local domain, unless
* there is no local memory to begin with.
@@ -269,11 +284,285 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
return false;
}
+#ifdef CONFIG_DYNAMIC_STACK
+
+/*
+ * There is a window between when a thread refills the page pool and when it
+ * actually gets scheduled out where it can still consume pages from the pool.
+ * To guarantee the next thread has enough pages to fully populate its stack,
+ * double the size of the page pool.
+ */
+#define DYNSTK_PAGE_POOL_NR (THREAD_DYNAMIC_PAGES * 2)
+
+static DEFINE_PER_CPU(struct page *, dynamic_stack_pages[DYNSTK_PAGE_POOL_NR]);
+
+static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_struct *vm_area)
+{
+ tsk->stack_vm_area = vm_area;
+ tsk->packed_stack = (unsigned long)kasan_reset_tag(vm_area->addr);
+}
+
+static void free_vmap_stack(struct vm_struct *vm_area)
+{
+ int i;
+
+ remove_vm_area(vm_area->addr);
+
+ for (i = 0; i < vm_area->nr_pages; i++)
+ __free_page(vm_area->pages[i]);
+
+ kfree(vm_area->pages);
+ kfree(vm_area);
+}
+
+static struct vm_struct *alloc_vmap_stack(int node)
+{
+ gfp_t gfp = vmap_stack_gfp(false);
+ unsigned long addr, end;
+ struct vm_struct *vm_area;
+ int err, i;
+
+ /*
+ * Paranoid check to guarantee we never straddle a page table, so
+ * that virt_to_kpte() is always valid in dynamic_stack_fault().
+ */
+ BUILD_BUG_ON((PMD_SIZE % THREAD_SIZE) || (THREAD_ALIGN % THREAD_SIZE));
+
+ vm_area = get_vm_area_node(THREAD_SIZE, THREAD_ALIGN, VM_MAP, node,
+ gfp, __builtin_return_address(0));
+ if (!vm_area)
+ return NULL;
+
+ vm_area->pages = kmalloc_node(sizeof(void *) *
+ (THREAD_SIZE >> PAGE_SHIFT), gfp, node);
+ if (!vm_area->pages)
+ goto cleanup_err;
+
+ for (i = 0; i < THREAD_PREALLOC_PAGES; i++) {
+ vm_area->pages[i] = alloc_pages(gfp, 0);
+ if (!vm_area->pages[i])
+ goto cleanup_err;
+ vm_area->nr_pages++;
+ }
+
+ addr = (unsigned long)vm_area->addr +
+ (THREAD_DYNAMIC_PAGES << PAGE_SHIFT);
+ end = (unsigned long)vm_area->addr + THREAD_SIZE;
+ err = vmap_pages_range(addr, end, PAGE_KERNEL, vm_area->pages, PAGE_SHIFT);
+ if (err)
+ goto cleanup_err;
+
+ return vm_area;
+cleanup_err:
+ free_vmap_stack(vm_area);
+ return NULL;
+}
+
+static struct page *noinstr dynamic_stack_get_page(void)
+{
+ struct page **pages = this_cpu_ptr(dynamic_stack_pages);
+ int i;
+
+ for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+ struct page *page = pages[i];
+
+ if (!page)
+ continue;
+ pages[i] = NULL;
+ return page;
+ }
+
+ return NULL;
+}
+
+static int dynamic_stack_refill_pages_cpu(unsigned int cpu)
+{
+ struct page **pages = per_cpu_ptr(dynamic_stack_pages, cpu);
+ int i;
+
+ for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+ if (pages[i])
+ continue;
+ pages[i] = alloc_pages(vmap_stack_gfp(false), 0);
+ if (unlikely(!pages[i])) {
+ pr_err("failed to allocate dynamic stack page for cpu[%d]\n",
+ cpu);
+ break;
+ }
+ }
+
+ return 0;
+}
+
+static int dynamic_stack_free_pages_cpu(unsigned int cpu)
+{
+ struct page **pages = per_cpu_ptr(dynamic_stack_pages, cpu);
+ int i;
+
+ for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+ if (!pages[i])
+ continue;
+ __free_page(pages[i]);
+ pages[i] = NULL;
+ }
+
+ return 0;
+}
+
+void dynamic_stack_refill_pages(void)
+{
+ struct page **pages = this_cpu_ptr(dynamic_stack_pages);
+ int i;
+
+ for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+ struct page *page = pages[i];
+
+ if (page)
+ continue;
+
+ /*
+ * This is called during context switch, so we can't take any
+ * sleeping locks. As such, we need to use GFP_ATOMIC.
+ */
+ page = alloc_pages(vmap_stack_gfp(true), 0);
+ if (unlikely(!page))
+ pr_err_ratelimited("failed to refill per-cpu dynamic stack\n");
+ pages[i] = page;
+ }
+}
+
+unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize)
+{
+ struct vm_struct *vm_area = tsk->stack_vm_area;
+ unsigned long nr_accounted, i;
+
+ cant_sleep();
+
+ /* Verify enough low order bits in the page-aligned stack pointer. */
+ BUILD_BUG_ON(THREAD_PREALLOC_PAGES == 0 ||
+ PAGE_SIZE - 1 <= DYNAMIC_STACK_MAX_ACCOUNT_MASK);
+
+ nr_accounted = tsk->packed_stack & DYNAMIC_STACK_MAX_ACCOUNT_MASK;
+
+ if (nr_accounted == DYNAMIC_STACK_MAX_ACCOUNT_MASK) {
+ WARN_ON_ONCE(finalize);
+ return 0;
+ }
+
+ for (i = THREAD_PREALLOC_PAGES + nr_accounted; i < vm_area->nr_pages; i++) {
+ struct page *page = vm_area->pages[i];
+
+ int ret = memcg_kmem_charge_page(page, GFP_ATOMIC, 0);
+ /*
+ * XXX Since stack pages were already allocated, we should never
+ * fail charging. Therefore, we should probably induce force
+ * charge and oom killing if charge fails.
+ */
+ if (unlikely(ret))
+ pr_warn_ratelimited("dynamic stack: charge for allocated page failed\n");
+
+ mod_lruvec_page_state(page, NR_KERNEL_STACK_KB,
+ PAGE_SIZE / 1024);
+ }
+
+ if (finalize) {
+ tsk->packed_stack |= DYNAMIC_STACK_MAX_ACCOUNT_MASK;
+ } else {
+ tsk->packed_stack &= ~DYNAMIC_STACK_MAX_ACCOUNT_MASK;
+ tsk->packed_stack |= (i - THREAD_PREALLOC_PAGES);
+ }
+
+ return i;
+}
+
+bool noinstr dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack)
+{
+ unsigned long stack, hole_end, addr;
+ struct vm_struct *vm_area;
+ struct page *page;
+ int nr_pages;
+ pte_t *pte;
+
+ cant_sleep();
+
+ if (WARN_ON(in_nmi())) {
+ *on_stack = false;
+ return false;
+ }
+
+ /* check if address is inside the kernel stack area */
+ stack = (unsigned long)task_stack_page(tsk);
+ if (address < stack || address >= stack + THREAD_SIZE) {
+ *on_stack = false;
+ return false;
+ }
+ *on_stack = true;
+
+ vm_area = tsk->stack_vm_area;
+ if (WARN_ON_ONCE(!vm_area))
+ return false;
+
+ nr_pages = vm_area->nr_pages;
+
+ /* Check if fault address is within the stack hole */
+ hole_end = stack + THREAD_SIZE - (nr_pages << PAGE_SHIFT);
+ if (address >= hole_end)
+ return false;
+
+ /*
+ * Most likely we faulted in the page right next to the last mapped
+ * page in the stack; however, it is possible (but very unlikely) that
+ * the faulting address actually skips some pages in the stack. Make sure
+ * we do not create more than one hole in the stack, and map every
+ * page between the current fault address and the last page that is
+ * mapped in the stack.
+ */
+ address = PAGE_ALIGN_DOWN(address);
+ for (addr = hole_end - PAGE_SIZE; addr >= address; addr -= PAGE_SIZE) {
+ /* Take the next page from the per-cpu list */
+ page = dynamic_stack_get_page();
+ if (!page) {
+ instrumentation_begin();
+ pr_emerg("Failed to allocate a page during kernel_stack_fault\n");
+ instrumentation_end();
+ return false;
+ }
+
+ /* Add the new page entry to the page table */
+ pte = virt_to_kpte(addr);
+ if (!pte) {
+ instrumentation_begin();
+ pr_emerg("The PTE page table for a kernel stack is not found\n");
+ instrumentation_end();
+ return false;
+ }
+
+ /* Make sure there are no existing mappings at this address */
+ if (pte_present(*pte)) {
+ instrumentation_begin();
+ pr_emerg("The PTE contains a mapping\n");
+ instrumentation_end();
+ return false;
+ }
+ set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+
+ /* Store the new page in the stack's vm_area */
+ vm_area->pages[nr_pages] = page;
+ vm_area->nr_pages = ++nr_pages;
+ }
+
+ /* Refill the pcp stack pages during context switch */
+ tsk->flags |= PF_DYNAMIC_STACK;
+
+ return true;
+}
+
+#else /* !CONFIG_DYNAMIC_STACK */
static inline struct vm_struct *alloc_vmap_stack(int node)
{
void *stack;
- stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, GFP_VMAP_STACK,
+ stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, vmap_stack_gfp(false),
node, __builtin_return_address(0));
return stack ? find_vm_area(stack) : NULL;
@@ -284,6 +573,13 @@ static inline void free_vmap_stack(struct vm_struct *vm_area)
vfree(vm_area->addr);
}
+static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_struct *vm_area)
+{
+ tsk->stack_vm_area = vm_area;
+ tsk->stack = kasan_reset_tag(vm_area->addr);
+}
+#endif /* CONFIG_DYNAMIC_STACK */
+
static void thread_stack_free_work(struct work_struct *work)
{
struct vm_stack *vm_stack = container_of(to_rcu_work(work), struct vm_stack, work);
@@ -300,9 +596,9 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
struct vm_stack *vm_stack;
if (IS_ENABLED(CONFIG_STACK_GROWSUP))
- vm_stack = tsk->stack;
+ vm_stack = task_stack_page(tsk);
else
- vm_stack = tsk->stack + THREAD_SIZE - sizeof(*vm_stack);
+ vm_stack = task_stack_page(tsk) + THREAD_SIZE - sizeof(*vm_stack);
vm_stack->stack_vm_area = tsk->stack_vm_area;
INIT_RCU_WORK(&vm_stack->work, thread_stack_free_work);
@@ -361,14 +657,13 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
/* Reset stack metadata. */
kasan_unpoison_range(vm_area->addr, THREAD_SIZE);
- tsk->stack = kasan_reset_tag(vm_area->addr);
+ link_vmap_stack_to_task(tsk, vm_area);
/* Clear stale pointers from reused stack. */
if (!IS_ENABLED(CONFIG_STACK_GROWSUP))
memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;
- memset(tsk->stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
+ memset(task_stack_page(tsk) + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
- tsk->stack_vm_area = vm_area;
return 0;
}
@@ -380,22 +675,20 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
free_vmap_stack(vm_area);
return -ENOMEM;
}
- /*
- * We can't call find_vm_area() in interrupt context, and
- * free_thread_stack() can be called in interrupt context,
- * so cache the vm_struct.
- */
- tsk->stack_vm_area = vm_area;
- tsk->stack = kasan_reset_tag(vm_area->addr);
+ link_vmap_stack_to_task(tsk, vm_area);
return 0;
}
static void free_thread_stack(struct task_struct *tsk)
{
- if (!try_release_thread_stack_to_cache(tsk->stack_vm_area))
+ if (!try_release_thread_stack_to_cache(task_stack_vm_area(tsk)))
thread_stack_delayed_free(tsk);
+#ifdef CONFIG_DYNAMIC_STACK
+ tsk->packed_stack = 0;
+#else
tsk->stack = NULL;
+#endif
tsk->stack_vm_area = NULL;
}
@@ -498,9 +791,27 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
{
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
struct vm_struct *vm_area = task_stack_vm_area(tsk);
- int i;
+ int i, nr_accounted;
- for (i = 0; i < vm_area->nr_pages; i++)
+#ifdef CONFIG_DYNAMIC_STACK
+ /*
+ * For the exit path, resolve any pending accounting to avoid
+ * underflow. Finalize to skip accounting for any faults that
+ * happen between here and this thread's final __schedule()
+ * call in do_task_dead().
+ */
+ if (account < 0) {
+ preempt_disable();
+ nr_accounted = dynamic_stack_accounting(tsk, true);
+ preempt_enable();
+ } else {
+ nr_accounted = THREAD_PREALLOC_PAGES;
+ }
+#else
+ nr_accounted = vm_area->nr_pages;
+#endif
+
+ for (i = 0; i < nr_accounted; i++)
mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
account * (PAGE_SIZE / 1024));
} else {
@@ -901,6 +1212,16 @@ void __init fork_init(void)
NULL, free_vm_stack_cache);
#endif
+#ifdef CONFIG_DYNAMIC_STACK
+ cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:dynamic_stack",
+ dynamic_stack_refill_pages_cpu,
+ dynamic_stack_free_pages_cpu);
+ /*
+ * Fill the dynamic stack pages for the boot CPU, others will be filled
+ * as CPUs are onlined.
+ */
+ dynamic_stack_refill_pages_cpu(smp_processor_id());
+#endif
scs_init();
lockdep_init_task(&init_task);
@@ -914,6 +1235,7 @@ int __weak arch_dup_task_struct(struct task_struct *dst,
return 0;
}
+#ifndef CONFIG_DYNAMIC_STACK
void set_task_stack_end_magic(struct task_struct *tsk)
{
unsigned long *stackend;
@@ -921,6 +1243,7 @@ void set_task_stack_end_magic(struct task_struct *tsk)
stackend = end_of_stack(tsk);
*stackend = STACK_END_MAGIC; /* for overflow detection */
}
+#endif
static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
{
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 496dff740dca..417269a86973 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6783,6 +6783,7 @@ static void __sched notrace __schedule(int sched_mode)
rq = cpu_rq(cpu);
prev = rq->curr;
+ dynamic_stack(prev);
schedule_debug(prev, preempt);
if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 08/13] task_stack.h: Add stack_not_used() support for dynamic stack
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (6 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 07/13] fork: Dynamic Kernel Stacks David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 09/13] fork: Dynamic Kernel Stack accounting David Stevens
` (5 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
From: Pasha Tatashin <pasha.tatashin@soleen.com>
CONFIG_DEBUG_STACK_USAGE is enabled by default on most architectures.
Its purpose is to determine and print the maximum stack depth on thread
exit.
It works by starting from the bottom of the stack and searching for the
first non-zero word. With dynamic stacks this does not work well, as it
would fault in every page of every stack.
Instead, add a specific version of stack_not_used() for dynamic stacks
which, instead of starting from the bottom of the stack, starts from the
lowest page that is actually mapped.
In addition to avoiding unnecessary page faults, this also optimizes the
search by skipping the pages that were never faulted in.
Also, because a dynamic stack does not end with STACK_END_MAGIC, there is
no need to skip the bottom-most word of the stack.
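For example, assuming 4K pages and a 16K THREAD_SIZE, a dynamic stack with a
single mapped page starts the search at the last page instead of at the stack
base (sketch of the hunk below):
    alloc_size = vm_area->nr_pages << PAGE_SHIFT;             /* 1 << 12 = 4K */
    n = (unsigned long *)(stack + THREAD_SIZE - alloc_size);  /* stack + 12K  */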
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Rebased, Kasan oneliner needed preserving, rewrote a bit due to bugs]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[Handle init_task's use of init_stack, fix typos]
Signed-off-by: David Stevens <stevensd@google.com>
---
arch/Kconfig | 1 -
kernel/exit.c | 22 ++++++++++++++++++++++
2 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 95ded79f0825..beffe7e01296 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1542,7 +1542,6 @@ config DYNAMIC_STACK
depends on VMAP_STACK
depends on INIT_STACK_ALL_ZERO || INIT_STACK_ALL_PATTERN
depends on !KASAN
- depends on !DEBUG_STACK_USAGE
depends on !STACK_GROWSUP
depends on !PREEMPT_RT
help
diff --git a/kernel/exit.c b/kernel/exit.c
index ede3117fa7d4..6caf4030e8f4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -71,6 +71,7 @@
#include <linux/unwind_deferred.h>
#include <linux/uaccess.h>
#include <linux/pidfs.h>
+#include <linux/vmalloc.h>
#include <uapi/linux/wait.h>
@@ -791,6 +792,26 @@ unsigned long stack_not_used(struct task_struct *p)
return (unsigned long)end_of_stack(p) - (unsigned long)n;
}
#else /* !CONFIG_STACK_GROWSUP */
+#ifdef CONFIG_DYNAMIC_STACK
+unsigned long stack_not_used(struct task_struct *p)
+{
+ struct vm_struct *vm_area = task_stack_vm_area(p);
+ unsigned long stack = (unsigned long)task_stack_page(p);
+ unsigned long alloc_size, *n;
+
+ /* This is NULL only for init_task, where init_stack is fully allocated. */
+ if (likely(vm_area))
+ alloc_size = vm_area->nr_pages << PAGE_SHIFT;
+ else
+ alloc_size = THREAD_SIZE;
+ n = (unsigned long *)(stack + THREAD_SIZE - alloc_size);
+
+ while (!*n)
+ n++;
+
+ return (unsigned long)n - stack;
+}
+#else
unsigned long stack_not_used(struct task_struct *p)
{
unsigned long *n = end_of_stack(p);
@@ -801,6 +822,7 @@ unsigned long stack_not_used(struct task_struct *p)
return (unsigned long)n - (unsigned long)end_of_stack(p);
}
+#endif /* CONFIG_DYNAMIC_STACK */
#endif /* CONFIG_STACK_GROWSUP */
/* Count the maximum pages reached in kernel stacks */
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 09/13] fork: Dynamic Kernel Stack accounting
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (7 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 08/13] task_stack.h: Add stack_not_used() support for dynamic stack David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes David Stevens
` (4 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
From: Pasha Tatashin <pasha.tatashin@soleen.com>
Add accounting of the number of stack pages that have been faulted in
and are currently in use.
Example use case:
$ cat /proc/vmstat | grep stack
nr_kernel_stack 18684
nr_dynamic_stacks_faults 156
The above shows that kernel stacks use a total of 18684 KiB, of which
156 KiB were faulted in.
Given that the pre-allocated part of each stack is 4 KiB, we can determine
the total number of tasks:
tasks = (nr_kernel_stack - nr_dynamic_stacks_faults) / 4 = 4632
The amount of kernel stack memory without dynamic stacks on this machine
would be:
4632 * 16 KiB = 74,112 KiB
Therefore, in this example dynamic stacks save 74,112 KiB - 18,684 KiB =
55,428 KiB.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Rebased]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[add to memcg stats, fix typos]
Signed-off-by: David Stevens <stevensd@google.com>
---
include/linux/mmzone.h | 3 +++
kernel/fork.c | 12 +++++++++++-
mm/memcontrol.c | 10 ++++++++++
mm/vmstat.c | 3 +++
4 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4..4458fa7016a1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -221,6 +221,9 @@ enum node_stat_item {
NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */
NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */
NR_KERNEL_STACK_KB, /* measured in KiB */
+#ifdef CONFIG_DYNAMIC_STACK
+ NR_DYNAMIC_STACKS_FAULTS_KB, /* KiB of faulted kernel stack memory */
+#endif
#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
NR_KERNEL_SCS_KB, /* measured in KiB */
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index e615ef736dc0..9ac9d23f5f4b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -463,6 +463,8 @@ unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize)
mod_lruvec_page_state(page, NR_KERNEL_STACK_KB,
PAGE_SIZE / 1024);
+ mod_lruvec_page_state(page, NR_DYNAMIC_STACKS_FAULTS_KB,
+ PAGE_SIZE / 1024);
}
if (finalize) {
@@ -811,9 +813,17 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
nr_accounted = vm_area->nr_pages;
#endif
- for (i = 0; i < nr_accounted; i++)
+ for (i = 0; i < nr_accounted; i++) {
mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
account * (PAGE_SIZE / 1024));
+#ifdef CONFIG_DYNAMIC_STACK
+ if (i >= THREAD_PREALLOC_PAGES) {
+ mod_lruvec_page_state(vm_area->pages[i],
+ NR_DYNAMIC_STACKS_FAULTS_KB,
+ account * (PAGE_SIZE / 1024));
+ }
+#endif
+ }
} else {
void *stack = task_stack_page(tsk);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 772bac21d155..cd2195a735ab 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -318,6 +318,9 @@ static const unsigned int memcg_node_stat_items[] = {
NR_FILE_THPS,
NR_ANON_THPS,
NR_KERNEL_STACK_KB,
+#ifdef CONFIG_DYNAMIC_STACK
+ NR_DYNAMIC_STACKS_FAULTS_KB,
+#endif
NR_PAGETABLE,
NR_SECONDARY_PAGETABLE,
#ifdef CONFIG_SWAP
@@ -1403,6 +1406,10 @@ static const struct memory_stat memory_stats[] = {
#ifdef CONFIG_NUMA_BALANCING
{ "pgpromote_success", PGPROMOTE_SUCCESS },
#endif
+
+#ifdef CONFIG_DYNAMIC_STACK
+ { "dynamic_stack_faults", NR_DYNAMIC_STACKS_FAULTS_KB },
+#endif
};
/* The actual unit of the state item, not the same as the output unit */
@@ -1415,6 +1422,9 @@ static int memcg_page_state_unit(int item)
case NR_SLAB_UNRECLAIMABLE_B:
return 1;
case NR_KERNEL_STACK_KB:
+#ifdef CONFIG_DYNAMIC_STACK
+ case NR_DYNAMIC_STACKS_FAULTS_KB:
+#endif
return SZ_1K;
default:
return PAGE_SIZE;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 86b14b0f77b5..8fa1c7bcbaea 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1256,6 +1256,9 @@ const char * const vmstat_text[] = {
[I(NR_FOLL_PIN_ACQUIRED)] = "nr_foll_pin_acquired",
[I(NR_FOLL_PIN_RELEASED)] = "nr_foll_pin_released",
[I(NR_KERNEL_STACK_KB)] = "nr_kernel_stack",
+#ifdef CONFIG_DYNAMIC_STACK
+ [I(NR_DYNAMIC_STACKS_FAULTS_KB)] = "nr_dynamic_stacks_faults",
+#endif
#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
[I(NR_KERNEL_SCS_KB)] = "nr_shadow_call_stack",
#endif
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (8 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 09/13] fork: Dynamic Kernel Stack accounting David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 11/13] x86/entry/fred: encode frame pointer on entry David Stevens
` (3 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
Store the task pointer in the ptes of the unpopulated pages of dynamic
stacks, to allow the vm_struct pointer to be retrieved without relying
on any locks or on current.
This relies on being able to pack the struct task_struct pointer into a
pte. Since the struct is 64-byte aligned, that gives 5 bits of leeway,
which should be viable on most architectures. Any architecture which
enables dynamic kernel stacks must provide make_data_kpte() and
unpack_data_kpte(), which pack/unpack a right-shifted pointer value
into/from a pte.
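The round trip looks as follows (sketch; make_data_kpte()/unpack_data_kpte()
are the arch-provided helpers described above, used as in the hunks below):
    /* Stash the task pointer in a not-present pte of an unpopulated stack page: */
    pte = make_data_kpte(((unsigned long)tsk) >> TASK_PTR_SHIFT);
    set_pte_at(&init_mm, addr, ptep, pte);

    /* Later, recover it from a pte that is neither present nor none: */
    tsk = (struct task_struct *)(unpack_data_kpte(*ptep) << TASK_PTR_SHIFT);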
Signed-off-by: David Stevens <stevensd@google.com>
---
include/linux/sched/task_stack.h | 1 +
kernel/fork.c | 74 +++++++++++++++++++++++++++++---
mm/vmalloc.c | 2 +-
3 files changed, 69 insertions(+), 8 deletions(-)
diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 7dcff2836d7e..7cf00ce97f7c 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -105,6 +105,7 @@ void exit_task_stack_account(struct task_struct *tsk);
void dynamic_stack_refill_pages(void);
unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize);
bool dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack);
+struct task_struct *task_from_stack_address(unsigned long address);
/*
* Refill and charge for the used pages.
diff --git a/kernel/fork.c b/kernel/fork.c
index 9ac9d23f5f4b..733fc1f58b8b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -296,16 +296,40 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
static DEFINE_PER_CPU(struct page *, dynamic_stack_pages[DYNSTK_PAGE_POOL_NR]);
+#define TASK_PTR_SHIFT (ilog2(__alignof__(struct task_struct)))
+
static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_struct *vm_area)
{
+ int i;
+ unsigned long addr;
+ pte_t *ptep, pte;
+
+ pte = make_data_kpte(((unsigned long)tsk) >> TASK_PTR_SHIFT);
+
tsk->stack_vm_area = vm_area;
tsk->packed_stack = (unsigned long)kasan_reset_tag(vm_area->addr);
+
+ addr = (unsigned long)vm_area->addr;
+ ptep = virt_to_kpte(addr);
+ for (i = vm_area->nr_pages; i < THREAD_SIZE >> PAGE_SHIFT;
+ i++, addr += PAGE_SIZE, ptep++)
+ set_pte_at(&init_mm, addr, ptep, pte);
}
-static void free_vmap_stack(struct vm_struct *vm_area)
+static void free_vmap_stack(struct vm_struct *vm_area, bool was_mapped)
{
int i;
+ /* Clear data kptes since vunmap expects present or none. */
+ if (was_mapped) {
+ unsigned long addr = (unsigned long)vm_area->addr;
+ pte_t *ptep = virt_to_kpte(addr);
+ unsigned int nr_to_clear = (THREAD_SIZE >> PAGE_SHIFT) - vm_area->nr_pages;
+
+ if (nr_to_clear)
+ clear_ptes(&init_mm, addr, ptep, nr_to_clear);
+ }
+
remove_vm_area(vm_area->addr);
for (i = 0; i < vm_area->nr_pages; i++)
@@ -354,7 +378,7 @@ static struct vm_struct *alloc_vmap_stack(int node)
return vm_area;
cleanup_err:
- free_vmap_stack(vm_area);
+ free_vmap_stack(vm_area, false);
return NULL;
}
@@ -477,6 +501,42 @@ unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize)
return i;
}
+noinstr struct task_struct *task_from_stack_address(unsigned long address)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ BUILD_BUG_ON((BITS_PER_LONG - TASK_PTR_SHIFT) > KPTE_AVAILABLE_DATA_BITS);
+
+ if (!is_vmalloc_addr((void *)address))
+ return NULL;
+
+ pgd = pgd_offset_k(address);
+ if (pgd_none(*pgd) || pgd_leaf(*pgd))
+ return NULL;
+
+ p4d = p4d_offset(pgd, address);
+ if (p4d_none(*p4d) || p4d_leaf(*p4d))
+ return NULL;
+
+ pud = pud_offset(p4d, address);
+ if (pud_none(*pud) || pud_leaf(*pud))
+ return NULL;
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd) || pmd_leaf(*pmd))
+ return NULL;
+
+ pte = pte_offset_kernel(pmd, address);
+ if (pte_present(*pte) || pte_none(*pte))
+ return NULL;
+
+ return (struct task_struct *)(unpack_data_kpte(*pte) << TASK_PTR_SHIFT);
+}
+
bool noinstr dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack)
{
unsigned long stack, hole_end, addr;
@@ -570,7 +630,7 @@ static inline struct vm_struct *alloc_vmap_stack(int node)
return stack ? find_vm_area(stack) : NULL;
}
-static inline void free_vmap_stack(struct vm_struct *vm_area)
+static inline void free_vmap_stack(struct vm_struct *vm_area, bool was_mapped)
{
vfree(vm_area->addr);
}
@@ -590,7 +650,7 @@ static void thread_stack_free_work(struct work_struct *work)
if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
return;
- free_vmap_stack(vm_area);
+ free_vmap_stack(vm_area, true);
}
static void thread_stack_delayed_free(struct task_struct *tsk)
@@ -618,7 +678,7 @@ static int free_vm_stack_cache(unsigned int cpu)
if (!vm_area)
continue;
- free_vmap_stack(vm_area);
+ free_vmap_stack(vm_area, true);
cached_vm_stack_areas[i] = NULL;
}
@@ -653,7 +713,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
unsigned long memset_offset = 0;
if (memcg_charge_kernel_stack(vm_area)) {
- free_vmap_stack(vm_area);
+ free_vmap_stack(vm_area, true);
return -ENOMEM;
}
@@ -674,7 +734,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
return -ENOMEM;
if (memcg_charge_kernel_stack(vm_area)) {
- free_vmap_stack(vm_area);
+ free_vmap_stack(vm_area, true);
return -ENOMEM;
}
link_vmap_stack_to_task(tsk, vm_area);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 39b7e118cbce..76955c101180 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -76,7 +76,7 @@ early_param("nohugevmalloc", set_nohugevmalloc);
static const bool vmap_allow_huge = false;
#endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
-bool is_vmalloc_addr(const void *x)
+noinstr bool is_vmalloc_addr(const void *x)
{
unsigned long addr = (unsigned long)kasan_reset_tag(x);
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
^ permalink raw reply related [flat|nested] 21+ messages in thread* [PATCH v2 11/13] x86/entry/fred: encode frame pointer on entry
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (9 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 12/13] x86: Add support for dynamic kernel stacks via FRED David Stevens
` (2 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
Add the missing ENCODE_FRAME_POINTER macro invocation to the FRED_ENTER
macro, to prevent the unwinder from encountering a NULL stack frame
pointer when CONFIG_UNWINDER_FRAME_POINTER is enabled.
Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code")
Signed-off-by: David Stevens <stevensd@google.com>
---
arch/x86/entry/entry_64_fred.S | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 894f7f16eb80..119b8214748e 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -7,6 +7,7 @@
#include <linux/kvm_types.h>
#include <asm/asm.h>
+#include <asm/frame.h>
#include <asm/fred.h>
#include <asm/segment.h>
@@ -19,6 +20,7 @@
UNWIND_HINT_END_OF_STACK
ANNOTATE_NOENDBR
PUSH_AND_CLEAR_REGS
+ ENCODE_FRAME_POINTER
movq %rsp, %rdi /* %rdi -> pt_regs */
.endm
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
^ permalink raw reply related [flat|nested] 21+ messages in thread* [PATCH v2 12/13] x86: Add support for dynamic kernel stacks via FRED
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (10 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 11/13] x86/entry/fred: encode frame pointer on entry David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 13/13] x86: Add support for dynamic kernel stacks via IST David Stevens
2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
Add support for dynamic kernel stack faults by handling #PFs from CPL 0
on stack level 1. Since we can't sleep while on a per-CPU stack, any
page faults that didn't originate in an atomic context need to be
bounced back to the originating stack.
With dynamic kernel stacks, the processor pushing data onto the kernel
thread stack can cause a page fault. The SDM says in the #DF section
that the processor should be able to handle these exceptions serially.
However, this does not seem to actually be handled reliably.
With KVM, I've observed timer interrupts being dropped. The corresponding bit
in VIRR is cleared and the ISR bit in the APIC is set before the #PF is
delivered, but the interrupt handler is not invoked after the kernel
stack fault is resolved. On bare metal, I've observed frequent hangs due
to threads getting stuck on folio_wait_bit_common. I haven't traced this
to an exact interrupt being lost, but moving interrupts to stack level 1
reduces boot failures from >10% to 0 in 1000s of attempts.
To work around this, external interrupts are also moved to stack level
1, and unconditionally bounced back to the originating stack.
Bouncing page faults and external interrupts through stack level 1 while
in CPL 0 adds a small but non-trivial overhead to those paths. The
shared entry point for events received in CPL 0 also becomes slightly
more expensive, due to the need to detect page faults and external
interrupts.
Since enabling HAVE_ARCH_DYNAMIC_STACK requires unconditional support,
the config is enabled in the next patch, which adds dynamic stack
support for traditional interrupt delivery.
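
For orientation, here is a rough, standalone C sketch of the decision the
kernel-mode entry path below ends up making (illustrative only: the enum
names, the function name and the main() driver are invented for this
sketch; the boolean parameters stand in for the checks performed in
entry_64_fred.S and handle_dynamic_stack_kernel_faults()):

#include <stdbool.h>
#include <stdio.h>

/* Where a CPL-0 event ends up being handled (names invented for this sketch). */
enum dispatch {
	RESUME_ONLY,		/* dynamic stack fault fixed up, just ERETS back */
	HANDLE_IN_PLACE,	/* run the handler on the stack we arrived on */
	HANDLE_ON_TASK_STACK,	/* copy the frame and switch to the task stack */
};

static enum dispatch kernel_event_dispatch(bool is_page_fault,
					   bool is_external_irq,
					   bool resolved_stack_fault,
					   bool atomic_context)
{
	if (is_page_fault) {
		if (resolved_stack_fault)
			return RESUME_ONLY;
		if (atomic_context)
			return HANDLE_IN_PLACE;		/* the fault handler won't sleep */
		return HANDLE_ON_TASK_STACK;		/* bounce back before it may sleep */
	}
	if (is_external_irq)
		return HANDLE_ON_TASK_STACK;		/* always bounced, as described above */
	return HANDLE_IN_PLACE;				/* everything else: fred_entry_from_kernel() */
}

int main(void)
{
	/* A kernel #PF that is not a stack fault, taken from a sleepable context: */
	printf("%d\n", kernel_event_dispatch(true, false, false, false));
	return 0;
}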
Signed-off-by: David Stevens <stevensd@google.com>
---
arch/x86/entry/entry_64_fred.S | 55 +++++++++++++++++++++++++++++++
arch/x86/include/asm/pgtable_64.h | 36 ++++++++++++++++++++
arch/x86/include/asm/traps.h | 5 +++
arch/x86/kernel/fred.c | 20 ++++++++---
arch/x86/mm/dump_pagetables.c | 14 +++++---
arch/x86/mm/fault.c | 53 +++++++++++++++++++++++++++++
6 files changed, 174 insertions(+), 9 deletions(-)
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 119b8214748e..7202655ef662 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -54,7 +54,62 @@ SYM_CODE_END(asm_fred_entrypoint_user)
.org asm_fred_entrypoint_user + 256, 0xcc
SYM_CODE_START_NOALIGN(asm_fred_entrypoint_kernel)
FRED_ENTER
+
+#ifdef CONFIG_DYNAMIC_STACK
+ /* Extract event type and vector from augmented SS. */
+ movl (SS + 4)(%rsp), %esi
+ andl $0x000f00ff, %esi
+
+ /* Check if event type is hardware exception and vector is #PF. */
+ cmpl $0x0003000e, %esi
+ jne .Lcheck_for_extint
+
+ call handle_dynamic_stack_kernel_faults
+ testq %rax, %rax
+ jz .Lentrypoint_done
+ cmpq %rax, %rsp
+ je .Lskip_stack_switch
+ jmp .Ldo_stack_switch
+
+.Lcheck_for_extint:
+ /* Check if event type is external interrupt. */
+ andl $0xf0000, %esi
+ testl %esi, %esi
+ jne .Lcall_primary_entry
+ call switch_to_kstack
+
+.Ldo_stack_switch:
+#ifdef CONFIG_DEBUG_ENTRY
+ /*
+ * We should only do a stack switch for an external interrupt or a page
+ * fault in a non-atomic context. These should only ever happen in user
+ * space or from a regular kernel stack (i.e. CSL == 0).
+ */
+ movw (CS + 2)(%rsp), %si
+ testw $0x3, %si
+ jz .Lcsl_ok
+ ud2
+.Lcsl_ok:
+#endif
+ movq %rax, %rsp
+
+ UNWIND_HINT_REGS
+ ENCODE_FRAME_POINTER
+
+ mov $MSR_IA32_FRED_CONFIG, %ecx
+ rdmsr
+ andl $~0x3, %eax
+ wrmsr
+
+ movq %rsp, %rdi
+#endif
+
+.Lskip_stack_switch:
+ movq %rsp, %rdi
+.Lcall_primary_entry:
call fred_entry_from_kernel
+
+.Lentrypoint_done:
FRED_EXIT
ERETS
SYM_CODE_END(asm_fred_entrypoint_kernel)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index ce45882ccd07..fbb042c89d13 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -237,6 +237,42 @@ static inline void native_pgd_clear(pgd_t *pgd)
#define __swp_entry_to_pte(x) (__pte((x).val))
#define __swp_entry_to_pmd(x) (__pmd((x).val))
+#ifdef CONFIG_DYNAMIC_STACK
+
+/*
+ * Skip the present bit. And skip dirty and accessed bits due to
+ * an erratum where they can be incorrectly set on non-present ptes.
+ *
+ * Also skip bit 8, which is used for pte_present for PROT_NONE. This
+ * isn't necessary in the strictest sense since PROT_NONE doesn't apply
+ * to kernel PTEs, but it's easier to let pte_present just continue
+ * to work.
+ */
+#define KPTE_AVAILABLE_DATA_BITS 58
+
+static inline pte_t make_data_kpte(unsigned long val)
+{
+ unsigned long low_part, mid_part, high_part;
+
+ low_part = (val & 0xf) << 1;
+ mid_part = (val & 0x10) << 3;
+ high_part = (val & ~0x1f) << 4;
+
+ return __pte(low_part | mid_part | high_part);
+}
+
+static inline unsigned long unpack_data_kpte(pte_t pte)
+{
+ unsigned long val = pte_val(pte), high_part, mid_part, low_part;
+
+ low_part = (val >> 1) & 0xf;
+ mid_part = (val >> 3) & 0x10;
+ high_part = (val >> 4) & ~0x1f;
+
+ return low_part | mid_part | high_part;
+}
+#endif /* CONFIG_DYNAMIC_STACK */
+
extern void cleanup_highmap(void);
#define HAVE_ARCH_UNMAPPED_AREA
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 3f24cc472ce9..6b55eb91aea6 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -15,6 +15,11 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
asmlinkage __visible notrace
struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs);
asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
+
+#ifdef CONFIG_DYNAMIC_STACK
+asmlinkage __visible noinstr unsigned long switch_to_kstack(struct pt_regs *regs);
+asmlinkage __visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_regs *regs);
+#endif
#endif
extern int ibt_selftest(void);
diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
index e736b19e18de..01d727420d1f 100644
--- a/arch/x86/kernel/fred.c
+++ b/arch/x86/kernel/fred.c
@@ -9,6 +9,8 @@
/* #DB in the kernel would imply the use of a kernel debugger. */
#define FRED_DB_STACK_LEVEL 1UL
+#define FRED_PF_STACK_LEVEL 1UL
+#define FRED_INT_STACK_LEVEL 1UL
#define FRED_NMI_STACK_LEVEL 2UL
#define FRED_MC_STACK_LEVEL 2UL
/*
@@ -25,6 +27,11 @@
DEFINE_PER_CPU(unsigned long, fred_rsp0);
EXPORT_PER_CPU_SYMBOL(fred_rsp0);
+#define FRED_CONFIG_VAL(int_stklvl) \
+ (FRED_CONFIG_REDZONE /* Reserve for CALL emulation */ | \
+ FRED_CONFIG_INT_STKLVL(int_stklvl) | \
+ FRED_CONFIG_ENTRYPOINT(asm_fred_entrypoint_user))
+
void cpu_init_fred_exceptions(void)
{
/* When FRED is enabled by default, remove this log message */
@@ -44,11 +51,7 @@ void cpu_init_fred_exceptions(void)
*/
loadsegment(ss, __KERNEL_DS);
- wrmsrq(MSR_IA32_FRED_CONFIG,
- /* Reserve for CALL emulation */
- FRED_CONFIG_REDZONE |
- FRED_CONFIG_INT_STKLVL(0) |
- FRED_CONFIG_ENTRYPOINT(asm_fred_entrypoint_user));
+ wrmsrq(MSR_IA32_FRED_CONFIG, FRED_CONFIG_VAL(0));
wrmsrq(MSR_IA32_FRED_STKLVLS, 0);
@@ -84,8 +87,15 @@ void cpu_init_fred_rsps(void)
FRED_STKLVL(X86_TRAP_DB, FRED_DB_STACK_LEVEL) |
FRED_STKLVL(X86_TRAP_NMI, FRED_NMI_STACK_LEVEL) |
FRED_STKLVL(X86_TRAP_MC, FRED_MC_STACK_LEVEL) |
+#ifdef CONFIG_DYNAMIC_STACK
+ FRED_STKLVL(X86_TRAP_PF, FRED_PF_STACK_LEVEL) |
+#endif
FRED_STKLVL(X86_TRAP_DF, FRED_DF_STACK_LEVEL));
+#ifdef CONFIG_DYNAMIC_STACK
+ wrmsrq(MSR_IA32_FRED_CONFIG, FRED_CONFIG_VAL(FRED_INT_STACK_LEVEL));
+#endif
+
/* The FRED equivalents to IST stacks... */
wrmsrq(MSR_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
wrmsrq(MSR_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 2afa7a23340e..5c33c33e93fe 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -306,11 +306,17 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
static const char units[] = "BKMGTPE";
struct seq_file *m = st->seq;
- new_prot = val & PTE_FLAGS_MASK;
- if (!val)
+ /* Ignore prot/eff from data kptes. */
+ if (val & _PAGE_PRESENT || addr < address_markers[KERNEL_SPACE_NR].start_address) {
+ new_prot = val & PTE_FLAGS_MASK;
+ if (!val)
+ new_eff = 0;
+ else
+ new_eff = st->prot_levels[level];
+ } else {
+ new_prot = 0;
new_eff = 0;
- else
- new_eff = st->prot_levels[level];
+ }
/*
* If we have a "break" in the series, we need to flush the state that
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b83a06739b51..40d518d9f562 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1480,6 +1480,59 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
local_irq_disable();
}
+#ifdef CONFIG_DYNAMIC_STACK
+
+static noinstr unsigned long copy_stack_data(struct pt_regs *regs)
+{
+ unsigned long new_sp;
+ unsigned long data_len;
+
+ new_sp = regs->sp - (FRED_CONFIG_REDZONE_AMOUNT << 6);
+ new_sp &= FRED_STACK_FRAME_RSP_MASK;
+ data_len = sizeof(struct fred_frame);
+ new_sp -= data_len;
+
+ memcpy((void *)new_sp, regs, data_len);
+
+ return new_sp;
+}
+
+__visible noinstr unsigned long switch_to_kstack(struct pt_regs *regs)
+{
+ return copy_stack_data(regs);
+}
+
+#define ALIGN_TO_STACK(addr) ((addr) & ~(THREAD_ALIGN - 1))
+
+__visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_regs *regs)
+{
+ unsigned long address;
+ struct task_struct *tsk;
+ bool on_stack;
+
+ address = fred_event_data(regs);
+ if (fault_in_kernel_space(address) && !in_nmi()) {
+ tsk = task_from_stack_address(address);
+
+ if (tsk && dynamic_stack_fault(tsk, address, &on_stack)) {
+ WARN_ON_ONCE(tsk != current &&
+ ALIGN_TO_STACK(regs->sp) != ALIGN_TO_STACK(address));
+ return 0;
+ }
+ }
+
+ /*
+ * The regular fault handler won't sleep when executing in an
+ * atomic context, so we can complete the #PF directly on the
+ * #PF stack.
+ */
+ if (in_atomic())
+ return (unsigned long)regs;
+ else
+ return copy_stack_data(regs);
+}
+#endif
+
DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
irqentry_state_t state;
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
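
As a side note on the make_data_kpte()/unpack_data_kpte() helpers added
above: the shifts place the payload around PTE bit 0 (present), bits 5/6
(accessed/dirty) and bit 8. A minimal standalone userspace round-trip
check of that packing (illustrative only; it re-implements the same
shifts on plain unsigned long values instead of pte_t):

#include <assert.h>
#include <stdio.h>

static unsigned long pack(unsigned long val)
{
	unsigned long low_part  = (val & 0xf) << 1;
	unsigned long mid_part  = (val & 0x10) << 3;
	unsigned long high_part = (val & ~0x1fUL) << 4;

	return low_part | mid_part | high_part;
}

static unsigned long unpack(unsigned long pte)
{
	unsigned long low_part  = (pte >> 1) & 0xf;
	unsigned long mid_part  = (pte >> 3) & 0x10;
	unsigned long high_part = (pte >> 4) & ~0x1fUL;

	return low_part | mid_part | high_part;
}

int main(void)
{
	/* 0x3ffffffffffffff is the largest 58-bit payload (KPTE_AVAILABLE_DATA_BITS) */
	unsigned long vals[] = { 1, 0x1f, 0xdeadbeefUL, 0x3ffffffffffffffUL };
	/* present, accessed, dirty and bit 8 must never be set by the packing */
	unsigned long skipped = (1UL << 0) | (1UL << 5) | (1UL << 6) | (1UL << 8);

	for (unsigned int i = 0; i < sizeof(vals) / sizeof(vals[0]); i++) {
		unsigned long pte = pack(vals[i]);

		assert(!(pte & skipped));
		assert(unpack(pte) == vals[i]);
	}
	printf("round trip ok\n");
	return 0;
}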
^ permalink raw reply related [flat|nested] 21+ messages in thread* [PATCH v2 13/13] x86: Add support for dynamic kernel stacks via IST
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (11 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 12/13] x86: Add support for dynamic kernel stacks via FRED David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
On hardware that doesn't support FRED, use ISTs to support dynamic
kernel stacks. In the same way as we do when using FRED, any regular #PF
gets manually moved back onto the original stack. Additionally, we take
a similar approach to FRED to avoid issues with interrupt re-delivery and
handle external interrupts on an IST stack.
The fact that IST stacks aren't reentrant means we have to be very
careful to avoid triggering a #PF while the #PF IST is being used. Since
NMIs can trigger #PFs, we have the NMI handler temporarily install a
secondary #PF IST stack if it detects it came from the #PF IST stack, to
avoid clobbering that stack. Note that although iret unmasking of NMIs
can cause us to get a second NMI while an NMI is on the #PF IST stack,
the actual handling of that secondary NMI will be delayed until after
the original NMI (and thus the #PF) is resolved. As such, one extra #PF
IST stack is sufficient to resolve reentrancy issues with respect to
NMIs.
For #DB exceptions, we make sure that all code that executes on the #PF
IST stack is noinstr. Unfortunately this is not 100% bulletproof, since
the handler needs to access data outside of cpu_entry_area (e.g.
current, current's stack, vmap stack page tables), and the user could
have set hardware breakpoints on accesses to those addresses. Rather
than handle this edge case, which should only occur during manual
debugging, we just detect reentrancy on the #PF IST stack and abort.
It is possible for #MCE to occur on the #PF IST stack, but the #MCE
handler shouldn't generate new #PFs. The reentrancy check on the #PF
stack will trigger if any recoverable #MCEs do generate #PFs - if there
are actually reports of it happening, we can address it then.
Bouncing all #PF and external interrupts through IST stacks adds some
overhead. However, such events from userspace already had to bounce
through the CPU entry stack, so introducing ISTs only adds notable
overhead for #PFs and external interrupts that occur while in CPL 0.
Signed-off-by: David Stevens <stevensd@google.com>
---
arch/x86/Kconfig | 1 +
arch/x86/entry/entry_64.S | 49 +++++++++++++++++--
arch/x86/include/asm/cpu_entry_area.h | 18 +++++++
arch/x86/include/asm/idtentry.h | 38 ++++++++++++++-
arch/x86/include/asm/page_64_types.h | 10 +++-
arch/x86/include/asm/processor.h | 6 +++
arch/x86/kernel/cpu/common.c | 11 +++++
arch/x86/kernel/dumpstack_64.c | 10 +++-
arch/x86/kernel/idt.c | 57 +++++++++++++---------
arch/x86/kernel/nmi.c | 9 ++++
arch/x86/lib/usercopy.c | 9 ++++
arch/x86/mm/cpu_entry_area.c | 17 +++++++
arch/x86/mm/fault.c | 70 ++++++++++++++++++++++-----
13 files changed, 262 insertions(+), 43 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e2df1b147184..182fda721b0d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -212,6 +212,7 @@ config X86
select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD
select HAVE_ARCH_USERFAULTFD_MINOR if X86_64 && USERFAULTFD
select HAVE_ARCH_VMAP_STACK if X86_64
+ select HAVE_ARCH_DYNAMIC_STACK if X86_64 && !XEN_PV
select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
select HAVE_ARCH_WITHIN_STACK_FRAMES
select HAVE_ASM_MODVERSIONS
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 42447b1e1dff..02dbd00cc4bb 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -286,7 +286,7 @@ SYM_CODE_END(xen_error_entry)
* @cfunc: C function to be called
* @has_error_code: Hardware pushed error code on stack
*/
-.macro idtentry_body cfunc has_error_code:req
+.macro idtentry_body cfunc has_error_code:req kernel_reentry_fn=
/*
* Call error_entry() and switch to the task stack if from userspace.
@@ -302,6 +302,38 @@ SYM_CODE_END(xen_error_entry)
ENCODE_FRAME_POINTER
UNWIND_HINT_REGS
+#ifdef CONFIG_DYNAMIC_STACK
+.ifnb \kernel_reentry_fn
+ /*
+ * For entry from userspace, we've also already moved off of
+ * the IST after calling error_entry above.
+ */
+ testb $3, CS(%rsp)
+ jnz .Lregular_fault_\cfunc
+
+ /* Check and set the reentry canary reserved by IST_ENTRY_OFFSET. */
+ cmpq $0, (SS + 8)(%rsp)
+ jne .List_reentry_abort_\cfunc
+ movq $1, (SS + 8)(%rsp)
+
+ movq %rsp, %rdi
+ call \kernel_reentry_fn
+
+ movq $0, (SS + 8)(%rsp)
+
+ testq %rax, %rax
+ jnz .Lchange_stack_\cfunc
+ jmp error_return
+
+.Lchange_stack_\cfunc:
+ movq %rax, %rsp
+
+ ENCODE_FRAME_POINTER
+ UNWIND_HINT_REGS
+.Lregular_fault_\cfunc:
+.endif
+#endif
+
movq %rsp, %rdi /* pt_regs pointer into 1st argument*/
.if \has_error_code == 1
@@ -314,6 +346,13 @@ SYM_CODE_END(xen_error_entry)
call \cfunc
jmp error_return
+
+#ifdef CONFIG_DYNAMIC_STACK
+.ifnb \kernel_reentry_fn
+.List_reentry_abort_\cfunc:
+ ud2
+.endif
+#endif
.endm
/**
@@ -322,11 +361,13 @@ SYM_CODE_END(xen_error_entry)
* @asmsym: ASM symbol for the entry point
* @cfunc: C function to be called
* @has_error_code: Hardware pushed error code on stack
+ * @kernel_reentry_fn: If set, C function to be called on re-entry from
+ * kernel space before the main handler is invoked.
*
* The macro emits code to set up the kernel context for straight forward
* and simple IDT entries. No IST stack, no paranoid entry checks.
*/
-.macro idtentry vector asmsym cfunc has_error_code:req
+.macro idtentry vector asmsym cfunc has_error_code:req kernel_reentry_fn=
SYM_CODE_START(\asmsym)
.if \vector == X86_TRAP_BP
@@ -358,7 +399,7 @@ SYM_CODE_START(\asmsym)
.Lfrom_usermode_no_gap_\@:
.endif
- idtentry_body \cfunc \has_error_code
+ idtentry_body \cfunc \has_error_code \kernel_reentry_fn
_ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
@@ -375,7 +416,7 @@ SYM_CODE_END(\asmsym)
*/
.macro idtentry_irq vector cfunc
.p2align CONFIG_X86_L1_CACHE_SHIFT
- idtentry \vector asm_\cfunc \cfunc has_error_code=1
+ idtentry \vector asm_\cfunc \cfunc has_error_code=1 kernel_reentry_fn=switch_to_kstack
.endm
/**
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 462fc34f1317..5bce3259edee 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -26,6 +26,12 @@
char DB_stack[EXCEPTION_STKSZ]; \
char MCE_stack_guard[guardsize]; \
char MCE_stack[EXCEPTION_STKSZ]; \
+ char PF_stack_guard[guardsize]; \
+ char PF_stack[EXCEPTION_STKSZ]; \
+ char PF2_stack_guard[guardsize]; \
+ char PF2_stack[EXCEPTION_STKSZ]; \
+ char UDI_stack_guard[guardsize]; \
+ char UDI_stack[EXCEPTION_STKSZ]; \
char VC_stack_guard[guardsize]; \
char VC_stack[optional_stack_size]; \
char VC2_stack_guard[guardsize]; \
@@ -50,6 +56,9 @@ enum exception_stack_ordering {
ESTACK_NMI,
ESTACK_DB,
ESTACK_MCE,
+ ESTACK_PF,
+ ESTACK_PF2,
+ ESTACK_UDI,
ESTACK_VC,
ESTACK_VC2,
N_EXCEPTION_STACKS
@@ -144,6 +153,15 @@ static __always_inline struct entry_stack *cpu_entry_stack(int cpu)
return &get_cpu_entry_area(cpu)->entry_stack_page.stack;
}
+#ifdef CONFIG_DYNAMIC_STACK
+bool is_pf_ist_stack(unsigned long addr);
+#else
+static inline bool is_pf_ist_stack(unsigned long addr)
+{
+ return false;
+}
+#endif
+
#define __this_cpu_ist_top_va(name) \
CEA_ESTACK_TOP(__this_cpu_read(cea_exception_stacks), name)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 42bf6a58ec36..d8c846d28a1d 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -163,6 +163,16 @@ noinstr void fred_##func(struct pt_regs *regs)
#define DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func) \
DECLARE_IDTENTRY_ERRORCODE(vector, func)
+/**
+ * DECLARE_IDTENTRY_PF - Declare functions for page fault entry point
+ * @vector: Vector number (ignored for C)
+ * @func: Function name of the entry point
+ *
+ * Maps to @DECLARE_IDTENTRY_ERRORCODE().
+ */
+#define DECLARE_IDTENTRY_PF(vector, func) \
+ DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func)
+
/**
* DEFINE_IDTENTRY_RAW_ERRORCODE - Emit code for raw IDT entry points
* @func: Function name of the entry point
@@ -391,6 +401,15 @@ static __always_inline void __##func(struct pt_regs *regs)
#define DEFINE_IDTENTRY_DF(func) \
DEFINE_IDTENTRY_RAW_ERRORCODE(func)
+/**
+ * DEFINE_IDTENTRY_PF - Emit code for page fault
+ * @func: Function name of the entry point
+ *
+ * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE
+ */
+#define DEFINE_IDTENTRY_PF(func) \
+ DEFINE_IDTENTRY_RAW_ERRORCODE(func)
+
/**
* DEFINE_IDTENTRY_VC_KERNEL - Emit code for VMM communication handler
* when raised from kernel mode
@@ -480,6 +499,15 @@ void fred_install_sysvec(unsigned int vector, const idtentry_t function);
#define DECLARE_IDTENTRY_ERRORCODE(vector, func) \
idtentry vector asm_##func func has_error_code=1
+#ifdef CONFIG_DYNAMIC_STACK
+#define DECLARE_IDTENTRY_PF(vector, func) \
+ idtentry vector asm_##func func has_error_code=1 \
+ kernel_reentry_fn=handle_dynamic_stack_kernel_faults
+#else
+#define DECLARE_IDTENTRY_PF(vector, func) \
+ DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func)
+#endif
+
/* Special case for 32bit IRET 'trap'. Do not emit ASM code */
#define DECLARE_IDTENTRY_SW(vector, func)
@@ -494,8 +522,14 @@ void fred_install_sysvec(unsigned int vector, const idtentry_t function);
idtentry_irq vector func
/* System vector entries */
+#ifdef CONFIG_DYNAMIC_STACK
+#define DECLARE_IDTENTRY_SYSVEC(vector, func) \
+ idtentry vector asm_##func func has_error_code=0 \
+ kernel_reentry_fn=switch_to_kstack
+#else
#define DECLARE_IDTENTRY_SYSVEC(vector, func) \
DECLARE_IDTENTRY(vector, func)
+#endif
#ifdef CONFIG_X86_64
# define DECLARE_IDTENTRY_MCE(vector, func) \
@@ -615,7 +649,7 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC, exc_alignment_check);
/* Raw exception entries which need extra work */
DECLARE_IDTENTRY_RAW(X86_TRAP_UD, exc_invalid_op);
DECLARE_IDTENTRY_RAW(X86_TRAP_BP, exc_int3);
-DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_PF, exc_page_fault);
+DECLARE_IDTENTRY_PF(X86_TRAP_PF, exc_page_fault);
#if defined(CONFIG_IA32_EMULATION)
DECLARE_IDTENTRY_RAW(IA32_SYSCALL_VECTOR, int80_emulation);
@@ -699,7 +733,7 @@ DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR, sysvec_x86_platform_ipi);
#endif
#ifdef CONFIG_SMP
-DECLARE_IDTENTRY(RESCHEDULE_VECTOR, sysvec_reschedule_ipi);
+DECLARE_IDTENTRY_SYSVEC(RESCHEDULE_VECTOR, sysvec_reschedule_ipi);
DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR, sysvec_reboot);
DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR, sysvec_call_function_single);
DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_VECTOR, sysvec_call_function);
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 7400dab373fe..b0b60f83a531 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -28,7 +28,15 @@
#define IST_INDEX_NMI 1
#define IST_INDEX_DB 2
#define IST_INDEX_MCE 3
-#define IST_INDEX_VC 4
+#define IST_INDEX_PF 4
+#define IST_INDEX_UDI 5
+#define IST_INDEX_VC 6
+
+/*
+ * Offset used for some IST stacks to reserve a slot for a re-entry
+ * canary. At the very top of the stack for cache friendliness.
+ */
+#define IST_ENTRY_OFFSET 8
/*
* Set __PAGE_OFFSET to the most negative possible address +
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a24c7805acdb..fa790731dea0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -573,6 +573,12 @@ static inline void load_sp0(unsigned long sp0)
#endif /* CONFIG_PARAVIRT_XXL */
+#ifdef CONFIG_DYNAMIC_STACK
+void install_nmi_pf_stack(bool use_nmi_pf_stack);
+#else
+static inline void install_nmi_pf_stack(bool use_nmi_pf_stack) {}
+#endif
+
unsigned long __get_wchan(struct task_struct *p);
extern void select_idle_routine(void);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ec0670114efa..d90a01e2fdd2 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2377,6 +2377,8 @@ static inline void tss_setup_ist(struct tss_struct *tss)
tss->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(NMI);
tss->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(DB);
tss->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(MCE);
+ tss->x86_tss.ist[IST_INDEX_PF] = __this_cpu_ist_top_va(PF) - IST_ENTRY_OFFSET;
+ tss->x86_tss.ist[IST_INDEX_UDI] = __this_cpu_ist_top_va(UDI) - IST_ENTRY_OFFSET;
/* Only mapped when SEV-ES is active */
tss->x86_tss.ist[IST_INDEX_VC] = __this_cpu_ist_top_va(VC);
}
@@ -2665,3 +2667,12 @@ void __init arch_cpu_finalize_init(void)
*/
mem_encrypt_init();
}
+
+#ifdef CONFIG_DYNAMIC_STACK
+noinstr void install_nmi_pf_stack(bool use_nmi_pf_stack)
+{
+ unsigned long stack = use_nmi_pf_stack ? __this_cpu_ist_top_va(PF2)
+ : __this_cpu_ist_top_va(PF);
+ this_cpu_write(cpu_tss_rw.x86_tss.ist[IST_INDEX_PF], stack - IST_ENTRY_OFFSET);
+}
+#endif
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 6c5defd6569a..6784d31d3eb3 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -24,13 +24,16 @@ static const char * const exception_stack_names[] = {
[ ESTACK_NMI ] = "NMI",
[ ESTACK_DB ] = "#DB",
[ ESTACK_MCE ] = "#MC",
+ [ ESTACK_PF ] = "#PF",
+ [ ESTACK_PF2 ] = "#PF2",
+ [ ESTACK_UDI ] = "#UDI",
[ ESTACK_VC ] = "#VC",
[ ESTACK_VC2 ] = "#VC2",
};
const char *stack_type_name(enum stack_type type)
{
- BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+ BUILD_BUG_ON(N_EXCEPTION_STACKS != 9);
if (type == STACK_TYPE_TASK)
return "TASK";
@@ -87,6 +90,9 @@ struct estack_pages estack_pages[CEA_ESTACK_PAGES] ____cacheline_aligned = {
EPAGERANGE(NMI),
EPAGERANGE(DB),
EPAGERANGE(MCE),
+ EPAGERANGE(PF),
+ EPAGERANGE(PF2),
+ EPAGERANGE(UDI),
EPAGERANGE(VC),
EPAGERANGE(VC2),
};
@@ -98,7 +104,7 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
struct pt_regs *regs;
unsigned int k;
- BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+ BUILD_BUG_ON(N_EXCEPTION_STACKS != 9);
begin = (unsigned long)__this_cpu_read(cea_exception_stacks);
/*
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 260456588756..7626fa7adfb3 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -116,6 +116,10 @@ static const __initconst struct idt_data def_idts[] = {
ISTG(X86_TRAP_VC, asm_exc_vmm_communication, IST_INDEX_VC),
#endif
+#ifdef CONFIG_DYNAMIC_STACK
+ ISTG(X86_TRAP_PF, asm_exc_page_fault, IST_INDEX_PF),
+#endif
+
SYSG(X86_TRAP_OF, asm_exc_overflow),
};
@@ -127,47 +131,55 @@ static const struct idt_data ia32_idt[] __initconst = {
#endif
};
+#ifdef CONFIG_DYNAMIC_STACK
+#define EXTERNAL_INTR(_vector, _addr) ISTG(_vector, _addr, IST_INDEX_UDI)
+#define EXTERNAL_INTR_IST_VALUE (IST_INDEX_UDI + 1)
+#else
+#define EXTERNAL_INTR(_vector, _addr) INTG(_vector, _addr)
+#define EXTERNAL_INTR_IST_VALUE 0
+#endif
+
/*
* The APIC and SMP idt entries
*/
static const __initconst struct idt_data apic_idts[] = {
#ifdef CONFIG_SMP
- INTG(RESCHEDULE_VECTOR, asm_sysvec_reschedule_ipi),
- INTG(CALL_FUNCTION_VECTOR, asm_sysvec_call_function),
- INTG(CALL_FUNCTION_SINGLE_VECTOR, asm_sysvec_call_function_single),
- INTG(REBOOT_VECTOR, asm_sysvec_reboot),
+ EXTERNAL_INTR(RESCHEDULE_VECTOR, asm_sysvec_reschedule_ipi),
+ EXTERNAL_INTR(CALL_FUNCTION_VECTOR, asm_sysvec_call_function),
+ EXTERNAL_INTR(CALL_FUNCTION_SINGLE_VECTOR, asm_sysvec_call_function_single),
+ EXTERNAL_INTR(REBOOT_VECTOR, asm_sysvec_reboot),
#endif
#ifdef CONFIG_X86_THERMAL_VECTOR
- INTG(THERMAL_APIC_VECTOR, asm_sysvec_thermal),
+ EXTERNAL_INTR(THERMAL_APIC_VECTOR, asm_sysvec_thermal),
#endif
#ifdef CONFIG_X86_MCE_THRESHOLD
- INTG(THRESHOLD_APIC_VECTOR, asm_sysvec_threshold),
+ EXTERNAL_INTR(THRESHOLD_APIC_VECTOR, asm_sysvec_threshold),
#endif
#ifdef CONFIG_X86_MCE_AMD
- INTG(DEFERRED_ERROR_VECTOR, asm_sysvec_deferred_error),
+ EXTERNAL_INTR(DEFERRED_ERROR_VECTOR, asm_sysvec_deferred_error),
#endif
#ifdef CONFIG_X86_LOCAL_APIC
- INTG(LOCAL_TIMER_VECTOR, asm_sysvec_apic_timer_interrupt),
- INTG(X86_PLATFORM_IPI_VECTOR, asm_sysvec_x86_platform_ipi),
+ EXTERNAL_INTR(LOCAL_TIMER_VECTOR, asm_sysvec_apic_timer_interrupt),
+ EXTERNAL_INTR(X86_PLATFORM_IPI_VECTOR, asm_sysvec_x86_platform_ipi),
# if IS_ENABLED(CONFIG_KVM)
- INTG(POSTED_INTR_VECTOR, asm_sysvec_kvm_posted_intr_ipi),
- INTG(POSTED_INTR_WAKEUP_VECTOR, asm_sysvec_kvm_posted_intr_wakeup_ipi),
- INTG(POSTED_INTR_NESTED_VECTOR, asm_sysvec_kvm_posted_intr_nested_ipi),
+ EXTERNAL_INTR(POSTED_INTR_VECTOR, asm_sysvec_kvm_posted_intr_ipi),
+ EXTERNAL_INTR(POSTED_INTR_WAKEUP_VECTOR, asm_sysvec_kvm_posted_intr_wakeup_ipi),
+ EXTERNAL_INTR(POSTED_INTR_NESTED_VECTOR, asm_sysvec_kvm_posted_intr_nested_ipi),
# endif
#ifdef CONFIG_GUEST_PERF_EVENTS
INTG(PERF_GUEST_MEDIATED_PMI_VECTOR, asm_sysvec_perf_guest_mediated_pmi_handler),
#endif
# ifdef CONFIG_IRQ_WORK
- INTG(IRQ_WORK_VECTOR, asm_sysvec_irq_work),
+ EXTERNAL_INTR(IRQ_WORK_VECTOR, asm_sysvec_irq_work),
# endif
- INTG(SPURIOUS_APIC_VECTOR, asm_sysvec_spurious_apic_interrupt),
- INTG(ERROR_APIC_VECTOR, asm_sysvec_error_interrupt),
+ EXTERNAL_INTR(SPURIOUS_APIC_VECTOR, asm_sysvec_spurious_apic_interrupt),
+ EXTERNAL_INTR(ERROR_APIC_VECTOR, asm_sysvec_error_interrupt),
# ifdef CONFIG_X86_POSTED_MSI
- INTG(POSTED_MSI_NOTIFICATION_VECTOR, asm_sysvec_posted_msi_notification),
+ EXTERNAL_INTR(POSTED_MSI_NOTIFICATION_VECTOR, asm_sysvec_posted_msi_notification),
# endif
#endif
};
@@ -206,11 +218,12 @@ idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sy
}
}
-static __init void set_intr_gate(unsigned int n, const void *addr)
+static __init void set_intr_gate(unsigned int n, const void *addr, int ist)
{
struct idt_data data;
init_idt_data(&data, n, addr);
+ data.bits.ist = ist;
idt_setup_from_table(idt_table, &data, 1, false);
}
@@ -293,7 +306,7 @@ void __init idt_setup_apic_and_irq_gates(void)
for_each_clear_bit_from(i, system_vectors, FIRST_SYSTEM_VECTOR) {
entry = irq_entries_start + IDT_ALIGN * (i - FIRST_EXTERNAL_VECTOR);
- set_intr_gate(i, entry);
+ set_intr_gate(i, entry, EXTERNAL_INTR_IST_VALUE);
}
#ifdef CONFIG_X86_LOCAL_APIC
@@ -304,7 +317,7 @@ void __init idt_setup_apic_and_irq_gates(void)
* /proc/interrupts.
*/
entry = spurious_entries_start + IDT_ALIGN * (i - FIRST_SYSTEM_VECTOR);
- set_intr_gate(i, entry);
+ set_intr_gate(i, entry, EXTERNAL_INTR_IST_VALUE);
}
#endif
/* Map IDT into CPU entry area and reload it. */
@@ -325,10 +338,10 @@ void __init idt_setup_early_handler(void)
int i;
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
- set_intr_gate(i, early_idt_handler_array[i]);
+ set_intr_gate(i, early_idt_handler_array[i], DEFAULT_STACK);
#ifdef CONFIG_X86_32
for ( ; i < NR_VECTORS; i++)
- set_intr_gate(i, early_ignore_irq);
+ set_intr_gate(i, early_ignore_irq, DEFAULT_STACK);
#endif
load_idt(&idt_descr);
}
@@ -352,5 +365,5 @@ void __init idt_install_sysvec(unsigned int n, const void *function)
return;
if (!WARN_ON(test_and_set_bit(n, system_vectors)))
- set_intr_gate(n, function);
+ set_intr_gate(n, function, EXTERNAL_INTR_IST_VALUE);
}
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 3d239ed12744..a2444b9d5b71 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -37,6 +37,7 @@
#include <asm/microcode.h>
#include <asm/sev.h>
#include <asm/fred.h>
+#include <asm/cpu_entry_area.h>
#define CREATE_TRACE_POINTS
#include <trace/events/nmi.h>
@@ -581,6 +582,11 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
if (IS_ENABLED(CONFIG_NMI_CHECK_CPU) && ignore_nmis) {
WRITE_ONCE(nsp->idt_ignored, nsp->idt_ignored + 1);
} else if (!ignore_nmis) {
+ bool protect_pf_ist_stack = is_pf_ist_stack(regs->sp);
+
+ if (protect_pf_ist_stack)
+ install_nmi_pf_stack(true);
+
if (IS_ENABLED(CONFIG_NMI_CHECK_CPU)) {
WRITE_ONCE(nsp->idt_nmi_seq, nsp->idt_nmi_seq + 1);
WARN_ON_ONCE(!(nsp->idt_nmi_seq & 0x1));
@@ -590,6 +596,9 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
WRITE_ONCE(nsp->idt_nmi_seq, nsp->idt_nmi_seq + 1);
WARN_ON_ONCE(nsp->idt_nmi_seq & 0x1);
}
+
+ if (protect_pf_ist_stack)
+ install_nmi_pf_stack(false);
}
irqentry_nmi_exit(regs, irq_state);
diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c
index 24b48af27417..75b9f851f428 100644
--- a/arch/x86/lib/usercopy.c
+++ b/arch/x86/lib/usercopy.c
@@ -9,6 +9,7 @@
#include <linux/instrumented.h>
#include <asm/tlbflush.h>
+#include <asm/cpu_entry_area.h>
/**
* copy_from_user_nmi - NMI safe copy from user
@@ -39,6 +40,14 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
if (!nmi_uaccess_okay())
return n;
+ /*
+ * IST stacks aren't reentrant, so bail before the possibility of
+ * a #PF. While on the #PF IST stack, we should only need this
+ * function for stack dumps (WARN/panic/etc).
+ */
+ if (is_pf_ist_stack(current_stack_pointer))
+ return n;
+
/*
* Even though this function is typically called from NMI/IRQ context
* disable pagefaults so that its behaviour is consistent even when
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 575f863f3c75..97ac91c109ed 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -156,6 +156,12 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
cea_map_stack(DB);
cea_map_stack(MCE);
+ if (IS_ENABLED(CONFIG_DYNAMIC_STACK)) {
+ cea_map_stack(PF);
+ cea_map_stack(PF2);
+ cea_map_stack(UDI);
+ }
+
if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
cea_map_stack(VC);
@@ -173,6 +179,17 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
}
#endif
+#ifdef CONFIG_DYNAMIC_STACK
+bool noinstr is_pf_ist_stack(unsigned long addr)
+{
+ struct cea_exception_stacks *cs = __this_cpu_read(cea_exception_stacks);
+ unsigned long top = CEA_ESTACK_TOP(cs, PF2);
+ unsigned long bot = CEA_ESTACK_BOT(cs, PF);
+
+ return addr >= bot && addr < top;
+}
+#endif
+
/* Setup the fixmap mappings only once per-processor */
static void __init setup_cpu_entry_area(unsigned int cpu)
{
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 40d518d9f562..48ef50982c06 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1482,16 +1482,61 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
#ifdef CONFIG_DYNAMIC_STACK
-static noinstr unsigned long copy_stack_data(struct pt_regs *regs)
+static noinstr unsigned long copy_stack_data(struct pt_regs *regs, bool is_dynamic_stack_fault)
{
unsigned long new_sp;
unsigned long data_len;
+ bool must_avoid_dynamic_stack_fault;
- new_sp = regs->sp - (FRED_CONFIG_REDZONE_AMOUNT << 6);
- new_sp &= FRED_STACK_FRAME_RSP_MASK;
- data_len = sizeof(struct fred_frame);
+ if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+ new_sp = regs->sp - (FRED_CONFIG_REDZONE_AMOUNT << 6);
+ new_sp &= FRED_STACK_FRAME_RSP_MASK;
+ data_len = sizeof(struct fred_frame);
+ must_avoid_dynamic_stack_fault = false;
+ } else {
+ // Hardware aligns sp to a 16 byte boundary when going through the IDT.
+ new_sp = ALIGN_DOWN(regs->sp, 16);
+ data_len = sizeof(struct pt_regs);
+ must_avoid_dynamic_stack_fault = is_dynamic_stack_fault;
+ }
new_sp -= data_len;
+ if (must_avoid_dynamic_stack_fault) {
+ bool new_sp_on_stack;
+
+ /*
+ * We don't have to worry about the window where current_task
+ * is inconsistent during a context switch because interrupts
+ * are disabled during that window and the only #PF that can
+ * happen there is a dynamic stack fault, in which case we
+ * return directly from handle_dynamic_stack_kernel_faults().
+ */
+ if (!in_nmi())
+ dynamic_stack_fault(current, new_sp, &new_sp_on_stack);
+ else
+ new_sp_on_stack = false;
+
+ /*
+ * If new_sp isn't on the current task's stack, verify that it's
+ * on an exception/irq/entry stack. This is a little expensive,
+ * but #PFs in those contexts should be rare.
+ */
+ if (!new_sp_on_stack) {
+ struct stack_info info, info2;
+
+ if (!get_stack_info_noinstr((void *)new_sp, current, &info)) {
+ instrumentation_begin();
+ if (get_stack_info_noinstr((void *)(new_sp - PAGE_SIZE),
+ current, &info2)) {
+ pr_emerg("Stack overflow during stack switch\n");
+ handle_stack_overflow(regs, new_sp, &info2);
+ } else {
+ die("Stack switch back to unknown stack", regs, 0);
+ }
+ }
+ }
+ }
+
memcpy((void *)new_sp, regs, data_len);
return new_sp;
@@ -1499,7 +1544,7 @@ static noinstr unsigned long copy_stack_data(struct pt_regs *regs)
__visible noinstr unsigned long switch_to_kstack(struct pt_regs *regs)
{
- return copy_stack_data(regs);
+ return copy_stack_data(regs, false);
}
#define ALIGN_TO_STACK(addr) ((addr) & ~(THREAD_ALIGN - 1))
@@ -1510,7 +1555,7 @@ __visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_reg
struct task_struct *tsk;
bool on_stack;
- address = fred_event_data(regs);
+ address = cpu_feature_enabled(X86_FEATURE_FRED) ? fred_event_data(regs) : read_cr2();
if (fault_in_kernel_space(address) && !in_nmi()) {
tsk = task_from_stack_address(address);
@@ -1522,18 +1567,19 @@ __visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_reg
}
/*
- * The regular fault handler won't sleep when executing in an
- * atomic context, so we can complete the #PF directly on the
- * #PF stack.
+ * The regular fault handler won't sleep when executing in an atomic
+ * context, so we can complete the #PF directly on the #PF stack.
+ * However, IST doesn't support nested exceptions, so we need to avoid
+ * running any non-noinstr code on the IST #PF stack.
*/
- if (in_atomic())
+ if (in_atomic() && cpu_feature_enabled(X86_FEATURE_FRED))
return (unsigned long)regs;
else
- return copy_stack_data(regs);
+ return copy_stack_data(regs, true);
}
#endif
-DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
+DEFINE_IDTENTRY_PF(exc_page_fault)
{
irqentry_state_t state;
unsigned long address;
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
^ permalink raw reply related [flat|nested] 21+ messages in thread* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (12 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 13/13] x86: Add support for dynamic kernel stacks via IST David Stevens
@ 2026-04-24 19:41 ` Dave Hansen
2026-04-24 21:35 ` Pasha Tatashin
2026-04-25 9:19 ` H. Peter Anvin
13 siblings, 2 replies; 21+ messages in thread
From: Dave Hansen @ 2026-04-24 19:41 UTC (permalink / raw)
To: David Stevens, Pasha Tatashin, Linus Walleij, Will Deacon,
Quentin Perret, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski, Xin Li,
Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: linux-kernel, linux-mm
On 4/24/26 12:14, David Stevens wrote:
> The question is then: is this approach something that is fundamentally
> untenable in the kernel
Yes. Fundamentally untenable.
Not allowing stack faults has been a wonderful simplification. It's one
of those things that just plain makes the kernel easier to maintain.
Saving low single digits of system memory is not exactly making me eager
to go back to the harder-to-maintain days.
I seriously doubt that this 1% is the lowest hanging fruit for memory
bloat on these systems. ;)
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
@ 2026-04-24 21:35 ` Pasha Tatashin
2026-04-24 22:21 ` Dave Hansen
2026-04-24 22:26 ` David Laight
2026-04-25 9:19 ` H. Peter Anvin
1 sibling, 2 replies; 21+ messages in thread
From: Pasha Tatashin @ 2026-04-24 21:35 UTC (permalink / raw)
To: Dave Hansen
Cc: David Stevens, Pasha Tatashin, Linus Walleij, Will Deacon,
Quentin Perret, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski, Xin Li,
Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
linux-kernel, linux-mm
On 04-24 12:41, Dave Hansen wrote:
> On 4/24/26 12:14, David Stevens wrote:
> > The question is then: is this approach something that is fundamentally
> > untenable in the kernel
>
> Yes. Fundamentally untenable.
>
> Not allowing stack faults has been a wonderful simplification. It's one
> of those things that just plain makes the kernel easier to maintain.
> Saving low single digits of system memory is not exactly making me eager
> to go back to the harder-to-maintain days.
>
> I seriously doubt that this 1% is the lowest hanging fruit for memory
> bloat on these systems. ;)
This is true until, in a fleet of millions of machines, you encounter a
one-in-a-billion chance of a stack overflow. You are then forced to
double the statically allocated kernel stacks on every machine, paying a
memory tax even though 99.999..% of threads never exceed 4K. This
overhead accumulates to petabytes of wasted capacity.
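
(Back-of-envelope, with round numbers purely for illustration: doubling a
16K stack to 32K costs 16K of extra populated memory per thread; at 10^6
threads per machine that is ~16 GB per machine, and across 10^6 machines
roughly 16 PB.)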
Pasha
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 21:35 ` Pasha Tatashin
@ 2026-04-24 22:21 ` Dave Hansen
2026-04-24 22:49 ` David Stevens
2026-04-24 22:26 ` David Laight
1 sibling, 1 reply; 21+ messages in thread
From: Dave Hansen @ 2026-04-24 22:21 UTC (permalink / raw)
To: Pasha Tatashin
Cc: David Stevens, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
linux-kernel, linux-mm
On 4/24/26 14:35, Pasha Tatashin wrote:
> On 04-24 12:41, Dave Hansen wrote:
>> On 4/24/26 12:14, David Stevens wrote:
>>> The question is then: is this approach something that is fundamentally
>>> untenable in the kernel
>> Yes. Fundamentally untenable.
>>
>> Not allowing stack faults has been a wonderful simplification. It's one
>> of those things that just plain makes the kernel easier to maintain.
>> Saving low single digits of system memory is not exactly making me eager
>> to go back to the harder-to-maintain days.
>>
>> I seriously doubt that this 1% is the lowest hanging fruit for memory
>> bloat on these systems. 😉
> This is true until, in a fleet of millions of machines, you encounter a
> one-in-a-billion chance of a stack overflow. You are then forced to
> double the statically allocated kernel stacks on every machine, paying a
> memory tax even though 99.999..% of threads never exceed 4K. This
> overhead accumulates to petabytes of wasted capacity.
I don't disagree with you. But, at that point, you're picking your
poison: bugs from dynamic kernel stacks versus crashes from stack overflows.
At some point, I might be able to be talked into dynamic stack as a
FRED-only feature. But FRED isn't widespread enough to go to the trouble
today. I'm sure the folks who want this also don't want to wait until
all the devices in the field have FRED because that is even *longer* off.
So maybe this is one of those things that folks just need to deploy
out-of-tree for a couple of years, come back with some data to show us
that we were just paranoid, and we'll look at it again.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 22:21 ` Dave Hansen
@ 2026-04-24 22:49 ` David Stevens
0 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 22:49 UTC (permalink / raw)
To: Dave Hansen
Cc: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
linux-kernel, linux-mm
On Fri, Apr 24, 2026 at 3:21 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 4/24/26 14:35, Pasha Tatashin wrote:
> > On 04-24 12:41, Dave Hansen wrote:
> >> On 4/24/26 12:14, David Stevens wrote:
> >>> The question is then: is this approach something that is fundamentally
> >>> untenable in the kernel
> >> Yes. Fundamentally untenable.
> >>
> >> Not allowing stack faults has been a wonderful simplification. It's one
> >> of those things that just plain makes the kernel easier to maintain.
> >> Saving low single digits of system memory is not exactly making me eager
> >> to go back to the harder-to-maintain days.
> >>
> >> I seriously doubt that this 1% is the lowest hanging fruit for memory
> >> bloat on these systems. 😉
> > This is true until, in a fleet of millions of machines, you encounter a
> > one-in-a-billion chance of a stack overflow. You are then forced to
> > double the statically allocated kernel stacks on every machine, paying a
> > memory tax even though 99.999..% of threads never exceed 4K. This
> > overhead accumulates to petabytes of wasted capacity.
>
> I don't disagree with you. But, at that point, you're picking your
> poison: bugs from dynamic kernel stacks versus crashes from stack overflows.
>
> At some point, I might be able to be talked into dynamic stack as a
> FRED-only feature. But FRED isn't widespread enough to go to the trouble
> today. I'm sure the folks who want this also don't want to wait until
> all the devices in the field have FRED because that is even *longer* off.
Why does this need to be FRED only? True, the lack of reentrancy with
IST stacks complicates a few situations. That adds some complexity
beyond what's needed for FRED-only support, but the additional
complexity doesn't really seem like a hard blocker, at least if we
accept the complexity of kernel stack faults for FRED.
-David
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 21:35 ` Pasha Tatashin
2026-04-24 22:21 ` Dave Hansen
@ 2026-04-24 22:26 ` David Laight
2026-04-24 23:06 ` Pasha Tatashin
1 sibling, 1 reply; 21+ messages in thread
From: David Laight @ 2026-04-24 22:26 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Dave Hansen, David Stevens, Linus Walleij, Will Deacon,
Quentin Perret, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski, Xin Li,
Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
linux-kernel, linux-mm
On Fri, 24 Apr 2026 21:35:20 +0000
Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> On 04-24 12:41, Dave Hansen wrote:
> > On 4/24/26 12:14, David Stevens wrote:
> > > The question is then: is this approach something that is fundamentally
> > > untenable in the kernel
> >
> > Yes. Fundamentally untenable.
> >
> > Not allowing stack faults has been a wonderful simplification. It's one
> > of those things that just plain makes the kernel easier to maintain.
> > Saving low single digits of system memory is not exactly making me eager
> > to go back to the harder-to-maintain days.
> >
> > I seriously doubt that this 1% is the lowest hanging fruit for memory
> > bloat on these systems. ;)
>
> This is true until, in a fleet of millions of machines, you encounter a
> one-in-a-billion chance of a stack overflow. You are then forced to
> double the statically allocated kernel stacks on every machine, paying a
> memory tax even though 99.999..% of threads never exceed 4K. This
> overhead accumulates to petabytes of wasted capacity.
And then you hit a stack fault in some path where you can't sleep and
there isn't any available kernel memory.
An alternative idea is to arrange for some system calls to sleep in
userspace, so when the thread is woken it re-executes the system call.
It then makes sense to assign the kernel stack to the process when
it enters the kernel.
That might mean that you don't need a kernel stack for all the threads
sleeping in futex() - it might even be possible to do the retry in
userspace saving the second kernel entry most of the time.
It is all 'hard and difficult' though.
The easier solution is to rewrite the system code so it doesn't have
1000s of threads :-)
David
>
> Pasha
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 22:26 ` David Laight
@ 2026-04-24 23:06 ` Pasha Tatashin
0 siblings, 0 replies; 21+ messages in thread
From: Pasha Tatashin @ 2026-04-24 23:06 UTC (permalink / raw)
To: David Laight
Cc: Pasha Tatashin, Dave Hansen, David Stevens, Linus Walleij,
Will Deacon, Quentin Perret, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Andy Lutomirski, Xin Li, Peter Zijlstra, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Uladzislau Rezki, Kees Cook, linux-kernel, linux-mm, willy
On 04-24 23:26, David Laight wrote:
> On Fri, 24 Apr 2026 21:35:20 +0000
> Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> > On 04-24 12:41, Dave Hansen wrote:
> > > On 4/24/26 12:14, David Stevens wrote:
> > > > The question is then: is this approach something that is fundamentally
> > > > untenable in the kernel
> > >
> > > Yes. Fundamentally untenable.
> > >
> > > Not allowing stack faults has been a wonderful simplification. It's one
> > > of those things that just plain makes the kernel easier to maintain.
> > > Saving low single digits of system memory is not exactly making me eager
> > > to go back to the harder-to-maintain days.
> > >
> > > I seriously doubt that this 1% is the lowest hanging fruit for memory
> > > bloat on these systems. ;)
> >
> > This is true until, in a fleet of millions of machines, you encounter a
> > one-in-a-billion chance of a stack overflow. You are then forced to
> > double the statically allocated kernel stacks on every machine, paying a
> > memory tax even though 99.999..% of threads never exceed 4K. This
> > overhead accumulates to petabytes of wasted capacity.
>
> And then you hit a stack fault in some path where you can't sleep and
> there isn't any available kernel memory.
Well, at least if we hit this rare case, we can simply double a buffer
of pre-reserved stack memory per CPU. This still saves significant
memory compared to wasting it on every single thread.
> An alternative idea is to arrange for some system calls to sleep in
> userspace, so when the thread is woken it re-executes the system call.
> It then makes sense to assign the kernel stack to the process when
> it enters the kernel.
> That might mean that you don't need a kernel stack for all the threads
> sleeping in futex() - it might even be possible to do the retry in
> userspace saving the second kernel entry most of the time.
> It is all 'hard and difficult' though.
I was thinking about a similar approach as well—sort of multiplexing the
kernel stacks. But honestly, when trying to cover all the edge cases, I
didn't find it to be any better or easier than just using dynamic kernel
stacks.
An alternative approach, which was proposed at LSFMM by Willy, is to add
an explicit deep stack calls. When we enter a path that we know is
exceptionally deep, only then do we extend the stack, keeping the
default (say, 8K) everywhere else.
> The easier solution is to rewrite the system code so it doesn't have
> 1000s of threads :-)
That ship sailed in the early 90s of the previous millennium. Nowadays,
we have high end workstations with almost 200 hardware threads.
Rewriting system code to reduce thread counts simply isn't an option for
our storage machines, which have millions of threads per unit.
+CC Matthew Wilcox
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
2026-04-24 21:35 ` Pasha Tatashin
@ 2026-04-25 9:19 ` H. Peter Anvin
1 sibling, 0 replies; 21+ messages in thread
From: H. Peter Anvin @ 2026-04-25 9:19 UTC (permalink / raw)
To: Dave Hansen, David Stevens, Pasha Tatashin, Linus Walleij,
Will Deacon, Quentin Perret, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: linux-kernel, linux-mm
On 2026-04-24 12:41, Dave Hansen wrote:
> On 4/24/26 12:14, David Stevens wrote:
>> The question is then: is this approach something that is fundamentally
>> untenable in the kernel
>
> Yes. Fundamentally untenable.
>
> Not allowing stack faults has been a wonderful simplification. It's one
> of those things that just plain makes the kernel easier to maintain.
> Saving low single digits of system memory is not exactly making me eager
> to go back to the harder-to-maintain days.
>
> I seriously doubt that this 1% is the lowest hanging fruit for memory
> bloat on these systems. ;)
It is worth noting that this was one of the VERY early design decisions that
has shaped Linux from the beginning:
- No swapping of kernel memory
- Kernel stacks are statically allocated
- Physical RAM is mapped into the kernel at all times
- A "monolithic" kernel using function calls, not message passing
- A kernel interface that closely maps to the low-level application API
(e.g. each user space thread is a kernel thread.)
- Kernel ABIs and APIs are subject to evolution; stability is only guaranteed
in user space.
Those design decisions are, by and large, what has made Linux Linux: a
relatively simple, highly performant, and reliable system.
-hpa
^ permalink raw reply [flat|nested] 21+ messages in thread