[PATCH v2 00/13] Dynamic Kernel Stacks

* [PATCH v2 00/13] Dynamic Kernel Stacks
From: David Stevens @ 2026-04-24 19:14 UTC
  To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
  Cc: David Stevens, linux-kernel, linux-mm

This RFC is a continuation of Pasha Tatashin's original RFC [1], and is
based on Linus Walleij's rebased version of the patches [2]. My focus
was x86_64 devices, so I didn't include his arm64 WIP patches.

The impetus for reviving this RFC is kernel stack usage on Android. On
regular Android (i.e. non-wear/automotive), system processes typically
have 2000-3000 threads. When adding threads from app processes, this
means that systems with 4GB of memory are using 1-2% of total memory for
kernel thread stacks. Dynamic kernel stacks reduce this by 65%-70%.
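As a rough cross-check of the 1-2% figure, the arithmetic can be sketched as follows. The 16 KiB stack size is an assumption (the usual x86_64 THREAD_SIZE without KASAN), and the thread count used in the usage note is hypothetical; neither number comes from this cover letter.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 16 KiB kernel stacks (x86_64 THREAD_SIZE without KASAN). */
#define ASSUMED_THREAD_SIZE (16u * 1024u)

/* Bytes used by kernel stacks when every stack is fully populated. */
static uint64_t stack_bytes(uint64_t nr_threads)
{
	return nr_threads * ASSUMED_THREAD_SIZE;
}

/*
 * Share of total memory used by kernel stacks, in hundredths of a
 * percent (basis points), rounded down.
 */
static uint64_t stack_share_bp(uint64_t nr_threads, uint64_t total_bytes)
{
	assert(total_bytes > 0);
	return stack_bytes(nr_threads) * 10000u / total_bytes;
}
```

With a hypothetical 4096 threads on a 4 GiB system, stack_share_bp(4096, 4ULL << 30) gives 156, i.e. about 1.56% of memory, which falls inside the quoted range.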

The main change compared to Pasha's v1 RFC is how x86_64 handles kernel
stack faults. On systems where FRED is available, the series handles kernel
page faults on stack level 1. When FRED isn't available, it uses a dedicated
IST stack for page faults. In both cases, page faults which aren't
dynamic stack faults are moved back onto the regular kernel stack. This
does introduce some overhead for page faults on user memory that
originate in the kernel (note that non-FRED systems already needed to
bounce userspace page faults through the entry stack), but such faults
aren't as hot a path as regular user page faults. There are certainly
systems where the memory savings are worth the overhead. That said, the
config could be made optional to give systems the option to pay the
memory cost to avoid the CPU overhead.
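The routing decision described above can be modelled in user space as below. This is a simplified sketch, not the series' code: the names are hypothetical, the stack size is an assumed 16 KiB, and the real handler has to consider more than the faulting address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 16 KiB kernel stacks. */
#define ASSUMED_THREAD_SIZE (16u * 1024u)

/* Hypothetical outcomes for a page fault taken in kernel mode. */
enum fault_route {
	ROUTE_GROW_STACK,	/* handle on the dedicated fault stack */
	ROUTE_REGULAR_STACK,	/* move back to the regular kernel stack */
};

/* True if addr lies inside [stack_base, stack_base + THREAD_SIZE). */
static bool addr_on_stack(uintptr_t stack_base, uintptr_t addr)
{
	return addr >= stack_base &&
	       addr - stack_base < ASSUMED_THREAD_SIZE;
}

/*
 * Simplification: any fault on the current task's stack range is
 * treated as a dynamic stack fault; everything else is bounced.
 */
static enum fault_route route_kernel_fault(uintptr_t stack_base,
					   uintptr_t fault_addr)
{
	return addr_on_stack(stack_base, fault_addr) ?
		ROUTE_GROW_STACK : ROUTE_REGULAR_STACK;
}
```

Only the second outcome pays the extra cost of switching stacks, which is the overhead the paragraph above refers to.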

The biggest open issue is how to deal with reliability. This series uses
GFP_ATOMIC when refilling the per-CPU magazines during context switch,
which is necessary to avoid deadlock. This of course raises concerns
about allocation failure. If a magazine were depleted, the refill then
failed because the atomic reserves were exhausted, and another thread then
triggered a dynamic stack fault, the result would be a fatal page fault.
There is also a secondary concern about additional pressure on the
memory reserves causing allocation failures at other atomic call sites.
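The failure sequence can be illustrated with a small user-space model. Everything here is hypothetical (the magazine capacity, the names, and the integer that stands in for the atomic reserves); it only shows the order of events that leads to the fatal case.

```c
#include <stdbool.h>

/* Hypothetical per-CPU magazine capacity. */
#define MAG_CAPACITY 4

struct magazine {
	int pages;	/* pre-allocated pages ready for stack faults */
};

/*
 * Context switch: top the magazine up. 'reserve' models the pages an
 * atomic allocation can still obtain; at zero, the refill fails.
 */
static void magazine_refill(struct magazine *mag, int *reserve)
{
	while (mag->pages < MAG_CAPACITY && *reserve > 0) {
		(*reserve)--;
		mag->pages++;
	}
}

/* Stack fault: an empty magazine models the fatal page fault. */
static bool magazine_take(struct magazine *mag)
{
	if (mag->pages == 0)
		return false;
	mag->pages--;
	return true;
}

/*
 * One thread drains the magazine, the refill runs against 'reserve',
 * then the next thread faults. Returns whether that fault is served.
 */
static bool depletion_scenario_survives(int reserve)
{
	struct magazine mag = { .pages = MAG_CAPACITY };
	int i;

	for (i = 0; i < MAG_CAPACITY; i++)
		if (!magazine_take(&mag))
			return false;
	magazine_refill(&mag, &reserve);
	return magazine_take(&mag);
}
```

In this model depletion_scenario_survives(0) is false while depletion_scenario_survives(1) is true: the fatal case needs both a drained magazine and an exhausted reserve.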

The question is then: is this approach something that is fundamentally
untenable in the kernel, or are there compromises that would allow it to
be merged? One obvious compromise is to make the feature optional. Both
kernel stack faults and running out of memory reserves are rare events.
I've never seen this failure in my testing, although I don't have field
data to back that up at this point. Some sysadmins may view it as low
enough risk to be worth the memory savings. There are also additional
measures that could be taken to reduce the likelihood of failure (e.g.
magazine management on kernel entry/exit, tunable magazine sizes, adding
best-effort trylock reclaim or oom kill).

This series was developed and tested on devices running 6.18 kernels. It
has been rebased onto 7.0, with minimal smoke testing after rebasing.

[1] https://lore.kernel.org/all/20240311164638.2015063-1-pasha.tatashin@soleen.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-integrator.git/log/?h=b4/aarch64-dynamic-kernel-stacks-v6.18-rc1

David Stevens (7):
  fork: Don't assume fully populated stack during reuse
  fork: Move vm_stack to the beginning of the stack
  fork: Move vmap stack freeing to work queue
  fork: Store task pointer in unpopulated stack ptes
  x86/entry/fred: encode frame pointer on entry
  x86: Add support for dynamic kernel stacks via FRED
  x86: Add support for dynamic kernel stacks via IST

Pasha Tatashin (6):
  fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
  fork: separate vmap stack allocation and free calls
  mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public
    functions
  fork: Dynamic Kernel Stacks
  task_stack.h: Add stack_not_used() support for dynamic stack
  fork: Dynamic Kernel Stack accounting

 arch/Kconfig                          |  38 ++
 arch/x86/Kconfig                      |   1 +
 arch/x86/entry/entry_64.S             |  49 ++-
 arch/x86/entry/entry_64_fred.S        |  57 +++
 arch/x86/include/asm/cpu_entry_area.h |  18 +
 arch/x86/include/asm/idtentry.h       |  38 +-
 arch/x86/include/asm/page_64_types.h  |  10 +-
 arch/x86/include/asm/pgtable_64.h     |  36 ++
 arch/x86/include/asm/processor.h      |   6 +
 arch/x86/include/asm/traps.h          |   5 +
 arch/x86/kernel/cpu/common.c          |  11 +
 arch/x86/kernel/dumpstack_64.c        |  10 +-
 arch/x86/kernel/fred.c                |  20 +-
 arch/x86/kernel/idt.c                 |  57 +--
 arch/x86/kernel/nmi.c                 |   9 +
 arch/x86/lib/usercopy.c               |   9 +
 arch/x86/mm/cpu_entry_area.c          |  17 +
 arch/x86/mm/dump_pagetables.c         |  14 +-
 arch/x86/mm/fault.c                   | 101 +++++-
 include/linux/mmzone.h                |   3 +
 include/linux/sched.h                 |  11 +-
 include/linux/sched/task_stack.h      |  48 ++-
 include/linux/vmalloc.h               |  14 +
 init/init_task.c                      |   4 +
 kernel/exit.c                         |  22 ++
 kernel/fork.c                         | 481 ++++++++++++++++++++++++--
 kernel/sched/core.c                   |   1 +
 mm/memcontrol.c                       |  10 +
 mm/vmalloc.c                          |  27 +-
 mm/vmstat.c                           |   3 +
 30 files changed, 1049 insertions(+), 81 deletions(-)

base-commit: 028ef9c96e96197026887c0f092424679298aae8
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog


* [PATCH v2 01/13] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
From: David Stevens @ 2026-04-24 19:14 UTC

From: Pasha Tatashin <pasha.tatashin@soleen.com>

In many places the number of pages in the stack is determined via
(THREAD_SIZE / PAGE_SIZE). There is also a BUG_ON() that ensures that
(THREAD_SIZE / PAGE_SIZE) is indeed equal to vm_area->nr_pages.

However, with dynamic stacks, the number of pages in vm_area will grow
with the stack. Therefore, use vm_area->nr_pages to determine the actual
number of pages allocated in the stack.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Rebased, also skipped intermediary helper variable nr_pages]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David Stevens <stevensd@google.com>
---
 kernel/fork.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index bc2bf58b93b6..8961b895bf05 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -312,9 +312,7 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm_area)
 	int ret;
 	int nr_charged = 0;
 
-	BUG_ON(vm_area->nr_pages != THREAD_SIZE / PAGE_SIZE);
-
-	for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+	for (i = 0; i < vm_area->nr_pages; i++) {
 		ret = memcg_kmem_charge_page(vm_area->pages[i], GFP_KERNEL, 0);
 		if (ret)
 			goto err;
@@ -484,7 +482,7 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
 		struct vm_struct *vm_area = task_stack_vm_area(tsk);
 		int i;
 
-		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+		for (i = 0; i < vm_area->nr_pages; i++)
 			mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
 					      account * (PAGE_SIZE / 1024));
 	} else {
@@ -505,7 +503,7 @@ void exit_task_stack_account(struct task_struct *tsk)
 		int i;
 
 		vm_area = task_stack_vm_area(tsk);
-		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+		for (i = 0; i < vm_area->nr_pages; i++)
 			memcg_kmem_uncharge_page(vm_area->pages[i], 0);
 	}
 }
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog
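
The effect of bounding the loops by nr_pages can be shown with a user-space model. The struct and helpers below are hypothetical stand-ins, with an assumed four-page stack; they are not kernel APIs.

```c
#include <assert.h>
#include <string.h>

/* Assumptions for the model: 4 KiB pages, four-page stacks. */
#define MODEL_PAGE_SIZE   4096u
#define MODEL_THREAD_SIZE (4u * MODEL_PAGE_SIZE)
#define MODEL_MAX_PAGES   (MODEL_THREAD_SIZE / MODEL_PAGE_SIZE)

/* Stand-in for the vm_struct fields the loops use. */
struct vm_area_model {
	unsigned int nr_pages;		/* pages currently populated */
	int charged[MODEL_MAX_PAGES];	/* 1 once a page is charged */
};

/* Charge only the populated pages, as the patched loops do. */
static unsigned int charge_populated_pages(struct vm_area_model *area)
{
	unsigned int i;

	assert(area->nr_pages <= MODEL_MAX_PAGES);
	for (i = 0; i < area->nr_pages; i++)
		area->charged[i] = 1;
	return i;
}

/* Build an area with nr_pages populated and count charged slots. */
static unsigned int charged_after(unsigned int nr_pages)
{
	struct vm_area_model area;
	unsigned int i, total = 0;

	memset(&area, 0, sizeof(area));
	area.nr_pages = nr_pages;
	charge_populated_pages(&area);
	for (i = 0; i < MODEL_MAX_PAGES; i++)
		total += area.charged[i];
	return total;
}
```

A loop bounded by THREAD_SIZE / PAGE_SIZE would instead visit all four slots even when only one page is populated.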




* [PATCH v2 02/13] fork: Don't assume fully populated stack during reuse
From: David Stevens @ 2026-04-24 19:14 UTC

In preparation for dynamic kernel stacks, don't assume that
vm_area->nr_pages matches THREAD_SIZE when clearing a stack for reuse.

Signed-off-by: David Stevens <stevensd@google.com>
---
 kernel/fork.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 8961b895bf05..50772c0cc5da 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -332,6 +332,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 
 	vm_area = alloc_thread_stack_node_from_cache(tsk, node);
 	if (vm_area) {
+		unsigned long memset_offset = 0;
+
 		if (memcg_charge_kernel_stack(vm_area)) {
 			vfree(vm_area->addr);
 			return -ENOMEM;
@@ -343,7 +345,9 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 		stack = kasan_reset_tag(vm_area->addr);
 
 		/* Clear stale pointers from reused stack. */
-		memset(stack, 0, THREAD_SIZE);
+		if (!IS_ENABLED(CONFIG_STACK_GROWSUP))
+			memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;
+		memset(stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
 
 		tsk->stack_vm_area = vm_area;
 		tsk->stack = stack;
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog
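
The offset computation in this patch can be checked in isolation. The sketch below models a four-page stack with 4 KiB pages (both assumptions) as a plain byte buffer; the helper names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Assumptions for the model: 4 KiB pages, four-page stacks. */
#define MODEL_PAGE_SIZE   4096u
#define MODEL_THREAD_SIZE (4u * MODEL_PAGE_SIZE)

/*
 * Offset of the populated region from the lowest stack address.
 * A downward-growing stack is populated from the top, so the
 * populated pages sit at the end of the area.
 */
static size_t reuse_memset_offset(size_t nr_pages, int grows_up)
{
	if (grows_up)
		return 0;
	return MODEL_THREAD_SIZE - nr_pages * MODEL_PAGE_SIZE;
}

/*
 * Fill a model stack with a marker, clear only the populated part,
 * and report whether exactly that part was cleared.
 */
static int clears_only_populated(size_t nr_pages, int grows_up)
{
	static unsigned char stack[MODEL_THREAD_SIZE];
	size_t off = reuse_memset_offset(nr_pages, grows_up);
	size_t len = nr_pages * MODEL_PAGE_SIZE;
	size_t i;

	memset(stack, 0xAA, sizeof(stack));
	memset(stack + off, 0, len);
	for (i = 0; i < sizeof(stack); i++) {
		int inside = i >= off && i < off + len;

		if (stack[i] != (inside ? 0 : 0xAA))
			return 0;
	}
	return 1;
}
```

For one populated page on a downward-growing stack the offset is 12288, so only the top page is cleared and the unpopulated range is never touched.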




* [PATCH v2 03/13] fork: Move vm_stack to the beginning of the stack
From: David Stevens @ 2026-04-24 19:14 UTC

The vm_stack struct used to free stacks via an RCU callback is stored
directly in the stack being freed. Make sure it's stored at the
beginning of the stack regardless of stack growth direction, to avoid
faults on partially allocated dynamic stacks.

Signed-off-by: David Stevens <stevensd@google.com>
---
 kernel/fork.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 50772c0cc5da..72c081db492c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -282,7 +282,12 @@ static void thread_stack_free_rcu(struct rcu_head *rh)
 
 static void thread_stack_delayed_free(struct task_struct *tsk)
 {
-	struct vm_stack *vm_stack = tsk->stack;
+	struct vm_stack *vm_stack;
+
+	if (IS_ENABLED(CONFIG_STACK_GROWSUP))
+		vm_stack = tsk->stack;
+	else
+		vm_stack = tsk->stack + THREAD_SIZE - sizeof(*vm_stack);
 
 	vm_stack->stack_vm_area = tsk->stack_vm_area;
 	call_rcu(&vm_stack->rcu, thread_stack_free_rcu);
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog
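
The placement rule can be checked with plain address arithmetic. The struct below is a hypothetical stand-in for struct vm_stack at this point in the series (an rcu_head, modelled as two pointers, plus the vm_struct pointer), and the page and stack sizes are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumptions for the model: 4 KiB pages, four-page stacks. */
#define MODEL_PAGE_SIZE   4096u
#define MODEL_THREAD_SIZE (4u * MODEL_PAGE_SIZE)

/* Stand-in for struct vm_stack: rcu_head (two pointers) + vm area. */
struct vm_stack_model {
	void *rcu_next;
	void (*rcu_func)(void *);
	void *stack_vm_area;
};

/*
 * Address at which the free-tracking struct is stored. For a
 * downward-growing stack the "beginning" is the highest address,
 * so the struct goes at the very end of the area.
 */
static uintptr_t vm_stack_addr(uintptr_t stack, int grows_up)
{
	if (grows_up)
		return stack;
	return stack + MODEL_THREAD_SIZE - sizeof(struct vm_stack_model);
}

/* Index of the page that contains addr, counted from the base. */
static size_t page_index(uintptr_t stack, uintptr_t addr)
{
	return (addr - stack) / MODEL_PAGE_SIZE;
}
```

With a downward-growing stack the struct lands in page 3 of 0..3, the page the stack uses first, rather than in page 0, which may be unpopulated on a dynamic stack.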




* [PATCH v2 04/13] fork: separate vmap stack allocation and free calls
From: David Stevens @ 2026-04-24 19:14 UTC

From: Pasha Tatashin <pasha.tatashin@soleen.com>

In preparation for dynamic stacks, separate out the __vmalloc_node()
and vfree() calls from the vmap based stack allocations. The dynamic
stacks will use their own variants of these functions.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Fix a bug in original patch: free_vmap_stack(vm_area->addr)]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[Add missing free_vmap_stack conversion, fix typos, rebase]
Signed-off-by: David Stevens <stevensd@google.com>
---
 kernel/fork.c | 40 ++++++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 72c081db492c..8bf32815f422 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -269,6 +269,21 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
 	return false;
 }
 
+static inline struct vm_struct *alloc_vmap_stack(int node)
+{
+	void *stack;
+
+	stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, GFP_VMAP_STACK,
+			       node, __builtin_return_address(0));
+
+	return stack ? find_vm_area(stack) : NULL;
+}
+
+static inline void free_vmap_stack(struct vm_struct *vm_area)
+{
+	vfree(vm_area->addr);
+}
+
 static void thread_stack_free_rcu(struct rcu_head *rh)
 {
 	struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
@@ -277,7 +292,7 @@ static void thread_stack_free_rcu(struct rcu_head *rh)
 	if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
 		return;
 
-	vfree(vm_area->addr);
+	free_vmap_stack(vm_area);
 }
 
 static void thread_stack_delayed_free(struct task_struct *tsk)
@@ -304,7 +319,7 @@ static int free_vm_stack_cache(unsigned int cpu)
 		if (!vm_area)
 			continue;
 
-		vfree(vm_area->addr);
+		free_vmap_stack(vm_area);
 		cached_vm_stack_areas[i] = NULL;
 	}
 
@@ -333,41 +348,35 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm_area)
 static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 	struct vm_struct *vm_area;
-	void *stack;
 
 	vm_area = alloc_thread_stack_node_from_cache(tsk, node);
 	if (vm_area) {
 		unsigned long memset_offset = 0;
 
 		if (memcg_charge_kernel_stack(vm_area)) {
-			vfree(vm_area->addr);
+			free_vmap_stack(vm_area);
 			return -ENOMEM;
 		}
 
 		/* Reset stack metadata. */
 		kasan_unpoison_range(vm_area->addr, THREAD_SIZE);
-
-		stack = kasan_reset_tag(vm_area->addr);
+		tsk->stack = kasan_reset_tag(vm_area->addr);
 
 		/* Clear stale pointers from reused stack. */
 		if (!IS_ENABLED(CONFIG_STACK_GROWSUP))
 			memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;
-		memset(stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
+		memset(tsk->stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
 
 		tsk->stack_vm_area = vm_area;
-		tsk->stack = stack;
 		return 0;
 	}
 
-	stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN,
-				     GFP_VMAP_STACK,
-				     node, __builtin_return_address(0));
-	if (!stack)
+	vm_area = alloc_vmap_stack(node);
+	if (!vm_area)
 		return -ENOMEM;
 
-	vm_area = find_vm_area(stack);
 	if (memcg_charge_kernel_stack(vm_area)) {
-		vfree(stack);
+		free_vmap_stack(vm_area);
 		return -ENOMEM;
 	}
 	/*
@@ -376,8 +385,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 	 * so cache the vm_struct.
 	 */
 	tsk->stack_vm_area = vm_area;
-	stack = kasan_reset_tag(stack);
-	tsk->stack = stack;
+	tsk->stack = kasan_reset_tag(vm_area->addr);
 	return 0;
 }
 
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog




* [PATCH v2 05/13] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public functions
From: David Stevens @ 2026-04-24 19:14 UTC

From: Pasha Tatashin <pasha.tatashin@soleen.com>

get_vm_area_node()
Unlike the other public get_vm_area_* variants, this one accepts the node
from which to allocate the data structure, and also the alignment, which
allows creating a vm area with a specific alignment.

This call is going to be used by dynamic stacks in order to ensure that
the stack VM area has a specific alignment, and that even if there is
only one page mapped, no page table allocations are going to be needed
to map the other stack pages.

vmap_pages_range()
We will need it from kernel/fork.c in order to map the initial stack
pages, so export the function and add a forward declaration of this
function to the linux/vmalloc.h header.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[Switched to vmap_pages_range instead of noflush variant, fix typos]
Signed-off-by: David Stevens <stevensd@google.com>
---
 include/linux/vmalloc.h | 14 ++++++++++++++
 mm/vmalloc.c            | 25 +++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index e8e94f90d686..7b56a0b998ab 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -250,6 +250,9 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size,
 					unsigned long flags,
 					unsigned long start, unsigned long end,
 					const void *caller);
+struct vm_struct *get_vm_area_node(unsigned long size, unsigned long align,
+				   unsigned long flags, int node, gfp_t gfp,
+				   const void *caller);
 void free_vm_area(struct vm_struct *area);
 extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
@@ -301,11 +304,22 @@ static inline void set_vm_flush_reset_perms(void *addr)
 	if (vm)
 		vm->flags |= VM_FLUSH_RESET_PERMS;
 }
+
+int __must_check vmap_pages_range(unsigned long addr, unsigned long end,
+				  pgprot_t prot, struct page **pages, unsigned int page_shift);
+
 #else  /* !CONFIG_MMU */
 #define VMALLOC_TOTAL 0UL
 
 static inline unsigned long vmalloc_nr_pages(void) { return 0; }
 static inline void set_vm_flush_reset_perms(void *addr) {}
+static inline
+int __must_check vmap_pages_range(unsigned long addr, unsigned long end,
+				  pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+	return -EINVAL;
+}
+
 #endif /* CONFIG_MMU */
 
 #if defined(CONFIG_MMU) && defined(CONFIG_SMP)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 61caa55a4402..39b7e118cbce 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -722,6 +722,7 @@ int vmap_pages_range(unsigned long addr, unsigned long end,
 {
 	return __vmap_pages_range(addr, end, prot, pages, page_shift, GFP_KERNEL);
 }
+EXPORT_SYMBOL_GPL(vmap_pages_range);
 
 static int check_sparse_vm_area(struct vm_struct *area, unsigned long start,
 				unsigned long end)
@@ -3285,6 +3286,30 @@ struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags,
 				  NUMA_NO_NODE, GFP_KERNEL, caller);
 }
 
+/**
+ * get_vm_area_node - reserve a contiguous and aligned kernel virtual area
+ * @size:	 size of the area
+ * @align:	 alignment of the start address of the area
+ * @flags:	 %VM_IOREMAP for I/O mappings
+ * @node:	 NUMA node from which to allocate the area data structure
+ * @gfp:	 Flags to pass to the allocator
+ * @caller:	 Caller to be stored in the vm area data structure
+ *
+ * Search for an area of @size/align in the kernel virtual mapping area and
+ * reserve it for our purposes. Returns the area descriptor on success or %NULL
+ * on failure.
+ *
+ * Return: the area descriptor on success or %NULL on failure.
+ */
+struct vm_struct *get_vm_area_node(unsigned long size, unsigned long align,
+				   unsigned long flags, int node, gfp_t gfp,
+				   const void *caller)
+{
+	return __get_vm_area_node(size, align, PAGE_SHIFT, flags,
+				  VMALLOC_START, VMALLOC_END,
+				  node, gfp, caller);
+}
+
 /**
  * find_vm_area - find a continuous kernel virtual area
  * @addr:	  base address
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog
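
One reading of the alignment requirement in the commit message is that a naturally aligned stack area never straddles two last-level page tables, so mapping the first page already provides the table needed for the rest. The sketch below checks that property under stated assumptions: 4 KiB pages and 512-entry PTE tables (2 MiB per table, as on x86_64). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions: 4 KiB pages, 512 PTEs per table => 2 MiB per table. */
#define MODEL_PAGE_SIZE      4096ull
#define MODEL_PTE_TABLE_SPAN (512ull * MODEL_PAGE_SIZE)

/* True if [start, start + size) is covered by a single PTE table. */
static bool fits_one_pte_table(uint64_t start, uint64_t size)
{
	assert(size > 0);
	return start / MODEL_PTE_TABLE_SPAN ==
	       (start + size - 1) / MODEL_PTE_TABLE_SPAN;
}

/* Round addr up to a power-of-two alignment. */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}
```

A 16 KiB area placed at 0x1fe000 crosses the 2 MiB boundary, while the same area aligned up to 16 KiB does not.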




* [PATCH v2 06/13] fork: Move vmap stack freeing to work queue
From: David Stevens @ 2026-04-24 19:14 UTC

For vmap stacks not immediately released into the stack cache, free them
in a workqueue instead of via call_rcu(). In an RCU context, vfree
already schedules the actual freeing on the per-cpu system workqueue, so
this change only affects when exactly the second attempt to put the
stack into the stack cache occurs.

Moving freeing to a workqueue will allow for freeing dynamic stacks in a
sleepable context (for remove_vm_area), rather than relying on vfree
dispatching to a workqueue via vfree_atomic.

Signed-off-by: David Stevens <stevensd@google.com>
---
 kernel/fork.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 8bf32815f422..01e0bf4f4b02 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -205,7 +205,7 @@ static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
 #define GFP_VMAP_STACK (GFP_KERNEL | __GFP_ZERO)
 
 struct vm_stack {
-	struct rcu_head rcu;
+	struct rcu_work work;
 	struct vm_struct *stack_vm_area;
 };
 
@@ -284,9 +284,9 @@ static inline void free_vmap_stack(struct vm_struct *vm_area)
 	vfree(vm_area->addr);
 }
 
-static void thread_stack_free_rcu(struct rcu_head *rh)
+static void thread_stack_free_work(struct work_struct *work)
 {
-	struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
+	struct vm_stack *vm_stack = container_of(to_rcu_work(work), struct vm_stack, work);
 	struct vm_struct *vm_area = vm_stack->stack_vm_area;
 
 	if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
@@ -305,7 +305,8 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
 		vm_stack = tsk->stack + THREAD_SIZE - sizeof(*vm_stack);
 
 	vm_stack->stack_vm_area = tsk->stack_vm_area;
-	call_rcu(&vm_stack->rcu, thread_stack_free_rcu);
+	INIT_RCU_WORK(&vm_stack->work, thread_stack_free_work);
+	queue_rcu_work(system_wq, &vm_stack->work);
 }
 
 static int free_vm_stack_cache(unsigned int cpu)
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog




* [PATCH v2 07/13] fork: Dynamic Kernel Stacks
From: David Stevens @ 2026-04-24 19:14 UTC

From: Pasha Tatashin <pasha.tatashin@soleen.com>

The core implementation of dynamic kernel stacks.

Unlike traditional kernel stacks, these stacks auto-grow as they are
used. This saves a significant amount of memory in fleet environments.
Also, the default kernel stack size could potentially be increased to
prevent stack overflows without compromising on the overall memory
overhead.

The dynamic kernel stacks interface provides two global functions:

1. dynamic_stack_fault().
Architectures that support dynamic kernel stacks must call this function
in order to handle a fault in the stack.

It allocates and maps new pages into the stack. The pages are
maintained in a per-cpu data structure.

2. dynamic_stack()
Must be called when a thread leaves the CPU, to check whether the thread
has allocated dynamic stack pages, i.e. whether PF_DYNAMIC_STACK is set in
tsk->flags. If it is, two things need to be done:
  a. Charge the thread for the allocated stack pages.
  b. Refill the per-cpu array so the next thread can also fault.
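
The division of labour between the two functions can be sketched as a user-space model. Only the PF_DYNAMIC_STACK value is taken from this patch; the structs, the cache size, and the decision to clear the flag after accounting are assumptions of the model.

```c
#include <stdbool.h>

#define MODEL_PF_DYNAMIC_STACK 0x00800000u	/* value from this patch */
#define MODEL_CACHE_PAGES 2			/* hypothetical cache size */

struct cpu_model {
	int cached_pages;	/* per-cpu pages ready for stack faults */
};

struct task_model {
	unsigned int flags;
	int mapped_pages;	/* pages currently mapped in the stack */
	int charged_pages;	/* pages already accounted to the task */
};

/* Fault path: take a cached page, map it, and flag the task. */
static bool model_stack_fault(struct cpu_model *cpu, struct task_model *tsk)
{
	if (cpu->cached_pages == 0)
		return false;
	cpu->cached_pages--;
	tsk->mapped_pages++;
	tsk->flags |= MODEL_PF_DYNAMIC_STACK;
	return true;
}

/* Switch-out path: (a) charge the new pages, (b) refill the cache. */
static void model_switch_out(struct cpu_model *cpu, struct task_model *tsk)
{
	if (!(tsk->flags & MODEL_PF_DYNAMIC_STACK))
		return;
	tsk->charged_pages = tsk->mapped_pages;
	cpu->cached_pages = MODEL_CACHE_PAGES;
	tsk->flags &= ~MODEL_PF_DYNAMIC_STACK;	/* assumed by the model */
}

/* Run one fault followed by one switch-out; 1 if state is consistent. */
static int fault_then_switch_is_consistent(void)
{
	struct cpu_model cpu = { .cached_pages = MODEL_CACHE_PAGES };
	struct task_model tsk = { 0, 1, 1 };

	if (!model_stack_fault(&cpu, &tsk))
		return 0;
	if (cpu.cached_pages != 1 || !(tsk.flags & MODEL_PF_DYNAMIC_STACK))
		return 0;
	model_switch_out(&cpu, &tsk);
	return tsk.charged_pages == 2 && cpu.cached_pages == 2 &&
	       tsk.flags == 0;
}
```

The fault path does only the minimum needed to continue running; the slower accounting and refill work is deferred to the point where the thread leaves the CPU.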

Dynamic kernel stacks do not support "STACK_END_MAGIC", as the last
page does not have to be faulted in. However, since they are based on
vmap stacks, the guard pages always protect the dynamic kernel stacks
from overflow.

The average stack depth of a kernel thread depends on the workload,
profiling, virtualization, compiler optimizations, and driver
implementations.

Therefore, the numbers should be tested for a specific workload. In my
tests I found the following values on freshly booted, idle machines:

CPU            #Cores  #Stacks  Regular (KB)  Dynamic (KB)
AMD Genoa         384     5786         92576         23388
Intel Skylake     112     3182         50912         12860
AMD Rome          128     3401         54416         14784
AMD Rome          256     4908         78528         20876
Intel Haswell      72     2644         42304         10624

On all machines dynamic kernel stacks take roughly 25-27% of the original
stack memory. Only 5% of active tasks performed a stack page fault in
their life cycles.
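
The ratios can be recomputed from the table above. The sketch below copies the five rows as data and derives two quantities: the per-stack size of the regular configuration and the dynamic share in tenths of a percent. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the measurements table above. */
struct stack_sample {
	const char *cpu;
	unsigned int cores, stacks, regular_kb, dynamic_kb;
};

static const struct stack_sample samples[] = {
	{ "AMD Genoa",     384, 5786, 92576, 23388 },
	{ "Intel Skylake", 112, 3182, 50912, 12860 },
	{ "AMD Rome",      128, 3401, 54416, 14784 },
	{ "AMD Rome",      256, 4908, 78528, 20876 },
	{ "Intel Haswell",  72, 2644, 42304, 10624 },
};

#define NR_SAMPLES (sizeof(samples) / sizeof(samples[0]))

/* KB of regular stack memory per stack. */
static unsigned int regular_kb_per_stack(size_t idx)
{
	assert(idx < NR_SAMPLES);
	return samples[idx].regular_kb / samples[idx].stacks;
}

/* Dynamic usage as a share of regular usage, in tenths of a percent. */
static unsigned int dynamic_share_permille(size_t idx)
{
	assert(idx < NR_SAMPLES);
	return samples[idx].dynamic_kb * 1000u / samples[idx].regular_kb;
}
```

Every row works out to exactly 16 KB per regular stack, and the dynamic share ranges from 25.1% (Haswell) to 27.1% (128-core Rome).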

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Rebased, used vm_area->nr_pages directly in one instance]
[Depends on !PREEMPT_RT]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[Fix races around accounting]
[Use GFP_ATOMIC when executing in the scheduler]
[Depend on INIT_STACK_ALL_* config]
[Fix bugs in some error paths and edge cases]
[Don't cache partially faulted stacks]
[Added out-var to tell if address is on target stack]
Signed-off-by: David Stevens <stevensd@google.com>
---
 arch/Kconfig                     |  39 ++++
 include/linux/sched.h            |  11 +-
 include/linux/sched/task_stack.h |  47 +++-
 init/init_task.c                 |   4 +
 kernel/fork.c                    | 357 +++++++++++++++++++++++++++++--
 kernel/sched/core.c              |   1 +
 6 files changed, 439 insertions(+), 20 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 102ddbd4298e..95ded79f0825 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1515,6 +1515,45 @@ config VMAP_STACK
 	  backing virtual mappings with real shadow memory, and KASAN_VMALLOC
 	  must be enabled.
 
+config HAVE_ARCH_DYNAMIC_STACK
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stacks
+	  that grow dynamically.
+
+	  - Arch must have support for HAVE_ARCH_VMAP_STACK, in order to handle
+	    stack related page faults.
+
+	  - Arch must be able to fault from interrupt context.
+
+	  - Arch must allow the kernel to handle stack faults gracefully, even
+	    during interrupt handling.
+
+	  - Exceptions such as no pages being available should be handled in a
+	    consistent and predictable way, i.e. the same way as a stack
+	    overflow that touches the guard pages, with extra information
+	    about the allocation error.
+
+config DYNAMIC_STACK
+	default y
+	bool "Dynamically grow kernel stacks"
+	depends on THREAD_INFO_IN_TASK
+	depends on HAVE_ARCH_DYNAMIC_STACK
+	depends on VMAP_STACK
+	depends on INIT_STACK_ALL_ZERO || INIT_STACK_ALL_PATTERN
+	depends on !KASAN
+	depends on !DEBUG_STACK_USAGE
+	depends on !STACK_GROWSUP
+	depends on !PREEMPT_RT
+	help
+	  Dynamic kernel stacks save memory on machines with a lot of threads
+	  by starting with small stacks and growing them only when needed.
+	  On workloads where most stacks never grow deeper than one page,
+	  the memory savings can be substantial. The feature requires virtually
+	  mapped kernel stacks in order to handle page faults. It requires stack
+	  initialization to preclude one thread from faulting on another thread's
+	  stack.
+
 config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
 	def_bool n
 	help
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5a5d3dbc9cdf..7aa06233afd5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -836,7 +836,11 @@ struct task_struct {
 	 */
 	randomized_struct_fields_start
 
+#ifdef CONFIG_DYNAMIC_STACK
+	unsigned long			packed_stack;
+#else
 	void				*stack;
+#endif
 	refcount_t			usage;
 	/* Per task flags (PF_*), defined further below: */
 	unsigned int			flags;
@@ -1563,6 +1567,11 @@ struct task_struct {
 	struct timer_list		oom_reaper_timer;
 #endif
 #ifdef CONFIG_VMAP_STACK
+	/*
+	 * We can't call find_vm_area() in interrupt context, and
+	 * free_thread_stack() can be called in interrupt context,
+	 * so cache the vm_struct.
+	 */
 	struct vm_struct		*stack_vm_area;
 #endif
 #ifdef CONFIG_THREAD_INFO_IN_TASK
@@ -1773,7 +1782,7 @@ extern struct pid *cad_pid;
 						 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
-#define PF__HOLE__00800000	0x00800000
+#define PF_DYNAMIC_STACK	0x00800000	/* This thread allocated dynamic stack pages */
 #define PF__HOLE__01000000	0x01000000
 #define PF__HOLE__02000000	0x02000000
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 1fab7e9043a3..7dcff2836d7e 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -13,6 +13,10 @@
 
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 
+#ifdef CONFIG_DYNAMIC_STACK
+#define DYNAMIC_STACK_MAX_ACCOUNT_MASK  ((1 << (THREAD_SIZE_ORDER + 1)) - 1)
+#endif
+
 /*
  * When accessing the stack of a non-current task that might exit, use
  * try_get_task_stack() instead.  task_stack_page will return a pointer
@@ -20,7 +24,11 @@
  */
 static __always_inline void *task_stack_page(const struct task_struct *task)
 {
+#ifdef CONFIG_DYNAMIC_STACK
+	return (void *)(task->packed_stack & ~DYNAMIC_STACK_MAX_ACCOUNT_MASK);
+#else
 	return task->stack;
+#endif
 }
 
 #define setup_thread_stack(new,old)	do { } while(0)
@@ -30,7 +38,7 @@ static __always_inline unsigned long *end_of_stack(const struct task_struct *tas
 #ifdef CONFIG_STACK_GROWSUP
 	return (unsigned long *)((unsigned long)task->stack + THREAD_SIZE) - 1;
 #else
-	return task->stack;
+	return task_stack_page(task);
 #endif
 }
 
@@ -83,9 +91,45 @@ static inline void put_task_stack(struct task_struct *tsk) {}
 
 void exit_task_stack_account(struct task_struct *tsk);
 
+#ifdef CONFIG_DYNAMIC_STACK
+
+#define task_stack_end_corrupted(task)	0
+
+#ifndef THREAD_PREALLOC_PAGES
+#define THREAD_PREALLOC_PAGES		1
+#endif
+
+#define THREAD_DYNAMIC_PAGES						\
+	((THREAD_SIZE >> PAGE_SHIFT) - THREAD_PREALLOC_PAGES)
+
+void dynamic_stack_refill_pages(void);
+unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize);
+bool dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack);
+
+/*
+ * Refill and charge for the used pages.
+ */
+static inline void dynamic_stack(struct task_struct *tsk)
+{
+	if (unlikely(tsk->flags & PF_DYNAMIC_STACK)) {
+		dynamic_stack_refill_pages();
+		dynamic_stack_accounting(tsk, false);
+		tsk->flags &= ~PF_DYNAMIC_STACK;
+	}
+}
+
+static inline void set_task_stack_end_magic(struct task_struct *tsk) {}
+
+#else /* !CONFIG_DYNAMIC_STACK */
+
 #define task_stack_end_corrupted(task) \
 		(*(end_of_stack(task)) != STACK_END_MAGIC)
 
+void set_task_stack_end_magic(struct task_struct *tsk);
+static inline void dynamic_stack(struct task_struct *tsk) {}
+
+#endif /* CONFIG_DYNAMIC_STACK */
+
 static inline int object_is_on_stack(const void *obj)
 {
 	void *stack = task_stack_page(current);
@@ -104,7 +148,6 @@ static inline unsigned long stack_not_used(struct task_struct *p)
 	return 0;
 }
 #endif
-extern void set_task_stack_end_magic(struct task_struct *tsk);
 
 static inline int kstack_end(void *addr)
 {
diff --git a/init/init_task.c b/init/init_task.c
index 5c838757fc10..e3645ec4ab02 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -99,7 +99,11 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.stack_refcount	= REFCOUNT_INIT(1),
 #endif
 	.__state	= 0,
+#ifdef CONFIG_DYNAMIC_STACK
+	.packed_stack	= (unsigned long)init_stack,
+#else
 	.stack		= init_stack,
+#endif
 	.usage		= REFCOUNT_INIT(2),
 	.flags		= PF_KTHREAD,
 	.prio		= MAX_PRIO - 20,
diff --git a/kernel/fork.c b/kernel/fork.c
index 01e0bf4f4b02..e615ef736dc0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -202,7 +202,10 @@ static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
  * accounting is performed by the code assigning/releasing stacks to tasks.
  * We need a zeroed memory without __GFP_ACCOUNT.
  */
-#define GFP_VMAP_STACK (GFP_KERNEL | __GFP_ZERO)
+static gfp_t vmap_stack_gfp(bool is_atomic)
+{
+	return (is_atomic ? GFP_ATOMIC : GFP_KERNEL) | __GFP_ZERO;
+}
 
 struct vm_stack {
 	struct rcu_work work;
@@ -241,6 +244,18 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
 	unsigned int i;
 	int nid;
 
+#ifdef CONFIG_DYNAMIC_STACK
+	/*
+	 * Skip the cache for populated dynamic stacks to avoid punishing a
+	 * memcg with a larger charge just because it happened to pick up a
+	 * dynamic stack that's been partially faulted in. We may get a lower
+	 * number of cache hits, but stacks with dynamically faulted pages
+	 * should be fairly uncommon.
+	 */
+	if (vm_area->nr_pages != THREAD_PREALLOC_PAGES)
+		return false;
+#endif /* CONFIG_DYNAMIC_STACK */
+
 	/*
 	 * Don't cache stacks if any of the pages don't match the local domain, unless
 	 * there is no local memory to begin with.
@@ -269,11 +284,285 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
 	return false;
 }
 
+#ifdef CONFIG_DYNAMIC_STACK
+
+/*
+ * There is a window between when a thread refills the page pool and when it
+ * actually gets scheduled out where it can still consume pages from the pool.
+ * To guarantee the next thread has enough pages to fully populate its stack,
+ * double the size of the page pool.
+ */
+#define DYNSTK_PAGE_POOL_NR (THREAD_DYNAMIC_PAGES * 2)
+
+static DEFINE_PER_CPU(struct page *, dynamic_stack_pages[DYNSTK_PAGE_POOL_NR]);
+
+static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_struct *vm_area)
+{
+	tsk->stack_vm_area = vm_area;
+	tsk->packed_stack = (unsigned long)kasan_reset_tag(vm_area->addr);
+}
+
+static void free_vmap_stack(struct vm_struct *vm_area)
+{
+	int i;
+
+	remove_vm_area(vm_area->addr);
+
+	for (i = 0; i < vm_area->nr_pages; i++)
+		__free_page(vm_area->pages[i]);
+
+	kfree(vm_area->pages);
+	kfree(vm_area);
+}
+
+static struct vm_struct *alloc_vmap_stack(int node)
+{
+	gfp_t gfp = vmap_stack_gfp(false);
+	unsigned long addr, end;
+	struct vm_struct *vm_area;
+	int err, i;
+
+	/*
+	 * Paranoid check to guarantee we never straddle a page table, so
+	 * that virt_to_kpte() is always valid in dynamic_stack_fault().
+	 */
+	BUILD_BUG_ON((PMD_SIZE % THREAD_SIZE) || (THREAD_ALIGN % THREAD_SIZE));
+
+	vm_area = get_vm_area_node(THREAD_SIZE, THREAD_ALIGN, VM_MAP, node,
+				   gfp, __builtin_return_address(0));
+	if (!vm_area)
+		return NULL;
+
+	vm_area->pages = kmalloc_node(sizeof(void *) *
+				      (THREAD_SIZE >> PAGE_SHIFT), gfp, node);
+	if (!vm_area->pages)
+		goto cleanup_err;
+
+	for (i = 0; i < THREAD_PREALLOC_PAGES; i++) {
+		vm_area->pages[i] = alloc_pages(gfp, 0);
+		if (!vm_area->pages[i])
+			goto cleanup_err;
+		vm_area->nr_pages++;
+	}
+
+	addr = (unsigned long)vm_area->addr +
+					(THREAD_DYNAMIC_PAGES << PAGE_SHIFT);
+	end = (unsigned long)vm_area->addr + THREAD_SIZE;
+	err = vmap_pages_range(addr, end, PAGE_KERNEL, vm_area->pages, PAGE_SHIFT);
+	if (err)
+		goto cleanup_err;
+
+	return vm_area;
+cleanup_err:
+	free_vmap_stack(vm_area);
+	return NULL;
+}
+
+static struct page *noinstr dynamic_stack_get_page(void)
+{
+	struct page **pages = this_cpu_ptr(dynamic_stack_pages);
+	int i;
+
+	for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+		struct page *page = pages[i];
+
+		if (!page)
+			continue;
+		pages[i] = NULL;
+		return page;
+	}
+
+	return NULL;
+}
+
+static int dynamic_stack_refill_pages_cpu(unsigned int cpu)
+{
+	struct page **pages = per_cpu_ptr(dynamic_stack_pages, cpu);
+	int i;
+
+	for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+		if (pages[i])
+			continue;
+		pages[i] = alloc_pages(vmap_stack_gfp(false), 0);
+		if (unlikely(!pages[i])) {
+			pr_err("failed to allocate dynamic stack page for cpu[%d]\n",
+			       cpu);
+			break;
+		}
+	}
+
+	return 0;
+}
+
+static int dynamic_stack_free_pages_cpu(unsigned int cpu)
+{
+	struct page **pages = per_cpu_ptr(dynamic_stack_pages, cpu);
+	int i;
+
+	for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+		if (!pages[i])
+			continue;
+		__free_page(pages[i]);
+		pages[i] = NULL;
+	}
+
+	return 0;
+}
+
+void dynamic_stack_refill_pages(void)
+{
+	struct page **pages = this_cpu_ptr(dynamic_stack_pages);
+	int i;
+
+	for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+		struct page *page = pages[i];
+
+		if (page)
+			continue;
+
+		/*
+		 * This is called during context switch, so we can't take any
+		 * sleeping locks. As such, we need to use GFP_ATOMIC.
+		 */
+		page = alloc_pages(vmap_stack_gfp(true), 0);
+		if (unlikely(!page))
+			pr_err_ratelimited("failed to refill per-cpu dynamic stack\n");
+		pages[i] = page;
+	}
+}
+
+unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize)
+{
+	struct vm_struct *vm_area = tsk->stack_vm_area;
+	unsigned long nr_accounted, i;
+
+	cant_sleep();
+
+	/* Verify enough low order bits in the page-aligned stack pointer. */
+	BUILD_BUG_ON(THREAD_PREALLOC_PAGES == 0 ||
+		     PAGE_SIZE - 1 <= DYNAMIC_STACK_MAX_ACCOUNT_MASK);
+
+	nr_accounted = tsk->packed_stack & DYNAMIC_STACK_MAX_ACCOUNT_MASK;
+
+	if (nr_accounted == DYNAMIC_STACK_MAX_ACCOUNT_MASK) {
+		WARN_ON_ONCE(finalize);
+		return 0;
+	}
+
+	for (i = THREAD_PREALLOC_PAGES + nr_accounted; i < vm_area->nr_pages; i++) {
+		struct page *page = vm_area->pages[i];
+
+		int ret = memcg_kmem_charge_page(page, GFP_ATOMIC, 0);
+		/*
+		 * XXX Since stack pages were already allocated, we should never
+		 * fail charging. Therefore, we should probably force the charge
+		 * and trigger OOM killing if the charge fails.
+		 */
+		if (unlikely(ret))
+			pr_warn_ratelimited("dynamic stack: charge for allocated page failed\n");
+
+		mod_lruvec_page_state(page, NR_KERNEL_STACK_KB,
+				      PAGE_SIZE / 1024);
+	}
+
+	if (finalize) {
+		tsk->packed_stack |= DYNAMIC_STACK_MAX_ACCOUNT_MASK;
+	} else {
+		tsk->packed_stack &= ~DYNAMIC_STACK_MAX_ACCOUNT_MASK;
+		tsk->packed_stack |= (i - THREAD_PREALLOC_PAGES);
+	}
+
+	return i;
+}
+
+bool noinstr dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack)
+{
+	unsigned long stack, hole_end, addr;
+	struct vm_struct *vm_area;
+	struct page *page;
+	int nr_pages;
+	pte_t *pte;
+
+	cant_sleep();
+
+	if (WARN_ON(in_nmi())) {
+		*on_stack = false;
+		return false;
+	}
+
+	/* check if address is inside the kernel stack area */
+	stack = (unsigned long)task_stack_page(tsk);
+	if (address < stack || address >= stack + THREAD_SIZE) {
+		*on_stack = false;
+		return false;
+	}
+	*on_stack = true;
+
+	vm_area = tsk->stack_vm_area;
+	if (WARN_ON_ONCE(!vm_area))
+		return false;
+
+	nr_pages = vm_area->nr_pages;
+
+	/* Check if fault address is within the stack hole */
+	hole_end = stack + THREAD_SIZE - (nr_pages << PAGE_SHIFT);
+	if (address >= hole_end)
+		return false;
+
+	/*
+	 * Most likely we faulted on the page right next to the last mapped
+	 * page in the stack. However, it is possible (but very unlikely) that
+	 * the faulting access skips some pages in the stack. Make sure we do
+	 * not create more than one hole in the stack by mapping every page
+	 * between the current fault address and the last page that is
+	 * mapped in the stack.
+	 */
+	address = PAGE_ALIGN_DOWN(address);
+	for (addr = hole_end - PAGE_SIZE; addr >= address; addr -= PAGE_SIZE) {
+		/* Take the next page from the per-cpu list */
+		page = dynamic_stack_get_page();
+		if (!page) {
+			instrumentation_begin();
+			pr_emerg("Failed to allocate a page during kernel_stack_fault\n");
+			instrumentation_end();
+			return false;
+		}
+
+		/* Add the new page entry to the page table */
+		pte = virt_to_kpte(addr);
+		if (!pte) {
+			instrumentation_begin();
+			pr_emerg("The PTE page table for a kernel stack is not found\n");
+			instrumentation_end();
+			return false;
+		}
+
+		/* Make sure there are no existing mappings at this address */
+		if (pte_present(*pte)) {
+			instrumentation_begin();
+			pr_emerg("The PTE contains a mapping\n");
+			instrumentation_end();
+			return false;
+		}
+		set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+
+		/* Store the new page in the stack's vm_area */
+		vm_area->pages[nr_pages] = page;
+		vm_area->nr_pages = ++nr_pages;
+	}
+
+	/* Refill the pcp stack pages during context switch */
+	tsk->flags |= PF_DYNAMIC_STACK;
+
+	return true;
+}
+
+#else /* !CONFIG_DYNAMIC_STACK */
 static inline struct vm_struct *alloc_vmap_stack(int node)
 {
 	void *stack;
 
-	stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, GFP_VMAP_STACK,
+	stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, vmap_stack_gfp(false),
 			       node, __builtin_return_address(0));
 
 	return stack ? find_vm_area(stack) : NULL;
@@ -284,6 +573,13 @@ static inline void free_vmap_stack(struct vm_struct *vm_area)
 	vfree(vm_area->addr);
 }
 
+static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_struct *vm_area)
+{
+	tsk->stack_vm_area = vm_area;
+	tsk->stack = kasan_reset_tag(vm_area->addr);
+}
+#endif /* CONFIG_DYNAMIC_STACK */
+
 static void thread_stack_free_work(struct work_struct *work)
 {
 	struct vm_stack *vm_stack = container_of(to_rcu_work(work), struct vm_stack, work);
@@ -300,9 +596,9 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
 	struct vm_stack *vm_stack;
 
 	if (IS_ENABLED(CONFIG_STACK_GROWSUP))
-		vm_stack = tsk->stack;
+		vm_stack = task_stack_page(tsk);
 	else
-		vm_stack = tsk->stack + THREAD_SIZE - sizeof(*vm_stack);
+		vm_stack = task_stack_page(tsk) + THREAD_SIZE - sizeof(*vm_stack);
 
 	vm_stack->stack_vm_area = tsk->stack_vm_area;
 	INIT_RCU_WORK(&vm_stack->work, thread_stack_free_work);
@@ -361,14 +657,13 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 
 		/* Reset stack metadata. */
 		kasan_unpoison_range(vm_area->addr, THREAD_SIZE);
-		tsk->stack = kasan_reset_tag(vm_area->addr);
+		link_vmap_stack_to_task(tsk, vm_area);
 
 		/* Clear stale pointers from reused stack. */
 		if (!IS_ENABLED(CONFIG_STACK_GROWSUP))
 			memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;
-		memset(tsk->stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
+		memset(task_stack_page(tsk) + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
 
-		tsk->stack_vm_area = vm_area;
 		return 0;
 	}
 
@@ -380,22 +675,20 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 		free_vmap_stack(vm_area);
 		return -ENOMEM;
 	}
-	/*
-	 * We can't call find_vm_area() in interrupt context, and
-	 * free_thread_stack() can be called in interrupt context,
-	 * so cache the vm_struct.
-	 */
-	tsk->stack_vm_area = vm_area;
-	tsk->stack = kasan_reset_tag(vm_area->addr);
+	link_vmap_stack_to_task(tsk, vm_area);
 	return 0;
 }
 
 static void free_thread_stack(struct task_struct *tsk)
 {
-	if (!try_release_thread_stack_to_cache(tsk->stack_vm_area))
+	if (!try_release_thread_stack_to_cache(task_stack_vm_area(tsk)))
 		thread_stack_delayed_free(tsk);
 
+#ifdef CONFIG_DYNAMIC_STACK
+	tsk->packed_stack = 0;
+#else
 	tsk->stack = NULL;
+#endif
 	tsk->stack_vm_area = NULL;
 }
 
@@ -498,9 +791,27 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
 {
 	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
 		struct vm_struct *vm_area = task_stack_vm_area(tsk);
-		int i;
+		int i, nr_accounted;
 
-		for (i = 0; i < vm_area->nr_pages; i++)
+#ifdef CONFIG_DYNAMIC_STACK
+		/*
+		 * For the exit path, resolve any pending accounting to avoid
+		 * underflow. Finalize to skip accounting for any faults that
+		 * happen between here and this thread's final __schedule()
+		 * call in do_task_dead().
+		 */
+		if (account < 0) {
+			preempt_disable();
+			nr_accounted = dynamic_stack_accounting(tsk, true);
+			preempt_enable();
+		} else {
+			nr_accounted = THREAD_PREALLOC_PAGES;
+		}
+#else
+		nr_accounted = vm_area->nr_pages;
+#endif
+
+		for (i = 0; i < nr_accounted; i++)
 			mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
 					      account * (PAGE_SIZE / 1024));
 	} else {
@@ -901,6 +1212,16 @@ void __init fork_init(void)
 			  NULL, free_vm_stack_cache);
 #endif
 
+#ifdef CONFIG_DYNAMIC_STACK
+	cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:dynamic_stack",
+			  dynamic_stack_refill_pages_cpu,
+			  dynamic_stack_free_pages_cpu);
+	/*
+	 * Fill the dynamic stack pages for the boot CPU, others will be filled
+	 * as CPUs are onlined.
+	 */
+	dynamic_stack_refill_pages_cpu(smp_processor_id());
+#endif
 	scs_init();
 
 	lockdep_init_task(&init_task);
@@ -914,6 +1235,7 @@ int __weak arch_dup_task_struct(struct task_struct *dst,
 	return 0;
 }
 
+#ifndef CONFIG_DYNAMIC_STACK
 void set_task_stack_end_magic(struct task_struct *tsk)
 {
 	unsigned long *stackend;
@@ -921,6 +1243,7 @@ void set_task_stack_end_magic(struct task_struct *tsk)
 	stackend = end_of_stack(tsk);
 	*stackend = STACK_END_MAGIC;	/* for overflow detection */
 }
+#endif
 
 static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 496dff740dca..417269a86973 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6783,6 +6783,7 @@ static void __sched notrace __schedule(int sched_mode)
 	rq = cpu_rq(cpu);
 	prev = rq->curr;
 
+	dynamic_stack(prev);
 	schedule_debug(prev, preempt);
 
 	if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog




* [PATCH v2 08/13] task_stack.h: Add stack_not_used() support for dynamic stack
  2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
                   ` (6 preceding siblings ...)
  2026-04-24 19:14 ` [PATCH v2 07/13] fork: Dynamic Kernel Stacks David Stevens
@ 2026-04-24 19:14 ` David Stevens
  2026-04-24 19:14 ` [PATCH v2 09/13] fork: Dynamic Kernel Stack accounting David Stevens
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
  To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
  Cc: David Stevens, linux-kernel, linux-mm

From: Pasha Tatashin <pasha.tatashin@soleen.com>

CONFIG_DEBUG_STACK_USAGE is enabled by default on most architectures.

Its purpose is to determine and print the maximum stack depth on
thread exit.

It works by starting from the bottom of the stack and searching for the
first non-zero word. This does not work well with dynamic stacks, as
it faults in every page of every stack.

Instead, add a specific version of stack_not_used() for dynamic stacks
which, instead of starting from the bottom of the stack, starts from
the last page mapped in the stack.

In addition to avoiding unnecessary page faults, the search is
optimized by skipping zero pages.

Also, because a dynamic stack does not end with STACK_END_MAGIC, there
is no need to skip the bottom-most word in the stack.
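
As a rough illustration of the scan described above, here is a user-space
model, not the kernel code. The MODEL_* sizes and model_stack_not_used()
are made-up names for this sketch, and the loop bound is an addition of
the model: the patch itself has no such bound and relies on a non-zero
word existing in the mapped part of the stack.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes for the model; the kernel uses PAGE_SIZE/THREAD_SIZE. */
#define MODEL_PAGE_SIZE   4096UL
#define MODEL_THREAD_SIZE (4 * MODEL_PAGE_SIZE)
#define MODEL_WORDS       (MODEL_THREAD_SIZE / sizeof(unsigned long))

/*
 * Model of a downward-growing stack where only the top nr_pages pages are
 * populated. The scan starts at the lowest populated page instead of the
 * lowest address, so the unpopulated pages are never touched.
 * Returns the number of unused bytes counted from the lowest address.
 */
static unsigned long model_stack_not_used(const unsigned long *stack,
					  unsigned long nr_pages)
{
	unsigned long alloc_size = nr_pages * MODEL_PAGE_SIZE;
	size_t i;

	assert(nr_pages >= 1 && alloc_size <= MODEL_THREAD_SIZE);

	/* Index of the first word of the lowest populated page. */
	i = (MODEL_THREAD_SIZE - alloc_size) / sizeof(unsigned long);

	/* Walk upwards until the first word that was ever written. */
	while (i < MODEL_WORDS && !stack[i])
		i++;

	return i * sizeof(unsigned long);
}
```

In this model the result does not depend on how many pages are populated
below the deepest written word; only the starting point of the scan
changes.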

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Rebased, Kasan oneliner needed preserving, rewrote a bit due to bugs]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[Handle init_task's use of init_stack, fix typos]
Signed-off-by: David Stevens <stevensd@google.com>
---
 arch/Kconfig  |  1 -
 kernel/exit.c | 22 ++++++++++++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 95ded79f0825..beffe7e01296 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1542,7 +1542,6 @@ config DYNAMIC_STACK
 	depends on VMAP_STACK
 	depends on INIT_STACK_ALL_ZERO || INIT_STACK_ALL_PATTERN
 	depends on !KASAN
-	depends on !DEBUG_STACK_USAGE
 	depends on !STACK_GROWSUP
 	depends on !PREEMPT_RT
 	help
diff --git a/kernel/exit.c b/kernel/exit.c
index ede3117fa7d4..6caf4030e8f4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -71,6 +71,7 @@
 #include <linux/unwind_deferred.h>
 #include <linux/uaccess.h>
 #include <linux/pidfs.h>
+#include <linux/vmalloc.h>
 
 #include <uapi/linux/wait.h>
 
@@ -791,6 +792,26 @@ unsigned long stack_not_used(struct task_struct *p)
 	return (unsigned long)end_of_stack(p) - (unsigned long)n;
 }
 #else /* !CONFIG_STACK_GROWSUP */
+#ifdef CONFIG_DYNAMIC_STACK
+unsigned long stack_not_used(struct task_struct *p)
+{
+	struct vm_struct *vm_area = task_stack_vm_area(p);
+	unsigned long stack = (unsigned long)task_stack_page(p);
+	unsigned long alloc_size, *n;
+
+	/* This is NULL only for init_task, where init_stack is fully allocated. */
+	if (likely(vm_area))
+		alloc_size = vm_area->nr_pages << PAGE_SHIFT;
+	else
+		alloc_size = THREAD_SIZE;
+	n = (unsigned long *)(stack + THREAD_SIZE - alloc_size);
+
+	while (!*n)
+		n++;
+
+	return (unsigned long)n - stack;
+}
+#else
 unsigned long stack_not_used(struct task_struct *p)
 {
 	unsigned long *n = end_of_stack(p);
@@ -801,6 +822,7 @@ unsigned long stack_not_used(struct task_struct *p)
 
 	return (unsigned long)n - (unsigned long)end_of_stack(p);
 }
+#endif /* CONFIG_DYNAMIC_STACK */
 #endif /* CONFIG_STACK_GROWSUP */
 
 /* Count the maximum pages reached in kernel stacks */
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog




* [PATCH v2 09/13] fork: Dynamic Kernel Stack accounting
  2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
                   ` (7 preceding siblings ...)
  2026-04-24 19:14 ` [PATCH v2 08/13] task_stack.h: Add stack_not_used() support for dynamic stack David Stevens
@ 2026-04-24 19:14 ` David Stevens
  2026-04-24 19:14 ` [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes David Stevens
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
  To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
  Cc: David Stevens, linux-kernel, linux-mm

From: Pasha Tatashin <pasha.tatashin@soleen.com>

Add accounting for the number of stack pages that have been faulted in
and are currently in use.

Example use case:
  $ cat /proc/vmstat | grep stack
  nr_kernel_stack 18684
  nr_dynamic_stacks_faults 156

The above shows that the kernel stacks use a total of 18684 KiB, of
which 156 KiB were faulted in.

Given that the pre-allocated stacks are 4KiB, we can determine the total
number of tasks:

tasks = (nr_kernel_stack - nr_dynamic_stacks_faults) / 4 = 4632.

The amount of kernel stack memory without dynamic stacks on this machine
would be:

4632 * 16 KiB = 74,112 KiB

Therefore, in this example dynamic stacks save
74,112 KiB - 18,684 KiB = 55,428 KiB.
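
The same arithmetic can be written as a small user-space helper. This is
only a sketch: struct stack_savings and estimate_stack_savings() are
hypothetical names for this example, and the 4 KiB pre-allocation and
16 KiB full stack size are the values assumed in the example above.

```c
#include <assert.h>

struct stack_savings {
	unsigned long tasks;      /* estimated number of tasks */
	unsigned long static_kib; /* stack memory if every stack were full size */
	unsigned long saved_kib;  /* static_kib minus the memory actually used */
};

/*
 * All arguments are in KiB. nr_kernel_stack_kib and nr_faults_kib are the
 * two counters read from /proc/vmstat.
 */
static struct stack_savings
estimate_stack_savings(unsigned long nr_kernel_stack_kib,
		       unsigned long nr_faults_kib,
		       unsigned long prealloc_kib,
		       unsigned long thread_size_kib)
{
	struct stack_savings s;

	assert(prealloc_kib > 0 && nr_kernel_stack_kib >= nr_faults_kib);

	/* Whatever was not faulted in must be pre-allocated pages. */
	s.tasks = (nr_kernel_stack_kib - nr_faults_kib) / prealloc_kib;
	s.static_kib = s.tasks * thread_size_kib;
	s.saved_kib = s.static_kib - nr_kernel_stack_kib;
	return s;
}
```

Calling estimate_stack_savings(18684, 156, 4, 16) reproduces the numbers
in the example: 4632 tasks, 74,112 KiB and 55,428 KiB.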

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Rebased]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[add to memcg stats, fix typos]
Signed-off-by: David Stevens <stevensd@google.com>
---
 include/linux/mmzone.h |  3 +++
 kernel/fork.c          | 12 +++++++++++-
 mm/memcontrol.c        | 10 ++++++++++
 mm/vmstat.c            |  3 +++
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4..4458fa7016a1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -221,6 +221,9 @@ enum node_stat_item {
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
 	NR_KERNEL_STACK_KB,	/* measured in KiB */
+#ifdef CONFIG_DYNAMIC_STACK
+	NR_DYNAMIC_STACKS_FAULTS_KB, /* KiB of faulted kernel stack memory */
+#endif
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	NR_KERNEL_SCS_KB,	/* measured in KiB */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index e615ef736dc0..9ac9d23f5f4b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -463,6 +463,8 @@ unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize)
 
 		mod_lruvec_page_state(page, NR_KERNEL_STACK_KB,
 				      PAGE_SIZE / 1024);
+		mod_lruvec_page_state(page, NR_DYNAMIC_STACKS_FAULTS_KB,
+				      PAGE_SIZE / 1024);
 	}
 
 	if (finalize) {
@@ -811,9 +813,17 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
 		nr_accounted = vm_area->nr_pages;
 #endif
 
-		for (i = 0; i < nr_accounted; i++)
+		for (i = 0; i < nr_accounted; i++) {
 			mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
 					      account * (PAGE_SIZE / 1024));
+#ifdef CONFIG_DYNAMIC_STACK
+			if (i >= THREAD_PREALLOC_PAGES) {
+				mod_lruvec_page_state(vm_area->pages[i],
+						      NR_DYNAMIC_STACKS_FAULTS_KB,
+						      account * (PAGE_SIZE / 1024));
+			}
+#endif
+		}
 	} else {
 		void *stack = task_stack_page(tsk);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 772bac21d155..cd2195a735ab 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -318,6 +318,9 @@ static const unsigned int memcg_node_stat_items[] = {
 	NR_FILE_THPS,
 	NR_ANON_THPS,
 	NR_KERNEL_STACK_KB,
+#ifdef CONFIG_DYNAMIC_STACK
+	NR_DYNAMIC_STACKS_FAULTS_KB,
+#endif
 	NR_PAGETABLE,
 	NR_SECONDARY_PAGETABLE,
 #ifdef CONFIG_SWAP
@@ -1403,6 +1406,10 @@ static const struct memory_stat memory_stats[] = {
 #ifdef CONFIG_NUMA_BALANCING
 	{ "pgpromote_success",		PGPROMOTE_SUCCESS	},
 #endif
+
+#ifdef CONFIG_DYNAMIC_STACK
+	{ "dynamic_stack_faults",	NR_DYNAMIC_STACKS_FAULTS_KB     },
+#endif
 };
 
 /* The actual unit of the state item, not the same as the output unit */
@@ -1415,6 +1422,9 @@ static int memcg_page_state_unit(int item)
 	case NR_SLAB_UNRECLAIMABLE_B:
 		return 1;
 	case NR_KERNEL_STACK_KB:
+#ifdef CONFIG_DYNAMIC_STACK
+	case NR_DYNAMIC_STACKS_FAULTS_KB:
+#endif
 		return SZ_1K;
 	default:
 		return PAGE_SIZE;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 86b14b0f77b5..8fa1c7bcbaea 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1256,6 +1256,9 @@ const char * const vmstat_text[] = {
 	[I(NR_FOLL_PIN_ACQUIRED)]		= "nr_foll_pin_acquired",
 	[I(NR_FOLL_PIN_RELEASED)]		= "nr_foll_pin_released",
 	[I(NR_KERNEL_STACK_KB)]			= "nr_kernel_stack",
+#ifdef CONFIG_DYNAMIC_STACK
+	[I(NR_DYNAMIC_STACKS_FAULTS_KB)]	= "nr_dynamic_stacks_faults",
+#endif
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	[I(NR_KERNEL_SCS_KB)]			= "nr_shadow_call_stack",
 #endif
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog




* [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes
  2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
                   ` (8 preceding siblings ...)
  2026-04-24 19:14 ` [PATCH v2 09/13] fork: Dynamic Kernel Stack accounting David Stevens
@ 2026-04-24 19:14 ` David Stevens
  2026-04-24 19:14 ` [PATCH v2 11/13] x86/entry/fred: encode frame pointer on entry David Stevens
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
  To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
  Cc: David Stevens, linux-kernel, linux-mm

Store the task pointer in the ptes of the unpopulated pages of dynamic
stacks, to allow the vm_struct pointer to be retrieved without relying
on any locks or current.

This relies on being able to pack the struct task_struct pointer into a
pte. Since the struct is 64 byte aligned, that gives 5 bits of leeway,
which should be viable on most architectures.  Any architecture which
enables dynamic thread stacks must provide make_data_kpte() and
unpack_data_kpte(), which pack/unpack a right shifted pointer value
into/from a pte.
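
A user-space sketch of the shift step only, under the assumption of a
64-byte aligned pointer. The MODEL_* constants and model_*() helpers are
made up for this illustration. The architecture-specific encoding done by
make_data_kpte() and unpack_data_kpte(), including which pte bits must
stay clear for a non-present entry, is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed alignment for the model; the patch derives the shift from
 * __alignof__(struct task_struct). */
#define MODEL_TASK_ALIGN     64u
#define MODEL_TASK_PTR_SHIFT 6u /* log2(MODEL_TASK_ALIGN) */

/* Drop the low bits that alignment guarantees to be zero. */
static uint64_t model_pack_task_ptr(uint64_t ptr)
{
	assert((ptr & (MODEL_TASK_ALIGN - 1)) == 0);
	return ptr >> MODEL_TASK_PTR_SHIFT;
}

/* Restore the original pointer value from the shifted data. */
static uint64_t model_unpack_task_ptr(uint64_t data)
{
	return data << MODEL_TASK_PTR_SHIFT;
}
```

Because the dropped bits are known to be zero, the round trip is
lossless, and the shifted value fits in 64 - 6 = 58 bits in this model.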

Signed-off-by: David Stevens <stevensd@google.com>
---
 include/linux/sched/task_stack.h |  1 +
 kernel/fork.c                    | 74 +++++++++++++++++++++++++++++---
 mm/vmalloc.c                     |  2 +-
 3 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 7dcff2836d7e..7cf00ce97f7c 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -105,6 +105,7 @@ void exit_task_stack_account(struct task_struct *tsk);
 void dynamic_stack_refill_pages(void);
 unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize);
 bool dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack);
+struct task_struct *task_from_stack_address(unsigned long address);
 
 /*
  * Refill and charge for the used pages.
diff --git a/kernel/fork.c b/kernel/fork.c
index 9ac9d23f5f4b..733fc1f58b8b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -296,16 +296,40 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
 
 static DEFINE_PER_CPU(struct page *, dynamic_stack_pages[DYNSTK_PAGE_POOL_NR]);
 
+#define TASK_PTR_SHIFT (ilog2(__alignof__(struct task_struct)))
+
 static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_struct *vm_area)
 {
+	int i;
+	unsigned long addr;
+	pte_t *ptep, pte;
+
+	pte = make_data_kpte(((unsigned long)tsk) >> TASK_PTR_SHIFT);
+
 	tsk->stack_vm_area = vm_area;
 	tsk->packed_stack = (unsigned long)kasan_reset_tag(vm_area->addr);
+
+	addr = (unsigned long)vm_area->addr;
+	ptep = virt_to_kpte(addr);
+	for (i = vm_area->nr_pages; i < THREAD_SIZE >> PAGE_SHIFT;
+	     i++, addr += PAGE_SIZE, ptep++)
+		set_pte_at(&init_mm, addr, ptep, pte);
 }
 
-static void free_vmap_stack(struct vm_struct *vm_area)
+static void free_vmap_stack(struct vm_struct *vm_area, bool was_mapped)
 {
 	int i;
 
+	/* Clear data kptes since vunmap expects present or none. */
+	if (was_mapped) {
+		unsigned long addr = (unsigned long)vm_area->addr;
+		pte_t *ptep = virt_to_kpte(addr);
+		unsigned int nr_to_clear = (THREAD_SIZE >> PAGE_SHIFT) - vm_area->nr_pages;
+
+		if (nr_to_clear)
+			clear_ptes(&init_mm, addr, ptep, nr_to_clear);
+	}
+
 	remove_vm_area(vm_area->addr);
 
 	for (i = 0; i < vm_area->nr_pages; i++)
@@ -354,7 +378,7 @@ static struct vm_struct *alloc_vmap_stack(int node)
 
 	return vm_area;
 cleanup_err:
-	free_vmap_stack(vm_area);
+	free_vmap_stack(vm_area, false);
 	return NULL;
 }
 
@@ -477,6 +501,42 @@ unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize)
 	return i;
 }
 
+noinstr struct task_struct *task_from_stack_address(unsigned long address)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	BUILD_BUG_ON((BITS_PER_LONG - TASK_PTR_SHIFT) > KPTE_AVAILABLE_DATA_BITS);
+
+	if (!is_vmalloc_addr((void *)address))
+		return NULL;
+
+	pgd = pgd_offset_k(address);
+	if (pgd_none(*pgd) || pgd_leaf(*pgd))
+		return NULL;
+
+	p4d = p4d_offset(pgd, address);
+	if (p4d_none(*p4d) || p4d_leaf(*p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, address);
+	if (pud_none(*pud) || pud_leaf(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd) || pmd_leaf(*pmd))
+		return NULL;
+
+	pte = pte_offset_kernel(pmd, address);
+	if (pte_present(*pte) || pte_none(*pte))
+		return NULL;
+
+	return (struct task_struct *)(unpack_data_kpte(*pte) << TASK_PTR_SHIFT);
+}
+
 bool noinstr dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack)
 {
 	unsigned long stack, hole_end, addr;
@@ -570,7 +630,7 @@ static inline struct vm_struct *alloc_vmap_stack(int node)
 	return stack ? find_vm_area(stack) : NULL;
 }
 
-static inline void free_vmap_stack(struct vm_struct *vm_area)
+static inline void free_vmap_stack(struct vm_struct *vm_area, bool was_mapped)
 {
 	vfree(vm_area->addr);
 }
@@ -590,7 +650,7 @@ static void thread_stack_free_work(struct work_struct *work)
 	if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
 		return;
 
-	free_vmap_stack(vm_area);
+	free_vmap_stack(vm_area, true);
 }
 
 static void thread_stack_delayed_free(struct task_struct *tsk)
@@ -618,7 +678,7 @@ static int free_vm_stack_cache(unsigned int cpu)
 		if (!vm_area)
 			continue;
 
-		free_vmap_stack(vm_area);
+		free_vmap_stack(vm_area, true);
 		cached_vm_stack_areas[i] = NULL;
 	}
 
@@ -653,7 +713,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 		unsigned long memset_offset = 0;
 
 		if (memcg_charge_kernel_stack(vm_area)) {
-			free_vmap_stack(vm_area);
+			free_vmap_stack(vm_area, true);
 			return -ENOMEM;
 		}
 
@@ -674,7 +734,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 		return -ENOMEM;
 
 	if (memcg_charge_kernel_stack(vm_area)) {
-		free_vmap_stack(vm_area);
+		free_vmap_stack(vm_area, true);
 		return -ENOMEM;
 	}
 	link_vmap_stack_to_task(tsk, vm_area);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 39b7e118cbce..76955c101180 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -76,7 +76,7 @@ early_param("nohugevmalloc", set_nohugevmalloc);
 static const bool vmap_allow_huge = false;
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
 
-bool is_vmalloc_addr(const void *x)
+noinstr bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
 
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog




* [PATCH v2 11/13] x86/entry/fred: encode frame pointer on entry
  2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
                   ` (9 preceding siblings ...)
  2026-04-24 19:14 ` [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes David Stevens
@ 2026-04-24 19:14 ` David Stevens
  2026-04-24 19:14 ` [PATCH v2 12/13] x86: Add support for dynamic kernel stacks via FRED David Stevens
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
  To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
  Cc: David Stevens, linux-kernel, linux-mm

Add the missing ENCODE_FRAME_POINTER invocation to the FRED_ENTER macro,
so that the unwinder does not encounter a NULL stack frame pointer when
CONFIG_UNWINDER_FRAME_POINTER is enabled.
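
For readers unfamiliar with the encoding, here is a minimal userspace
sketch of the idea, assuming the usual x86_64 convention that an encoded
frame pointer is the pt_regs address with bit 0 set. The struct and
function names are hypothetical stand-ins, not the kernel's own code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct pt_regs; only its address matters. */
struct regs_model {
	uint64_t slots[21];
};

/*
 * Entry side (assumed convention): store the pt_regs address in the
 * frame pointer with bit 0 set, so it cannot be mistaken for a normal
 * frame address.
 */
static uintptr_t encode_frame_pointer(const struct regs_model *regs)
{
	return (uintptr_t)regs | 1u;
}

/*
 * Unwinder side: bit 0 set means "this is a pt_regs pointer"; anything
 * else (including a NULL frame pointer) is not an encoded pt_regs.
 */
static const struct regs_model *decode_frame_pointer(uintptr_t bp)
{
	if (!(bp & 1u))
		return NULL;
	return (const struct regs_model *)(bp & ~(uintptr_t)1);
}
```

Without the encode step the unwinder sees whatever PUSH_AND_CLEAR_REGS
left in the frame pointer, which in this model decodes to NULL.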

Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code")
Signed-off-by: David Stevens <stevensd@google.com>
---
 arch/x86/entry/entry_64_fred.S | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 894f7f16eb80..119b8214748e 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -7,6 +7,7 @@
 #include <linux/kvm_types.h>
 
 #include <asm/asm.h>
+#include <asm/frame.h>
 #include <asm/fred.h>
 #include <asm/segment.h>
 
@@ -19,6 +20,7 @@
 	UNWIND_HINT_END_OF_STACK
 	ANNOTATE_NOENDBR
 	PUSH_AND_CLEAR_REGS
+	ENCODE_FRAME_POINTER
 	movq	%rsp, %rdi	/* %rdi -> pt_regs */
 .endm
 
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog




* [PATCH v2 12/13] x86: Add support for dynamic kernel stacks via FRED
  2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
                   ` (10 preceding siblings ...)
  2026-04-24 19:14 ` [PATCH v2 11/13] x86/entry/fred: encode frame pointer on entry David Stevens
@ 2026-04-24 19:14 ` David Stevens
  2026-04-24 19:14 ` [PATCH v2 13/13] x86: Add support for dynamic kernel stacks via IST David Stevens
  2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
  13 siblings, 0 replies; 41+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
  To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
  Cc: David Stevens, linux-kernel, linux-mm

Add support for dynamic kernel stack faults by handling #PFs from CPL 0
on stack level 1. Since we can't sleep while on a per-CPU stack, any
page faults that didn't originate in an atomic context need to be
bounced back to the originating stack.
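
The routing described above can be sketched in plain C. This is a
simplified model, not the kernel code: the constants (a 64-byte red
zone, 64-byte frame alignment, a 184-byte frame) are assumptions chosen
to mirror what copy_stack_data() in this patch appears to compute, and
all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define REDZONE_BYTES	64u	/* space skipped below the interrupted RSP */
#define FRAME_ALIGN	64u	/* alignment of the top of the frame */
#define FRAME_BYTES	184u	/* saved registers plus event data */

/* Where the copied frame would start on the originating stack. */
static uintptr_t place_frame_below(uintptr_t interrupted_sp)
{
	uintptr_t new_sp = interrupted_sp - REDZONE_BYTES;

	new_sp &= ~(uintptr_t)(FRAME_ALIGN - 1);
	new_sp -= FRAME_BYTES;
	return new_sp;
}

enum pf_route {
	PF_RESOLVED,		/* dynamic stack fault, handled in place */
	PF_STAY_ON_PF_STACK,	/* atomic context, handler cannot sleep */
	PF_BOUNCE,		/* copy the frame back and switch stacks */
};

/* Decide what to do with a #PF taken from CPL 0. */
static enum pf_route route_kernel_pf(bool was_stack_fault, bool atomic)
{
	if (was_stack_fault)
		return PF_RESOLVED;
	return atomic ? PF_STAY_ON_PF_STACK : PF_BOUNCE;
}
```

For example, an interrupted stack pointer of 0x10000 yields a frame
starting at 0xff08 under these assumed constants.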

With dynamic kernel stacks, the processor pushing data onto the kernel
thread stack can cause a page fault. The SDM says in the #DF section
that the processor should be able to handle these exceptions serially.
In practice, however, this does not appear to work reliably.

With KVM, I've observed timer interrupts being dropped. The
corresponding bit in VIRR is cleared and the ISR bit in the APIC is set
before the #PF is delivered, but the interrupt handler is not invoked
after the kernel stack fault is resolved. On bare metal, I've observed
frequent hangs due to threads getting stuck in
folio_wait_bit_common(). I haven't traced this to an exact interrupt
being lost, but moving interrupts to stack level 1 reduces boot failures
from more than 10% to zero across thousands of attempts.

To work around this, external interrupts are also moved to stack level
1, and unconditionally bounced back to the originating stack.

Bouncing page faults and external interrupts through stack level 1 while
in CPL 0 adds a small but non-trivial overhead to those paths. The
shared entry point for events received in CPL 0 also becomes slightly
more expensive, due to the need to detect page faults and external
interrupts.
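
The extra detection work amounts to a mask and compare on the upper
half of the saved SS slot. Below is a small C model of the checks done
in asm_fred_entrypoint_kernel. The mask and compare values are taken
from the assembly in this patch; the function and macro names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define EVENT_TYPE_EXTINT	0u	/* external interrupt */
#define EVENT_TYPE_HWEXC	3u	/* hardware exception */
#define VECTOR_PF		14u	/* #PF */

/*
 * Keep only the event type (bits 16-19) and the vector (bits 0-7) of
 * the upper 32 bits of the augmented SS slot, as the andl does.
 */
static uint32_t event_key(uint64_t ss_slot)
{
	return (uint32_t)(ss_slot >> 32) & 0x000f00ffu;
}

/* Hardware exception with vector #PF, i.e. key == 0x0003000e. */
static bool is_kernel_page_fault(uint64_t ss_slot)
{
	return event_key(ss_slot) == ((EVENT_TYPE_HWEXC << 16) | VECTOR_PF);
}

/* External interrupts have event type 0, regardless of the vector. */
static bool is_external_interrupt(uint64_t ss_slot)
{
	return (event_key(ss_slot) & 0x000f0000u) == (EVENT_TYPE_EXTINT << 16);
}
```

Every other event type falls through to the primary entry path
unchanged, which is why the added cost is limited to these two
comparisons.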

Since enabling HAVE_ARCH_DYNAMIC_STACK requires unconditional support,
the config option is enabled in the next patch, which adds dynamic stack
support for traditional interrupt delivery.

Signed-off-by: David Stevens <stevensd@google.com>
---
 arch/x86/entry/entry_64_fred.S    | 55 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable_64.h | 36 ++++++++++++++++++++
 arch/x86/include/asm/traps.h      |  5 +++
 arch/x86/kernel/fred.c            | 20 ++++++++---
 arch/x86/mm/dump_pagetables.c     | 14 +++++---
 arch/x86/mm/fault.c               | 53 +++++++++++++++++++++++++++++
 6 files changed, 174 insertions(+), 9 deletions(-)
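
As an aid for reviewing the data-kpte helpers added to pgtable_64.h, the
following is a userspace model of the same bit layout: the value is
spread around bit 0 (present), bits 5 and 6 (accessed and dirty), and
bit 8. The shifts and masks are copied from the patch; the function
names and the SKIPPED_BITS macro are hypothetical.

```c
#include <stdint.h>

/* Bits a data kpte must leave clear: present, accessed, dirty, bit 8. */
#define SKIPPED_BITS	((1ull << 0) | (1ull << 5) | (1ull << 6) | (1ull << 8))

/* Same packing as make_data_kpte(), on a plain 64-bit integer. */
static uint64_t pack_data_kpte(uint64_t val)
{
	uint64_t low  = (val & 0xfull) << 1;	/* val bits 0-3 -> pte bits 1-4 */
	uint64_t mid  = (val & 0x10ull) << 3;	/* val bit 4    -> pte bit 7    */
	uint64_t high = (val & ~0x1full) << 4;	/* val bits 5+  -> pte bits 9+  */

	return low | mid | high;
}

/* Inverse of pack_data_kpte(), mirroring unpack_data_kpte(). */
static uint64_t unpack_data_kpte_model(uint64_t pte)
{
	uint64_t low  = (pte >> 1) & 0xfull;
	uint64_t mid  = (pte >> 3) & 0x10ull;
	uint64_t high = (pte >> 4) & ~0x1full;

	return low | mid | high;
}
```

Any value that fits in the 58 bits declared by KPTE_AVAILABLE_DATA_BITS
round-trips through this model, and the packed form never has the
present bit set, so pte_present() stays false for such entries.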

diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 119b8214748e..7202655ef662 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -54,7 +54,62 @@ SYM_CODE_END(asm_fred_entrypoint_user)
 	.org asm_fred_entrypoint_user + 256, 0xcc
 SYM_CODE_START_NOALIGN(asm_fred_entrypoint_kernel)
 	FRED_ENTER
+
+#ifdef CONFIG_DYNAMIC_STACK
+	/* Extract event type and vector from augmented SS. */
+	movl	(SS + 4)(%rsp), %esi
+	andl	$0x000f00ff, %esi
+
+	/* Check if event type is hardware exception and vector is #PF. */
+	cmpl	$0x0003000e, %esi
+	jne	.Lcheck_for_extint
+
+	call	handle_dynamic_stack_kernel_faults
+	testq	%rax, %rax
+	jz	.Lentrypoint_done
+	cmpq	%rax, %rsp
+	je	.Lskip_stack_switch
+	jmp	.Ldo_stack_switch
+
+.Lcheck_for_extint:
+	/* Check if event type is external interrupt. */
+	andl	$0xf0000, %esi
+	testl	%esi, %esi
+	jne	.Lcall_primary_entry
+	call	switch_to_kstack
+
+.Ldo_stack_switch:
+#ifdef CONFIG_DEBUG_ENTRY
+	/*
+	 * We should only do a stack switch for an external interrupt or a page
+	 * fault in a non-atomic context. These should only ever happen in user
+	 * space or from a regular kernel stack (i.e. CSL == 0).
+	 */
+	movw	(CS + 2)(%rsp), %si
+	testw	$0x3, %si
+	jz	.Lcsl_ok
+	ud2
+.Lcsl_ok:
+#endif
+	movq	%rax, %rsp
+
+	UNWIND_HINT_REGS
+	ENCODE_FRAME_POINTER
+
+	mov	$MSR_IA32_FRED_CONFIG, %ecx
+	rdmsr
+	andl	$~0x3, %eax
+	wrmsr
+
+	movq	%rsp, %rdi
+#endif
+
+.Lskip_stack_switch:
+	movq	%rsp, %rdi
+.Lcall_primary_entry:
 	call	fred_entry_from_kernel
+
+.Lentrypoint_done:
 	FRED_EXIT
 	ERETS
 SYM_CODE_END(asm_fred_entrypoint_kernel)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index ce45882ccd07..fbb042c89d13 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -237,6 +237,42 @@ static inline void native_pgd_clear(pgd_t *pgd)
 #define __swp_entry_to_pte(x)		(__pte((x).val))
 #define __swp_entry_to_pmd(x)		(__pmd((x).val))
 
+#ifdef CONFIG_DYNAMIC_STACK
+
+/*
+ * Skip the present bit. And skip dirty and accessed bits due to
+ * erratum where they can be incorrectly set on non-present ptes.
+ *
+ * Also skip bit 8, which is used for pte_present for PROT_NONE. This
+ * isn't necessary in the strictest sense since PROT_NONE doesn't apply
+ * to kernel PTEs, but it's easier to let pte_present just continue
+ * to work.
+ */
+#define KPTE_AVAILABLE_DATA_BITS 58
+
+static inline pte_t make_data_kpte(unsigned long val)
+{
+	unsigned long low_part, mid_part, high_part;
+
+	low_part = (val & 0xf) << 1;
+	mid_part = (val & 0x10) << 3;
+	high_part = (val & ~0x1f) << 4;
+
+	return __pte(low_part | mid_part | high_part);
+}
+
+static inline unsigned long unpack_data_kpte(pte_t pte)
+{
+	unsigned long val = pte_val(pte), high_part, mid_part, low_part;
+
+	low_part = (val >> 1) & 0xf;
+	mid_part = (val >> 3) & 0x10;
+	high_part = (val >> 4) & ~0x1f;
+
+	return low_part | mid_part | high_part;
+}
+#endif /* CONFIG_DYNAMIC_STACK */
+
 extern void cleanup_highmap(void);
 
 #define HAVE_ARCH_UNMAPPED_AREA
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 3f24cc472ce9..6b55eb91aea6 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -15,6 +15,11 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
 asmlinkage __visible notrace
 struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs);
 asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
+
+#ifdef CONFIG_DYNAMIC_STACK
+asmlinkage __visible noinstr unsigned long switch_to_kstack(struct pt_regs *regs);
+asmlinkage __visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_regs *regs);
+#endif
 #endif
 
 extern int ibt_selftest(void);
diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
index e736b19e18de..01d727420d1f 100644
--- a/arch/x86/kernel/fred.c
+++ b/arch/x86/kernel/fred.c
@@ -9,6 +9,8 @@
 
 /* #DB in the kernel would imply the use of a kernel debugger. */
 #define FRED_DB_STACK_LEVEL		1UL
+#define FRED_PF_STACK_LEVEL		1UL
+#define FRED_INT_STACK_LEVEL		1UL
 #define FRED_NMI_STACK_LEVEL		2UL
 #define FRED_MC_STACK_LEVEL		2UL
 /*
@@ -25,6 +27,11 @@
 DEFINE_PER_CPU(unsigned long, fred_rsp0);
 EXPORT_PER_CPU_SYMBOL(fred_rsp0);
 
+#define FRED_CONFIG_VAL(int_stklvl) \
+	(FRED_CONFIG_REDZONE /* Reserve for CALL emulation */ | \
+	 FRED_CONFIG_INT_STKLVL(int_stklvl) | \
+	 FRED_CONFIG_ENTRYPOINT(asm_fred_entrypoint_user))
+
 void cpu_init_fred_exceptions(void)
 {
 	/* When FRED is enabled by default, remove this log message */
@@ -44,11 +51,7 @@ void cpu_init_fred_exceptions(void)
 	 */
 	loadsegment(ss, __KERNEL_DS);
 
-	wrmsrq(MSR_IA32_FRED_CONFIG,
-	       /* Reserve for CALL emulation */
-	       FRED_CONFIG_REDZONE |
-	       FRED_CONFIG_INT_STKLVL(0) |
-	       FRED_CONFIG_ENTRYPOINT(asm_fred_entrypoint_user));
+	wrmsrq(MSR_IA32_FRED_CONFIG, FRED_CONFIG_VAL(0));
 
 	wrmsrq(MSR_IA32_FRED_STKLVLS, 0);
 
@@ -84,8 +87,15 @@ void cpu_init_fred_rsps(void)
 	       FRED_STKLVL(X86_TRAP_DB,  FRED_DB_STACK_LEVEL) |
 	       FRED_STKLVL(X86_TRAP_NMI, FRED_NMI_STACK_LEVEL) |
 	       FRED_STKLVL(X86_TRAP_MC,  FRED_MC_STACK_LEVEL) |
+#ifdef CONFIG_DYNAMIC_STACK
+	       FRED_STKLVL(X86_TRAP_PF,  FRED_PF_STACK_LEVEL) |
+#endif
 	       FRED_STKLVL(X86_TRAP_DF,  FRED_DF_STACK_LEVEL));
 
+#ifdef CONFIG_DYNAMIC_STACK
+	wrmsrq(MSR_IA32_FRED_CONFIG, FRED_CONFIG_VAL(FRED_INT_STACK_LEVEL));
+#endif
+
 	/* The FRED equivalents to IST stacks... */
 	wrmsrq(MSR_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
 	wrmsrq(MSR_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 2afa7a23340e..5c33c33e93fe 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -306,11 +306,17 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
 	static const char units[] = "BKMGTPE";
 	struct seq_file *m = st->seq;
 
-	new_prot = val & PTE_FLAGS_MASK;
-	if (!val)
+	/* Ignore prot/eff from data kptes. */
+	if (val & _PAGE_PRESENT || addr < address_markers[KERNEL_SPACE_NR].start_address) {
+		new_prot = val & PTE_FLAGS_MASK;
+		if (!val)
+			new_eff = 0;
+		else
+			new_eff = st->prot_levels[level];
+	} else {
+		new_prot = 0;
 		new_eff = 0;
-	else
-		new_eff = st->prot_levels[level];
+	}
 
 	/*
 	 * If we have a "break" in the series, we need to flush the state that
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b83a06739b51..40d518d9f562 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1480,6 +1480,59 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
 	local_irq_disable();
 }
 
+#ifdef CONFIG_DYNAMIC_STACK
+
+static noinstr unsigned long copy_stack_data(struct pt_regs *regs)
+{
+	unsigned long new_sp;
+	unsigned long data_len;
+
+	new_sp = regs->sp - (FRED_CONFIG_REDZONE_AMOUNT << 6);
+	new_sp &= FRED_STACK_FRAME_RSP_MASK;
+	data_len = sizeof(struct fred_frame);
+	new_sp -= data_len;
+
+	memcpy((void *)new_sp, regs, data_len);
+
+	return new_sp;
+}
+
+__visible noinstr unsigned long switch_to_kstack(struct pt_regs *regs)
+{
+	return copy_stack_data(regs);
+}
+
+#define ALIGN_TO_STACK(addr) ((addr) & ~(THREAD_ALIGN - 1))
+
+__visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_regs *regs)
+{
+	unsigned long address;
+	struct task_struct *tsk;
+	bool on_stack;
+
+	address = fred_event_data(regs);
+	if (fault_in_kernel_space(address) && !in_nmi()) {
+		tsk = task_from_stack_address(address);
+
+		if (tsk && dynamic_stack_fault(tsk, address, &on_stack)) {
+			WARN_ON_ONCE(tsk != current &&
+				     ALIGN_TO_STACK(regs->sp) != ALIGN_TO_STACK(address));
+			return 0;
+		}
+	}
+
+	/*
+	 * The regular fault handler won't sleep when executing in an
+	 * atomic context, so we can complete the #PF directly on the
+	 * #PF stack.
+	 */
+	if (in_atomic())
+		return (unsigned long)regs;
+	else
+		return copy_stack_data(regs);
+}
+#endif
+
 DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 {
 	irqentry_state_t state;
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog




* [PATCH v2 13/13] x86: Add support for dynamic kernel stacks via IST
  2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
                   ` (11 preceding siblings ...)
  2026-04-24 19:14 ` [PATCH v2 12/13] x86: Add support for dynamic kernel stacks via FRED David Stevens
@ 2026-04-24 19:14 ` David Stevens
  2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
  13 siblings, 0 replies; 41+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
  To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
  Cc: David Stevens, linux-kernel, linux-mm

On hardware that doesn't support FRED, use ISTs to support dynamic
kernel stacks. As with FRED, any regular #PF gets manually moved back
onto the original stack. We also take the same approach as with FRED to
avoid issues with interrupt re-delivery, and handle external interrupts
on an IST stack.

The fact that IST stacks aren't reentrant means we have to be very
careful to avoid triggering a #PF while the #PF IST is being used. Since
NMIs can trigger #PFs, we have the NMI handler temporarily install a
secondary #PF IST stack if it detects it came from the #PF IST stack, to
avoid clobbering that stack. Note that although iret unmasking of NMIs
can cause us to get a second NMI while an NMI is on the #PF IST stack,
the actual handling of that secondary NMI will be delayed until after
the original NMI (and thus the #PF) is resolved. As such, one extra #PF
IST stack is sufficient to resolve reentrancy issues with respect to
NMIs.
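
A rough model of the stack swap is shown below. It is only a sketch:
the struct, the stack size, and the nmi_enter()/nmi_exit() helpers are
hypothetical, and only the "top minus IST_ENTRY_OFFSET" computation
follows install_nmi_pf_stack() in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define STACK_BYTES	8192u	/* hypothetical IST stack size */
#define ENTRY_OFFSET	8u	/* slot reserved for the reentry canary */

/* Hypothetical per-CPU state: two #PF stacks and the live IST slot. */
struct cpu_model {
	uintptr_t pf_top;
	uintptr_t pf2_top;
	uintptr_t ist_pf;	/* what the CPU would load on the next #PF */
};

static bool on_pf_stack(const struct cpu_model *c, uintptr_t sp)
{
	return sp > c->pf_top - STACK_BYTES && sp <= c->pf_top;
}

static void install_pf_stack(struct cpu_model *c, bool use_secondary)
{
	uintptr_t top = use_secondary ? c->pf2_top : c->pf_top;

	c->ist_pf = top - ENTRY_OFFSET;
}

/* On NMI entry: if the NMI interrupted the #PF stack, switch to PF2. */
static bool nmi_enter(struct cpu_model *c, uintptr_t interrupted_sp)
{
	bool swapped = on_pf_stack(c, interrupted_sp);

	if (swapped)
		install_pf_stack(c, true);
	return swapped;
}

/* On NMI exit: restore the primary #PF stack if it was swapped out. */
static void nmi_exit(struct cpu_model *c, bool swapped)
{
	if (swapped)
		install_pf_stack(c, false);
}
```

Because a nested NMI is deferred until the first one finishes, the model
never needs more than the single secondary stack.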

For #DB exceptions, we make sure that all code that executes on the #PF
IST stack is noinstr. Unfortunately this is not 100% bulletproof, since
the handler needs to access data outside of cpu_entry_area (e.g.
current, current's stack, vmap stack page tables), and the user could
have set hardware breakpoints on accesses to those addresses. Rather
than handle this edge case that should only occur during manual
debugging, we just detect reentrancy on the #PF IST and abort.
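
The reentrancy check itself is a one-word canary kept in the slot
reserved by IST_ENTRY_OFFSET. A minimal C model of the check-and-set
sequence in idtentry_body follows; the struct and function names are
hypothetical, and the real code executes ud2 where this sketch returns
false.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the word reserved at the top of the IST stack. */
struct ist_stack_model {
	uint64_t canary;	/* 0 = stack free, 1 = handler running */
};

/*
 * Entry: a non-zero canary means the stack is already in use, so the
 * frame that was just pushed has clobbered live data. Report failure
 * rather than continuing.
 */
static bool ist_try_enter(struct ist_stack_model *s)
{
	if (s->canary != 0)
		return false;
	s->canary = 1;
	return true;
}

/* Exit: clear the canary once the kernel reentry function returns. */
static void ist_leave(struct ist_stack_model *s)
{
	s->canary = 0;
}
```

The canary is cleared before the stack switch or return, so a later
fault on the same IST stack starts from a clean state.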

It is possible for #MCE to occur on the #PF IST stack, but the #MCE
handler shouldn't generate new #PFs. The reentrancy check on the #PF
stack will trigger if any recoverable #MCEs do generate #PFs - if there
are actually reports of it happening, we can address it then.

Bouncing all #PF and external interrupts through IST stacks adds some
overhead. However, such events from userspace already had to bounce
through the CPU entry stack, so introducing ISTs only adds notable
overhead for #PFs and external interrupts that occur while in CPL 0.

Signed-off-by: David Stevens <stevensd@google.com>
---
 arch/x86/Kconfig                      |  1 +
 arch/x86/entry/entry_64.S             | 49 +++++++++++++++++--
 arch/x86/include/asm/cpu_entry_area.h | 18 +++++++
 arch/x86/include/asm/idtentry.h       | 38 ++++++++++++++-
 arch/x86/include/asm/page_64_types.h  | 10 +++-
 arch/x86/include/asm/processor.h      |  6 +++
 arch/x86/kernel/cpu/common.c          | 11 +++++
 arch/x86/kernel/dumpstack_64.c        | 10 +++-
 arch/x86/kernel/idt.c                 | 57 +++++++++++++---------
 arch/x86/kernel/nmi.c                 |  9 ++++
 arch/x86/lib/usercopy.c               |  9 ++++
 arch/x86/mm/cpu_entry_area.c          | 17 +++++++
 arch/x86/mm/fault.c                   | 70 ++++++++++++++++++++++-----
 13 files changed, 262 insertions(+), 43 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e2df1b147184..182fda721b0d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -212,6 +212,7 @@ config X86
 	select HAVE_ARCH_USERFAULTFD_WP         if X86_64 && USERFAULTFD
 	select HAVE_ARCH_USERFAULTFD_MINOR	if X86_64 && USERFAULTFD
 	select HAVE_ARCH_VMAP_STACK		if X86_64
+	select HAVE_ARCH_DYNAMIC_STACK		if X86_64 && !XEN_PV
 	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
 	select HAVE_ASM_MODVERSIONS
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 42447b1e1dff..02dbd00cc4bb 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -286,7 +286,7 @@ SYM_CODE_END(xen_error_entry)
  * @cfunc:		C function to be called
  * @has_error_code:	Hardware pushed error code on stack
  */
-.macro idtentry_body cfunc has_error_code:req
+.macro idtentry_body cfunc has_error_code:req kernel_reentry_fn=
 
 	/*
 	 * Call error_entry() and switch to the task stack if from userspace.
@@ -302,6 +302,38 @@ SYM_CODE_END(xen_error_entry)
 	ENCODE_FRAME_POINTER
 	UNWIND_HINT_REGS
 
+#ifdef CONFIG_DYNAMIC_STACK
+.ifnb \kernel_reentry_fn
+	/*
+	 * For entry from userspace, we've also already moved off of
+	 * the IST after calling error_entry above.
+	 */
+	testb	$3, CS(%rsp)
+	jnz	.Lregular_fault_\cfunc
+
+	/* Check and set the reentry canary reserved by IST_ENTRY_OFFSET. */
+	cmpq	$0, (SS + 8)(%rsp)
+	jne	.List_reentry_abort_\cfunc
+	movq	$1, (SS + 8)(%rsp)
+
+	movq	%rsp, %rdi
+	call	\kernel_reentry_fn
+
+	movq	$0, (SS + 8)(%rsp)
+
+	testq	%rax, %rax
+	jnz	.Lchange_stack_\cfunc
+	jmp	error_return
+
+.Lchange_stack_\cfunc:
+	movq	%rax, %rsp
+
+	ENCODE_FRAME_POINTER
+	UNWIND_HINT_REGS
+.Lregular_fault_\cfunc:
+.endif
+#endif
+
 	movq	%rsp, %rdi			/* pt_regs pointer into 1st argument*/
 
 	.if \has_error_code == 1
@@ -314,6 +346,13 @@ SYM_CODE_END(xen_error_entry)
 	call	\cfunc
 
 	jmp	error_return
+
+#ifdef CONFIG_DYNAMIC_STACK
+.ifnb \kernel_reentry_fn
+.List_reentry_abort_\cfunc:
+	ud2
+.endif
+#endif
 .endm
 
 /**
@@ -322,11 +361,13 @@ SYM_CODE_END(xen_error_entry)
  * @asmsym:		ASM symbol for the entry point
  * @cfunc:		C function to be called
  * @has_error_code:	Hardware pushed error code on stack
+ * @kernel_reentry_fn:  If set, C function to be called on re-entry from
+ *			kernel space before the main handler is invoked.
  *
  * The macro emits code to set up the kernel context for straight forward
  * and simple IDT entries. No IST stack, no paranoid entry checks.
  */
-.macro idtentry vector asmsym cfunc has_error_code:req
+.macro idtentry vector asmsym cfunc has_error_code:req kernel_reentry_fn=
 SYM_CODE_START(\asmsym)
 
 	.if \vector == X86_TRAP_BP
@@ -358,7 +399,7 @@ SYM_CODE_START(\asmsym)
 .Lfrom_usermode_no_gap_\@:
 	.endif
 
-	idtentry_body \cfunc \has_error_code
+	idtentry_body \cfunc \has_error_code \kernel_reentry_fn
 
 _ASM_NOKPROBE(\asmsym)
 SYM_CODE_END(\asmsym)
@@ -375,7 +416,7 @@ SYM_CODE_END(\asmsym)
  */
 .macro idtentry_irq vector cfunc
 	.p2align CONFIG_X86_L1_CACHE_SHIFT
-	idtentry \vector asm_\cfunc \cfunc has_error_code=1
+	idtentry \vector asm_\cfunc \cfunc has_error_code=1 kernel_reentry_fn=switch_to_kstack
 .endm
 
 /**
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 462fc34f1317..5bce3259edee 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -26,6 +26,12 @@
 	char	DB_stack[EXCEPTION_STKSZ];			\
 	char	MCE_stack_guard[guardsize];			\
 	char	MCE_stack[EXCEPTION_STKSZ];			\
+	char	PF_stack_guard[guardsize];			\
+	char	PF_stack[EXCEPTION_STKSZ];			\
+	char	PF2_stack_guard[guardsize];			\
+	char	PF2_stack[EXCEPTION_STKSZ];			\
+	char	UDI_stack_guard[guardsize];			\
+	char	UDI_stack[EXCEPTION_STKSZ];			\
 	char	VC_stack_guard[guardsize];			\
 	char	VC_stack[optional_stack_size];			\
 	char	VC2_stack_guard[guardsize];			\
@@ -50,6 +56,9 @@ enum exception_stack_ordering {
 	ESTACK_NMI,
 	ESTACK_DB,
 	ESTACK_MCE,
+	ESTACK_PF,
+	ESTACK_PF2,
+	ESTACK_UDI,
 	ESTACK_VC,
 	ESTACK_VC2,
 	N_EXCEPTION_STACKS
@@ -144,6 +153,15 @@ static __always_inline struct entry_stack *cpu_entry_stack(int cpu)
 	return &get_cpu_entry_area(cpu)->entry_stack_page.stack;
 }
 
+#ifdef CONFIG_DYNAMIC_STACK
+bool is_pf_ist_stack(unsigned long addr);
+#else
+static inline bool is_pf_ist_stack(unsigned long addr)
+{
+	return false;
+}
+#endif
+
 #define __this_cpu_ist_top_va(name)					\
 	CEA_ESTACK_TOP(__this_cpu_read(cea_exception_stacks), name)
 
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 42bf6a58ec36..d8c846d28a1d 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -163,6 +163,16 @@ noinstr void fred_##func(struct pt_regs *regs)
 #define DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func)			\
 	DECLARE_IDTENTRY_ERRORCODE(vector, func)
 
+/**
+ * DECLARE_IDTENTRY_PF - Declare functions for page fault entry point
+ * @vector:	Vector number (ignored for C)
+ * @func:	Function name of the entry point
+ *
+ * Maps to DECLARE_IDTENTRY_RAW_ERRORCODE().
+ */
+#define DECLARE_IDTENTRY_PF(vector, func)			\
+	DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func)
+
 /**
  * DEFINE_IDTENTRY_RAW_ERRORCODE - Emit code for raw IDT entry points
  * @func:	Function name of the entry point
@@ -391,6 +401,15 @@ static __always_inline void __##func(struct pt_regs *regs)
 #define DEFINE_IDTENTRY_DF(func)					\
 	DEFINE_IDTENTRY_RAW_ERRORCODE(func)
 
+/**
+ * DEFINE_IDTENTRY_PF - Emit code for page fault
+ * @func:	Function name of the entry point
+ *
+ * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE
+ */
+#define DEFINE_IDTENTRY_PF(func)					\
+	DEFINE_IDTENTRY_RAW_ERRORCODE(func)
+
 /**
  * DEFINE_IDTENTRY_VC_KERNEL - Emit code for VMM communication handler
  *			       when raised from kernel mode
@@ -480,6 +499,15 @@ void fred_install_sysvec(unsigned int vector, const idtentry_t function);
 #define DECLARE_IDTENTRY_ERRORCODE(vector, func)			\
 	idtentry vector asm_##func func has_error_code=1
 
+#ifdef CONFIG_DYNAMIC_STACK
+#define DECLARE_IDTENTRY_PF(vector, func)				\
+	idtentry vector asm_##func func has_error_code=1		\
+	kernel_reentry_fn=handle_dynamic_stack_kernel_faults
+#else
+#define DECLARE_IDTENTRY_PF(vector, func)				\
+	DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func)
+#endif
+
 /* Special case for 32bit IRET 'trap'. Do not emit ASM code */
 #define DECLARE_IDTENTRY_SW(vector, func)
 
@@ -494,8 +522,14 @@ void fred_install_sysvec(unsigned int vector, const idtentry_t function);
 	idtentry_irq vector func
 
 /* System vector entries */
+#ifdef CONFIG_DYNAMIC_STACK
+#define DECLARE_IDTENTRY_SYSVEC(vector, func)				\
+	idtentry vector asm_##func func has_error_code=0		\
+	kernel_reentry_fn=switch_to_kstack
+#else
 #define DECLARE_IDTENTRY_SYSVEC(vector, func)				\
 	DECLARE_IDTENTRY(vector, func)
+#endif
 
 #ifdef CONFIG_X86_64
 # define DECLARE_IDTENTRY_MCE(vector, func)				\
@@ -615,7 +649,7 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD,		exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
-DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_PF,	exc_page_fault);
+DECLARE_IDTENTRY_PF(X86_TRAP_PF,		exc_page_fault);
 
 #if defined(CONFIG_IA32_EMULATION)
 DECLARE_IDTENTRY_RAW(IA32_SYSCALL_VECTOR,	int80_emulation);
@@ -699,7 +733,7 @@ DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR,	sysvec_x86_platform_ipi);
 #endif
 
 #ifdef CONFIG_SMP
-DECLARE_IDTENTRY(RESCHEDULE_VECTOR,			sysvec_reschedule_ipi);
+DECLARE_IDTENTRY_SYSVEC(RESCHEDULE_VECTOR,		sysvec_reschedule_ipi);
 DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR,			sysvec_reboot);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR,	sysvec_call_function_single);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_VECTOR,		sysvec_call_function);
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 7400dab373fe..b0b60f83a531 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -28,7 +28,15 @@
 #define	IST_INDEX_NMI		1
 #define	IST_INDEX_DB		2
 #define	IST_INDEX_MCE		3
-#define	IST_INDEX_VC		4
+#define	IST_INDEX_PF		4
+#define	IST_INDEX_UDI		5
+#define	IST_INDEX_VC		6
+
+/*
+ * Offset used for some IST stacks to reserve a slot for re-entry
+ * canary. At the very top of the stack for cache friendliness.
+ */
+#define IST_ENTRY_OFFSET	8
 
 /*
  * Set __PAGE_OFFSET to the most negative possible address +
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a24c7805acdb..fa790731dea0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -573,6 +573,12 @@ static inline void load_sp0(unsigned long sp0)
 
 #endif /* CONFIG_PARAVIRT_XXL */
 
+#ifdef CONFIG_DYNAMIC_STACK
+void install_nmi_pf_stack(bool use_nmi_pf_stack);
+#else
+static inline void install_nmi_pf_stack(bool use_nmi_pf_stack) {}
+#endif
+
 unsigned long __get_wchan(struct task_struct *p);
 
 extern void select_idle_routine(void);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ec0670114efa..d90a01e2fdd2 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2377,6 +2377,8 @@ static inline void tss_setup_ist(struct tss_struct *tss)
 	tss->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(NMI);
 	tss->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(DB);
 	tss->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(MCE);
+	tss->x86_tss.ist[IST_INDEX_PF] = __this_cpu_ist_top_va(PF) - IST_ENTRY_OFFSET;
+	tss->x86_tss.ist[IST_INDEX_UDI] = __this_cpu_ist_top_va(UDI) - IST_ENTRY_OFFSET;
 	/* Only mapped when SEV-ES is active */
 	tss->x86_tss.ist[IST_INDEX_VC] = __this_cpu_ist_top_va(VC);
 }
@@ -2665,3 +2667,12 @@ void __init arch_cpu_finalize_init(void)
 	 */
 	mem_encrypt_init();
 }
+
+#ifdef CONFIG_DYNAMIC_STACK
+noinstr void install_nmi_pf_stack(bool use_nmi_pf_stack)
+{
+	unsigned long stack = use_nmi_pf_stack ? __this_cpu_ist_top_va(PF2)
+					       : __this_cpu_ist_top_va(PF);
+	this_cpu_write(cpu_tss_rw.x86_tss.ist[IST_INDEX_PF], stack - IST_ENTRY_OFFSET);
+}
+#endif
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 6c5defd6569a..6784d31d3eb3 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -24,13 +24,16 @@ static const char * const exception_stack_names[] = {
 		[ ESTACK_NMI	]	= "NMI",
 		[ ESTACK_DB	]	= "#DB",
 		[ ESTACK_MCE	]	= "#MC",
+		[ ESTACK_PF	]	= "#PF",
+		[ ESTACK_PF2	]	= "#PF2",
+		[ ESTACK_UDI	]	= "#UDI",
 		[ ESTACK_VC	]	= "#VC",
 		[ ESTACK_VC2	]	= "#VC2",
 };
 
 const char *stack_type_name(enum stack_type type)
 {
-	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+	BUILD_BUG_ON(N_EXCEPTION_STACKS != 9);
 
 	if (type == STACK_TYPE_TASK)
 		return "TASK";
@@ -87,6 +90,9 @@ struct estack_pages estack_pages[CEA_ESTACK_PAGES] ____cacheline_aligned = {
 	EPAGERANGE(NMI),
 	EPAGERANGE(DB),
 	EPAGERANGE(MCE),
+	EPAGERANGE(PF),
+	EPAGERANGE(PF2),
+	EPAGERANGE(UDI),
 	EPAGERANGE(VC),
 	EPAGERANGE(VC2),
 };
@@ -98,7 +104,7 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
 	struct pt_regs *regs;
 	unsigned int k;
 
-	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+	BUILD_BUG_ON(N_EXCEPTION_STACKS != 9);
 
 	begin = (unsigned long)__this_cpu_read(cea_exception_stacks);
 	/*
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 260456588756..7626fa7adfb3 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -116,6 +116,10 @@ static const __initconst struct idt_data def_idts[] = {
 	ISTG(X86_TRAP_VC,		asm_exc_vmm_communication, IST_INDEX_VC),
 #endif
 
+#ifdef CONFIG_DYNAMIC_STACK
+	ISTG(X86_TRAP_PF,		asm_exc_page_fault, IST_INDEX_PF),
+#endif
+
 	SYSG(X86_TRAP_OF,		asm_exc_overflow),
 };
 
@@ -127,47 +131,55 @@ static const struct idt_data ia32_idt[] __initconst = {
 #endif
 };
 
+#ifdef CONFIG_DYNAMIC_STACK
+#define EXTERNAL_INTR(_vector, _addr)	ISTG(_vector, _addr, IST_INDEX_UDI)
+#define EXTERNAL_INTR_IST_VALUE		(IST_INDEX_UDI + 1)
+#else
+#define EXTERNAL_INTR(_vector, _addr)	INTG(_vector, _addr)
+#define EXTERNAL_INTR_IST_VALUE		0
+#endif
+
 /*
  * The APIC and SMP idt entries
  */
 static const __initconst struct idt_data apic_idts[] = {
 #ifdef CONFIG_SMP
-	INTG(RESCHEDULE_VECTOR,			asm_sysvec_reschedule_ipi),
-	INTG(CALL_FUNCTION_VECTOR,		asm_sysvec_call_function),
-	INTG(CALL_FUNCTION_SINGLE_VECTOR,	asm_sysvec_call_function_single),
-	INTG(REBOOT_VECTOR,			asm_sysvec_reboot),
+	EXTERNAL_INTR(RESCHEDULE_VECTOR,		asm_sysvec_reschedule_ipi),
+	EXTERNAL_INTR(CALL_FUNCTION_VECTOR,		asm_sysvec_call_function),
+	EXTERNAL_INTR(CALL_FUNCTION_SINGLE_VECTOR,	asm_sysvec_call_function_single),
+	EXTERNAL_INTR(REBOOT_VECTOR,			asm_sysvec_reboot),
 #endif
 
 #ifdef CONFIG_X86_THERMAL_VECTOR
-	INTG(THERMAL_APIC_VECTOR,		asm_sysvec_thermal),
+	EXTERNAL_INTR(THERMAL_APIC_VECTOR,		asm_sysvec_thermal),
 #endif
 
 #ifdef CONFIG_X86_MCE_THRESHOLD
-	INTG(THRESHOLD_APIC_VECTOR,		asm_sysvec_threshold),
+	EXTERNAL_INTR(THRESHOLD_APIC_VECTOR,		asm_sysvec_threshold),
 #endif
 
 #ifdef CONFIG_X86_MCE_AMD
-	INTG(DEFERRED_ERROR_VECTOR,		asm_sysvec_deferred_error),
+	EXTERNAL_INTR(DEFERRED_ERROR_VECTOR,		asm_sysvec_deferred_error),
 #endif
 
 #ifdef CONFIG_X86_LOCAL_APIC
-	INTG(LOCAL_TIMER_VECTOR,		asm_sysvec_apic_timer_interrupt),
-	INTG(X86_PLATFORM_IPI_VECTOR,		asm_sysvec_x86_platform_ipi),
+	EXTERNAL_INTR(LOCAL_TIMER_VECTOR,		asm_sysvec_apic_timer_interrupt),
+	EXTERNAL_INTR(X86_PLATFORM_IPI_VECTOR,		asm_sysvec_x86_platform_ipi),
 # if IS_ENABLED(CONFIG_KVM)
-	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
-	INTG(POSTED_INTR_WAKEUP_VECTOR,		asm_sysvec_kvm_posted_intr_wakeup_ipi),
-	INTG(POSTED_INTR_NESTED_VECTOR,		asm_sysvec_kvm_posted_intr_nested_ipi),
+	EXTERNAL_INTR(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
+	EXTERNAL_INTR(POSTED_INTR_WAKEUP_VECTOR,	asm_sysvec_kvm_posted_intr_wakeup_ipi),
+	EXTERNAL_INTR(POSTED_INTR_NESTED_VECTOR,	asm_sysvec_kvm_posted_intr_nested_ipi),
 # endif
 #ifdef CONFIG_GUEST_PERF_EVENTS
 	INTG(PERF_GUEST_MEDIATED_PMI_VECTOR,	asm_sysvec_perf_guest_mediated_pmi_handler),
 #endif
 # ifdef CONFIG_IRQ_WORK
-	INTG(IRQ_WORK_VECTOR,			asm_sysvec_irq_work),
+	EXTERNAL_INTR(IRQ_WORK_VECTOR,			asm_sysvec_irq_work),
 # endif
-	INTG(SPURIOUS_APIC_VECTOR,		asm_sysvec_spurious_apic_interrupt),
-	INTG(ERROR_APIC_VECTOR,			asm_sysvec_error_interrupt),
+	EXTERNAL_INTR(SPURIOUS_APIC_VECTOR,		asm_sysvec_spurious_apic_interrupt),
+	EXTERNAL_INTR(ERROR_APIC_VECTOR,		asm_sysvec_error_interrupt),
 # ifdef CONFIG_X86_POSTED_MSI
-	INTG(POSTED_MSI_NOTIFICATION_VECTOR,	asm_sysvec_posted_msi_notification),
+	EXTERNAL_INTR(POSTED_MSI_NOTIFICATION_VECTOR,	asm_sysvec_posted_msi_notification),
 # endif
 #endif
 };
@@ -206,11 +218,12 @@ idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sy
 	}
 }
 
-static __init void set_intr_gate(unsigned int n, const void *addr)
+static __init void set_intr_gate(unsigned int n, const void *addr, int ist)
 {
 	struct idt_data data;
 
 	init_idt_data(&data, n, addr);
+	data.bits.ist = ist;
 
 	idt_setup_from_table(idt_table, &data, 1, false);
 }
@@ -293,7 +306,7 @@ void __init idt_setup_apic_and_irq_gates(void)
 
 	for_each_clear_bit_from(i, system_vectors, FIRST_SYSTEM_VECTOR) {
 		entry = irq_entries_start + IDT_ALIGN * (i - FIRST_EXTERNAL_VECTOR);
-		set_intr_gate(i, entry);
+		set_intr_gate(i, entry, EXTERNAL_INTR_IST_VALUE);
 	}
 
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -304,7 +317,7 @@ void __init idt_setup_apic_and_irq_gates(void)
 		 * /proc/interrupts.
 		 */
 		entry = spurious_entries_start + IDT_ALIGN * (i - FIRST_SYSTEM_VECTOR);
-		set_intr_gate(i, entry);
+		set_intr_gate(i, entry, EXTERNAL_INTR_IST_VALUE);
 	}
 #endif
 	/* Map IDT into CPU entry area and reload it. */
@@ -325,10 +338,10 @@ void __init idt_setup_early_handler(void)
 	int i;
 
 	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
-		set_intr_gate(i, early_idt_handler_array[i]);
+		set_intr_gate(i, early_idt_handler_array[i], DEFAULT_STACK);
 #ifdef CONFIG_X86_32
 	for ( ; i < NR_VECTORS; i++)
-		set_intr_gate(i, early_ignore_irq);
+		set_intr_gate(i, early_ignore_irq, DEFAULT_STACK);
 #endif
 	load_idt(&idt_descr);
 }
@@ -352,5 +365,5 @@ void __init idt_install_sysvec(unsigned int n, const void *function)
 		return;
 
 	if (!WARN_ON(test_and_set_bit(n, system_vectors)))
-		set_intr_gate(n, function);
+		set_intr_gate(n, function, EXTERNAL_INTR_IST_VALUE);
 }
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 3d239ed12744..a2444b9d5b71 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -37,6 +37,7 @@
 #include <asm/microcode.h>
 #include <asm/sev.h>
 #include <asm/fred.h>
+#include <asm/cpu_entry_area.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/nmi.h>
@@ -581,6 +582,11 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 	if (IS_ENABLED(CONFIG_NMI_CHECK_CPU) && ignore_nmis) {
 		WRITE_ONCE(nsp->idt_ignored, nsp->idt_ignored + 1);
 	} else if (!ignore_nmis) {
+		bool protect_pf_ist_stack = is_pf_ist_stack(regs->sp);
+
+		if (protect_pf_ist_stack)
+			install_nmi_pf_stack(true);
+
 		if (IS_ENABLED(CONFIG_NMI_CHECK_CPU)) {
 			WRITE_ONCE(nsp->idt_nmi_seq, nsp->idt_nmi_seq + 1);
 			WARN_ON_ONCE(!(nsp->idt_nmi_seq & 0x1));
@@ -590,6 +596,9 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 			WRITE_ONCE(nsp->idt_nmi_seq, nsp->idt_nmi_seq + 1);
 			WARN_ON_ONCE(nsp->idt_nmi_seq & 0x1);
 		}
+
+		if (protect_pf_ist_stack)
+			install_nmi_pf_stack(false);
 	}
 
 	irqentry_nmi_exit(regs, irq_state);
diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c
index 24b48af27417..75b9f851f428 100644
--- a/arch/x86/lib/usercopy.c
+++ b/arch/x86/lib/usercopy.c
@@ -9,6 +9,7 @@
 #include <linux/instrumented.h>
 
 #include <asm/tlbflush.h>
+#include <asm/cpu_entry_area.h>
 
 /**
  * copy_from_user_nmi - NMI safe copy from user
@@ -39,6 +40,14 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
 	if (!nmi_uaccess_okay())
 		return n;
 
+	/*
+	 * IST stacks aren't reentrant, so bail before the possibility of
+	 * a #PF. While on the #PF IST stack, we should only need this
+	 * function for stack dumps (WARN/panic/etc).
+	 */
+	if (is_pf_ist_stack(current_stack_pointer))
+		return n;
+
 	/*
 	 * Even though this function is typically called from NMI/IRQ context
 	 * disable pagefaults so that its behaviour is consistent even when
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 575f863f3c75..97ac91c109ed 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -156,6 +156,12 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
 	cea_map_stack(DB);
 	cea_map_stack(MCE);
 
+	if (IS_ENABLED(CONFIG_DYNAMIC_STACK)) {
+		cea_map_stack(PF);
+		cea_map_stack(PF2);
+		cea_map_stack(UDI);
+	}
+
 	if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
 		if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
 			cea_map_stack(VC);
@@ -173,6 +179,17 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
 }
 #endif
 
+#ifdef CONFIG_DYNAMIC_STACK
+bool noinstr is_pf_ist_stack(unsigned long addr)
+{
+	struct cea_exception_stacks *cs = __this_cpu_read(cea_exception_stacks);
+	unsigned long top = CEA_ESTACK_TOP(cs, PF2);
+	unsigned long bot = CEA_ESTACK_BOT(cs, PF);
+
+	return addr >= bot && addr < top;
+}
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static void __init setup_cpu_entry_area(unsigned int cpu)
 {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 40d518d9f562..48ef50982c06 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1482,16 +1482,61 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
 
 #ifdef CONFIG_DYNAMIC_STACK
 
-static noinstr unsigned long copy_stack_data(struct pt_regs *regs)
+static noinstr unsigned long copy_stack_data(struct pt_regs *regs, bool is_dynamic_stack_fault)
 {
 	unsigned long new_sp;
 	unsigned long data_len;
+	bool must_avoid_dynamic_stack_fault;
 
-	new_sp = regs->sp - (FRED_CONFIG_REDZONE_AMOUNT << 6);
-	new_sp &= FRED_STACK_FRAME_RSP_MASK;
-	data_len = sizeof(struct fred_frame);
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		new_sp = regs->sp - (FRED_CONFIG_REDZONE_AMOUNT << 6);
+		new_sp &= FRED_STACK_FRAME_RSP_MASK;
+		data_len = sizeof(struct fred_frame);
+		must_avoid_dynamic_stack_fault = false;
+	} else {
+		// Hardware aligns sp to a 16 byte boundary when going through the IDT.
+		new_sp = ALIGN_DOWN(regs->sp, 16);
+		data_len = sizeof(struct pt_regs);
+		must_avoid_dynamic_stack_fault = is_dynamic_stack_fault;
+	}
 	new_sp -= data_len;
 
+	if (must_avoid_dynamic_stack_fault) {
+		bool new_sp_on_stack;
+
+		/*
+		 * We don't have to worry about the window where current_task
+		 * is inconsistent during a context switch because interrupts
+		 * are disabled during that window and the only #PF that can
+		 * happen there is a dynamic stack fault, in which case we
+		 * return directly from handle_dynamic_stack_kernel_faults().
+		 */
+		if (!in_nmi())
+			dynamic_stack_fault(current, new_sp, &new_sp_on_stack);
+		else
+			new_sp_on_stack = false;
+
+		/*
+		 * If new_sp isn't on the current task's stack, verify that it's
+		 * on an exception/irq/entry stack. This is a little expensive,
+		 * but #PFs in those contexts should be rare.
+		 */
+		if (!new_sp_on_stack) {
+			struct stack_info info, info2;
+
+			if (!get_stack_info_noinstr((void *)new_sp, current, &info)) {
+				instrumentation_begin();
+				if (get_stack_info_noinstr((void *)(new_sp - PAGE_SIZE),
+							   current, &info2)) {
+					pr_emerg("Stack overflow during stack switch\n");
+					handle_stack_overflow(regs, new_sp, &info2);
+				} else {
+					die("Stack switch back to unknown stack", regs, 0);
+				}
+			}
+		}
+	}
+
 	memcpy((void *)new_sp, regs, data_len);
 
 	return new_sp;
@@ -1499,7 +1544,7 @@ static noinstr unsigned long copy_stack_data(struct pt_regs *regs)
 
 __visible noinstr unsigned long switch_to_kstack(struct pt_regs *regs)
 {
-	return copy_stack_data(regs);
+	return copy_stack_data(regs, false);
 }
 
 #define ALIGN_TO_STACK(addr) ((addr) & ~(THREAD_ALIGN - 1))
@@ -1510,7 +1555,7 @@ __visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_reg
 	struct task_struct *tsk;
 	bool on_stack;
 
-	address = fred_event_data(regs);
+	address = cpu_feature_enabled(X86_FEATURE_FRED) ? fred_event_data(regs) : read_cr2();
 	if (fault_in_kernel_space(address) && !in_nmi()) {
 		tsk = task_from_stack_address(address);
 
@@ -1522,18 +1567,19 @@ __visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_reg
 	}
 
 	/*
-	 * The regular fault handler won't sleep when executing in an
-	 * atomic context, so we can complete the #PF directly on the
-	 * #PF stack.
+	 * The regular fault handler won't sleep when executing in an atomic
+	 * context, so we can complete the #PF directly on the #PF stack.
+	 * However, IST doesn't support nested exceptions, so we need to avoid
+	 * running any non-noinstr code on the IST #PF stack.
 	 */
-	if (in_atomic())
+	if (in_atomic() && cpu_feature_enabled(X86_FEATURE_FRED))
 		return (unsigned long)regs;
 	else
-		return copy_stack_data(regs);
+		return copy_stack_data(regs, true);
 }
 #endif
 
-DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
+DEFINE_IDTENTRY_PF(exc_page_fault)
 {
 	irqentry_state_t state;
 	unsigned long address;
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog
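
The address arithmetic in the non-FRED branch of copy_stack_data() and the
bounds test in is_pf_ist_stack() above can be sketched in userspace as
follows. This is only an illustration, not code from the patch: the
sketch_* names are hypothetical, and SKETCH_PT_REGS_SIZE is an assumed
stand-in for sizeof(struct pt_regs) on x86_64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for sizeof(struct pt_regs) on x86_64 (21 eight-byte slots). */
#define SKETCH_PT_REGS_SIZE 168u

/* Round x down to a multiple of a, where a must be a power of two. */
static uintptr_t sketch_align_down(uintptr_t x, uintptr_t a)
{
	return x & ~(a - 1);
}

/* Non-FRED branch: align the interrupted sp down, then reserve room for the copy. */
static uintptr_t sketch_new_sp(uintptr_t sp)
{
	return sketch_align_down(sp, 16) - SKETCH_PT_REGS_SIZE;
}

/* Half-open [bot, top) test, the same shape as the check in is_pf_ist_stack(). */
static bool sketch_in_range(uintptr_t addr, uintptr_t bot, uintptr_t top)
{
	return addr >= bot && addr < top;
}
```

The range is half-open: an address equal to the bottom of the PF stack
counts as on the stack, while an address equal to the top of the PF2 stack
does not.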



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
                   ` (12 preceding siblings ...)
  2026-04-24 19:14 ` [PATCH v2 13/13] x86: Add support for dynamic kernel stacks via IST David Stevens
@ 2026-04-24 19:41 ` Dave Hansen
  2026-04-24 21:35   ` Pasha Tatashin
  2026-04-25  9:19   ` H. Peter Anvin
  13 siblings, 2 replies; 41+ messages in thread
From: Dave Hansen @ 2026-04-24 19:41 UTC (permalink / raw)
  To: David Stevens, Pasha Tatashin, Linus Walleij, Will Deacon,
	Quentin Perret, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
  Cc: linux-kernel, linux-mm

On 4/24/26 12:14, David Stevens wrote:
> The question is then: is this approach something that is fundamentally
> untenable in the kernel

Yes. Fundamentally untenable.

Not allowing stack faults has been a wonderful simplification. It's one
of those things that just plain makes the kernel easier to maintain.
Saving low single digits of system memory is not exactly making me eager
to go back to the harder-to-maintain days.

I seriously doubt that this 1% is the lowest hanging fruit for memory
bloat on these systems. ;)


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
@ 2026-04-24 21:35   ` Pasha Tatashin
  2026-04-24 22:21     ` Dave Hansen
  2026-04-24 22:26     ` David Laight
  2026-04-25  9:19   ` H. Peter Anvin
  1 sibling, 2 replies; 41+ messages in thread
From: Pasha Tatashin @ 2026-04-24 21:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Stevens, Pasha Tatashin, Linus Walleij, Will Deacon,
	Quentin Perret, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On 04-24 12:41, Dave Hansen wrote:
> On 4/24/26 12:14, David Stevens wrote:
> > The question is then: is this approach something that is fundamentally
> > untenable in the kernel
> 
> Yes. Fundamentally untenable.
> 
> Not allowing stack faults has been a wonderful simplification. It's one
> of those things that just plain makes the kernel easier to maintain.
> Saving low single digits of system memory is not exactly making me eager
> to go back to the harder-to-maintain days.
> 
> I seriously doubt that this 1% is the lowest hanging fruit for memory
> bloat on these systems. ;)

This is true until, in a fleet of millions of machines, you encounter a 
one-in-a-billion chance of a stack overflow. You are then forced to 
double the statically allocated kernel stacks on every machine, paying a 
memory tax even though 99.999..% of threads never exceed 4K. This 
overhead accumulates to petabytes of wasted capacity.

Pasha
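
For a rough sense of scale, the trade-off described above can be sketched
as simple arithmetic. This is a hedged illustration only: the sketch_*
helpers are hypothetical, the 16 KiB stack size is an assumption about a
typical x86_64 configuration, and the number of threads that grow beyond
the base size is made up for the example rather than measured.

```c
#include <assert.h>
#include <stdint.h>

/* Memory consumed when every thread gets a fully populated stack. */
static uint64_t sketch_static_bytes(uint64_t threads, uint64_t stack_size)
{
	return threads * stack_size;
}

/*
 * Memory consumed when only deep_threads threads ever populate the full
 * stack and all remaining threads stay at base_size.
 */
static uint64_t sketch_dynamic_bytes(uint64_t threads, uint64_t deep_threads,
				     uint64_t base_size, uint64_t stack_size)
{
	return (threads - deep_threads) * base_size + deep_threads * stack_size;
}
```

With the 3000 threads mentioned in the cover letter and an assumed 16 KiB
stack, sketch_static_bytes(3000, 16384) gives 49,152,000 bytes (about
47 MiB). If, hypothetically, only 30 of those threads ever grew past 4 KiB,
sketch_dynamic_bytes(3000, 30, 4096, 16384) gives 12,656,640 bytes. Real
savings depend on how many pages each stack actually touches.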


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-04-24 21:35   ` Pasha Tatashin
@ 2026-04-24 22:21     ` Dave Hansen
  2026-04-24 22:49       ` David Stevens
  2026-04-24 22:26     ` David Laight
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2026-04-24 22:21 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: David Stevens, Linus Walleij, Will Deacon, Quentin Perret,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On 4/24/26 14:35, Pasha Tatashin wrote:
> On 04-24 12:41, Dave Hansen wrote:
>> On 4/24/26 12:14, David Stevens wrote:
>>> The question is then: is this approach something that is fundamentally
>>> untenable in the kernel
>> Yes. Fundamentally untenable.
>>
>> Not allowing stack faults has been a wonderful simplification. It's one
>> of those things that just plain makes the kernel easier to maintain.
>> Saving low single digits of system memory is not exactly making me eager
>> to go back to the harder-to-maintain days.
>>
>> I seriously doubt that this 1% is the lowest hanging fruit for memory
>> bloat on these systems. 😉
> This true until, in a fleet of millions of machines, you encounter a 
> one-in-a-billion chance of a stack overflow. You are then forced to 
> double the statically allocated kernel stacks on every machine, paying a 
> memory tax even though 99.999..% of threads never exceed 4K. This 
> overhead accumulates to petabytes of wasted capacity.

I don't disagree with you. But, at that point, you're picking your
poison: bugs from dynamic kernel stacks versus crashes from stack overflows.

At some point, I might be able to be talked into dynamic stack as a
FRED-only feature. But FRED isn't widespread enough to go to the trouble
today. I'm sure the folks who want this also don't want to wait until
all the devices in the field have FRED because that's even *longer* off.

So maybe this is one of those things that folks just need to deploy
out-of-tree for a couple of years, come back with some data to show us
that we were just paranoid, and we'll look at it again.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-04-24 21:35   ` Pasha Tatashin
  2026-04-24 22:21     ` Dave Hansen
@ 2026-04-24 22:26     ` David Laight
  2026-04-24 23:06       ` Pasha Tatashin
  2026-06-19  0:29       ` Dave Hansen
  1 sibling, 2 replies; 41+ messages in thread
From: David Laight @ 2026-04-24 22:26 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Dave Hansen, David Stevens, Linus Walleij, Will Deacon,
	Quentin Perret, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On Fri, 24 Apr 2026 21:35:20 +0000
Pasha Tatashin <pasha.tatashin@soleen.com> wrote:

> On 04-24 12:41, Dave Hansen wrote:
> > On 4/24/26 12:14, David Stevens wrote:  
> > > The question is then: is this approach something that is fundamentally
> > > untenable in the kernel  
> > 
> > Yes. Fundamentally untenable.
> > 
> > Not allowing stack faults has been a wonderful simplification. It's one
> > of those things that just plain makes the kernel easier to maintain.
> > Saving low single digits of system memory is not exactly making me eager
> > to go back to the harder-to-maintain days.
> > 
> > I seriously doubt that this 1% is the lowest hanging fruit for memory
> > bloat on these systems. ;)  
> 
> This true until, in a fleet of millions of machines, you encounter a 
> one-in-a-billion chance of a stack overflow. You are then forced to 
> double the statically allocated kernel stacks on every machine, paying a 
> memory tax even though 99.999..% of threads never exceed 4K. This 
> overhead accumulates to petabytes of wasted capacity.

And then you hit a stack fault in some path where you can't sleep and
there isn't any available kernel memory.

An alternative idea is to arrange for some system calls to sleep in
userspace, so when the thread is woken it re-executes the system call.
It then makes sense to assign the kernel stack to the process when
it enters the kernel.
That might mean that you don't need a kernel stack for all the threads
sleeping in futex() - it might even be possible to do the retry in
userspace saving the second kernel entry most of the time.
It is all 'hard and difficult' though.

The easier solution is to rewrite the system code so it doesn't have
1000s of threads :-)

	David



> 
> Pasha
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-04-24 22:21     ` Dave Hansen
@ 2026-04-24 22:49       ` David Stevens
  0 siblings, 0 replies; 41+ messages in thread
From: David Stevens @ 2026-04-24 22:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On Fri, Apr 24, 2026 at 3:21 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 4/24/26 14:35, Pasha Tatashin wrote:
> > On 04-24 12:41, Dave Hansen wrote:
> >> On 4/24/26 12:14, David Stevens wrote:
> >>> The question is then: is this approach something that is fundamentally
> >>> untenable in the kernel
> >> Yes. Fundamentally untenable.
> >>
> >> Not allowing stack faults has been a wonderful simplification. It's one
> >> of those things that just plain makes the kernel easier to maintain.
> >> Saving low single digits of system memory is not exactly making me eager
> >> to go back to the harder-to-maintain days.
> >>
> >> I seriously doubt that this 1% is the lowest hanging fruit for memory
> >> bloat on these systems. 😉
> > This true until, in a fleet of millions of machines, you encounter a
> > one-in-a-billion chance of a stack overflow. You are then forced to
> > double the statically allocated kernel stacks on every machine, paying a
> > memory tax even though 99.999..% of threads never exceed 4K. This
> > overhead accumulates to petabytes of wasted capacity.
>
> I don't disagree with you. But, at that point, you're picking your
> poison: bugs dynamic kernel stacks versus crashes from stack overflows.
>
> At some point, I might be able to be talked into dynamic stack as a
> FRED-only feature. But FRED isn't widespread enough to go to the trouble
> today. I'm sure the folks who want this also don't want to wait until
> all the devices in the field have FRED because that even *longer* off.

Why does this need to be FRED only? True, the lack of reentrancy with
IST stacks complicates a few situations. That adds some complexity
beyond what's needed for FRED-only support, but the additional
complexity doesn't really seem like a hard blocker, at least if we
accept the complexity of kernel stack faults for FRED.

-David


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-04-24 22:26     ` David Laight
@ 2026-04-24 23:06       ` Pasha Tatashin
  2026-06-19  0:29       ` Dave Hansen
  1 sibling, 0 replies; 41+ messages in thread
From: Pasha Tatashin @ 2026-04-24 23:06 UTC (permalink / raw)
  To: David Laight
  Cc: Pasha Tatashin, Dave Hansen, David Stevens, Linus Walleij,
	Will Deacon, Quentin Perret, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Andy Lutomirski, Xin Li, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Uladzislau Rezki, Kees Cook, linux-kernel, linux-mm, willy

On 04-24 23:26, David Laight wrote:
> On Fri, 24 Apr 2026 21:35:20 +0000
> Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> 
> > On 04-24 12:41, Dave Hansen wrote:
> > > On 4/24/26 12:14, David Stevens wrote:  
> > > > The question is then: is this approach something that is fundamentally
> > > > untenable in the kernel  
> > > 
> > > Yes. Fundamentally untenable.
> > > 
> > > Not allowing stack faults has been a wonderful simplification. It's one
> > > of those things that just plain makes the kernel easier to maintain.
> > > Saving low single digits of system memory is not exactly making me eager
> > > to go back to the harder-to-maintain days.
> > > 
> > > I seriously doubt that this 1% is the lowest hanging fruit for memory
> > > bloat on these systems. ;)  
> > 
> > This true until, in a fleet of millions of machines, you encounter a 
> > one-in-a-billion chance of a stack overflow. You are then forced to 
> > double the statically allocated kernel stacks on every machine, paying a 
> > memory tax even though 99.999..% of threads never exceed 4K. This 
> > overhead accumulates to petabytes of wasted capacity.
> 
> And then you hit a stack fault in some path where you can't sleep and
> there isn't any available kernel memory.

Well, at least if we hit this rare case, we can simply double a buffer 
of pre-reserved stack memory per CPU. This still saves significant 
memory compared to wasting it on every single thread.

> An alternative idea is to arrange for some system calls to sleep in
> userspace, so when the thread is woken it re-executes the system call.
> It then makes sense to assign the kernel stack to the process when
> it enters the kernel.
> That might mean that you don't need a kernel stack for all the threads
> sleeping in futex() - it might even be possible to do the retry in
> userspace saving the second kernel entry most of the time.
> It is all 'hard and difficult' though.

I was thinking about a similar approach as well—sort of multiplexing the 
kernel stacks. But honestly, when trying to cover all the edge cases, I 
didn't find it to be any better or easier than just using dynamic kernel 
stacks.

An alternative approach, which was proposed at LSFMM by Willy, is to add 
explicit deep stack calls. When we enter a path that we know is 
exceptionally deep, only then do we extend the stack, keeping the 
default (say, 8K) everywhere else.

> The easier solution is to rewrite the system code so it doesn't have
> 1000s of threads :-)

That ship sailed in the early 90s of the previous millennium.  Nowadays, 
we have high end workstations with almost 200 hardware threads. 
Rewriting system code to reduce thread counts simply isn't an option for 
our storage machines, which have millions of threads per unit.

+CC Matthew Wilcox


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
  2026-04-24 21:35   ` Pasha Tatashin
@ 2026-04-25  9:19   ` H. Peter Anvin
  2026-04-27 16:17     ` Dave Hansen
  2026-04-27 16:31     ` Pasha Tatashin
  1 sibling, 2 replies; 41+ messages in thread
From: H. Peter Anvin @ 2026-04-25  9:19 UTC (permalink / raw)
  To: Dave Hansen, David Stevens, Pasha Tatashin, Linus Walleij,
	Will Deacon, Quentin Perret, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
  Cc: linux-kernel, linux-mm

On 2026-04-24 12:41, Dave Hansen wrote:
> On 4/24/26 12:14, David Stevens wrote:
>> The question is then: is this approach something that is fundamentally
>> untenable in the kernel
> 
> Yes. Fundamentally untenable.
> 
> Not allowing stack faults has been a wonderful simplification. It's one
> of those things that just plain makes the kernel easier to maintain.
> Saving low single digits of system memory is not exactly making me eager
> to go back to the harder-to-maintain days.
> 
> I seriously doubt that this 1% is the lowest hanging fruit for memory
> bloat on these systems. ;)

It is worth noting that this was one of the VERY early design decisions that
has shaped Linux from the beginning:

- No swapping of kernel memory
- Kernel stacks are statically allocated
- Physical RAM is mapped into the kernel at all times
- A "monolithic" kernel using function calls, not message passing
- A kernel interface that closely maps to the low-level application API
  (e.g. each user space thread is a kernel thread.)
- Kernel ABIs and APIs are subject to evolution; stability is only guaranteed
  in user space.

Those design decisions are, by and large, what has made Linux Linux: a
relatively simple, highly performant, and reliable system.

	-hpa



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-04-25  9:19   ` H. Peter Anvin
@ 2026-04-27 16:17     ` Dave Hansen
  2026-06-18 14:50       ` Zach O'Keefe
  2026-04-27 16:31     ` Pasha Tatashin
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2026-04-27 16:17 UTC (permalink / raw)
  To: H. Peter Anvin, David Stevens, Pasha Tatashin, Linus Walleij,
	Will Deacon, Quentin Perret, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
  Cc: linux-kernel, linux-mm

On 4/25/26 02:19, H. Peter Anvin wrote:
> It is worth noting that this was one of the VERY early design decisions that
> has shaped Linux from the beginning:
> 
> - No swapping of kernel memory
> - Kernel stacks are statically allocated
...

One other bit to add here: In the past, kernel faults on kernel memory
have been allowed, like to populate vmalloc() page table entries into
the parts of the page tables that are not shared across processes. Even
*that* turned out to be too much of a pain even though it didn't involve
allocation, and the kernel has been moving away from that.




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-04-25  9:19   ` H. Peter Anvin
  2026-04-27 16:17     ` Dave Hansen
@ 2026-04-27 16:31     ` Pasha Tatashin
  1 sibling, 0 replies; 41+ messages in thread
From: Pasha Tatashin @ 2026-04-27 16:31 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Dave Hansen, David Stevens, Pasha Tatashin, Linus Walleij,
	Will Deacon, Quentin Perret, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On 04-25 02:19, H. Peter Anvin wrote:
> On 2026-04-24 12:41, Dave Hansen wrote:
> > On 4/24/26 12:14, David Stevens wrote:
> >> The question is then: is this approach something that is fundamentally
> >> untenable in the kernel
> > 
> > Yes. Fundamentally untenable.
> > 
> > Not allowing stack faults has been a wonderful simplification. It's one
> > of those things that just plain makes the kernel easier to maintain.
> > Saving low single digits of system memory is not exactly making me eager
> > to go back to the harder-to-maintain days.
> > 
> > I seriously doubt that this 1% is the lowest hanging fruit for memory
> > bloat on these systems. ;)
> 
> It is worth noting that this was one of the VERY early design decisions that
> has shaped Linux from the beginning:
> 
> - No swapping of kernel memory
> - Kernel stacks are statically allocated
> - Physical RAM is mapped into the kernel at all times
> - A "monolithic" kernel using function calls, not message passing
> - A kernel interface that closely maps to the low-level application API
>   (e.g. each user space thread is a kernel thread.)
> - Kernel ABIs and APIs are subject to evolution; stability is only guaranteed
>   in user space.
> 
> Those design decisions are, by and large, what has made Linux Linux: a
> relatively simple, highly performant, and reliable system.

I think there is a bit of survivorship bias in that list. Originally,
there were many other foundational assumptions that have since evolved
as hardware and requirements scaled.

For example, there were assumptions about no dynamic hardware
reconfiguration (no memory/CPU hot-plug), uniform memory access (no
NUMA), and fixed page sizes (no THP or HugeTLB). All of those have
changed, and you, better than most, know of many other such examples.

A more recent example is PREEMPT_RT: the Linux kernel was originally
designed to be non-preemptible.

Even the assumptions in your list, such as "physical RAM is mapped into
the kernel at all times," are evolving: emulated pmem is not mapped, and
guestmemfd plans to allow unmapping memory from the direct map for
security reasons.

Aside from trying our best not to break user space and allowing the
internal kernel API to evolve, the other items are architectural
decisions that can and should adapt to new requirements.

We now have machines with thousands of hardware threads. Running
millions of software threads on such machines is a practical reality,
and at fleet scales, statically allocating kernel stacks for all of them
wastes a massive amount of memory.

The proposed solution won't affect Linux as a whole. It can be
optionally enabled for targeted configurations. Additionally, the max
stack size is still statically set; it simply isn't populated until
actually used.

Pasha


* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-04-27 16:17     ` Dave Hansen
@ 2026-06-18 14:50       ` Zach O'Keefe
  2026-06-18 18:53         ` Dave Hansen
  0 siblings, 1 reply; 41+ messages in thread
From: Zach O'Keefe @ 2026-06-18 14:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: H. Peter Anvin, David Stevens, Pasha Tatashin, Linus Walleij,
	Will Deacon, Quentin Perret, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On Mon, Apr 27, 2026 at 9:22 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/25/26 02:19, H. Peter Anvin wrote:
> > It is worth noting that this was one of the VERY early design decisions that
> > has shaped Linux from the beginning:
> >
> > - No swapping of kernel memory
> > - Kernel stacks are statically allocated
> ...
>
> One other bit to add here: In the past, kernel faults on kernel memory
> have been allowed, like to populate vmalloc() page table entries into
> the parts of the page tables that are not shared across processes. Even
> *that* turned out to be too much of a pain even though it didn't involve
> allocation, and the kernel has been moving away from that.
>

Dave,

Necroing this thread, as the potential aggregate savings continue to
stand out for us on the datacenter side (whereas David is motivated
separately, from the consumer device side).

I certainly empathize with your position, and hesitation to give up
such a nice simplification just to invite new headaches.

However, I'd still like to work with you to understand what feasible
path forward you see, hoping you can proactively steer us away from
some of the bigger headaches.

I think we are fine being forward-looking, and only supporting this
for FRED (which is on our doorstep). That said, understanding the
issues you foresee with the IST approach would still be valuable, as
it might save us internal trouble should we choose to carry it
temporarily to bridge the gap with FRED.

Overall, are there any particular pain points you'd like to see fleshed
out first? How would you like to proceed? Would explicitly marking
this as an experimental config, in the interim, be more attractive?

Thanks, and I appreciate any help or guidance here.

Best,
Zach


* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-18 14:50       ` Zach O'Keefe
@ 2026-06-18 18:53         ` Dave Hansen
  2026-06-18 22:28           ` H. Peter Anvin
  2026-06-19 12:45           ` Thomas Gleixner
  0 siblings, 2 replies; 41+ messages in thread
From: Dave Hansen @ 2026-06-18 18:53 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: H. Peter Anvin, David Stevens, Pasha Tatashin, Linus Walleij,
	Will Deacon, Quentin Perret, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On 6/18/26 07:50, Zach O'Keefe wrote:
> Overall, are there any particular pain points you'd like to see fleshed
> out first?

Handling exceptions in the kernel is hard. Period. That's the pain point.
Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
we've moved away from ever taking random page faults in the kernel. Or,
heck, randomly taking faults at *all*. We've concentrated them in very
specific places, not in general code.

Now you're arguing that the kernel can pretty much take a fault *AND*
allocate memory reliably at any point*.

I just don't see the collateral in this series to justify that claim.

The NMI entry code is a disaster because NMIs can happen anywhere. The
#VC code is a disaster because #VCs can happen anywhere. Once #PF can
happen anywhere*, why won't #PF become a disaster?

It would be a completely different story if there was a track record of
finding and fixing bugs in the x86 entry code from the authors of this
series. But I don't think I've ever seen a single email from your folks
before this, much less a review tag or a patch. I'd be much happier if
you got Andy L's blessing on this, for example.

> How would you like to proceed? Would explicitly marking this as an
> experimental config, in the interim, be more attractive?

No.

The enemy here is complexity. *Maintenance* complexity. Being able to
compile out some of the complexity helps with debugging. But it doesn't
help maintaining the code.

--

* #PF on stack accesses isn't *quite* as bad as NMI or #VC, I'll give
  you that. But it's still pretty darn bad.


* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-18 18:53         ` Dave Hansen
@ 2026-06-18 22:28           ` H. Peter Anvin
  2026-06-19  0:40             ` David Stevens
  2026-06-19 12:45           ` Thomas Gleixner
  1 sibling, 1 reply; 41+ messages in thread
From: H. Peter Anvin @ 2026-06-18 22:28 UTC (permalink / raw)
  To: Dave Hansen, Zach O'Keefe
  Cc: David Stevens, Pasha Tatashin, Linus Walleij, Will Deacon,
	Quentin Perret, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On 2026-06-18 11:53, Dave Hansen wrote:
> On 6/18/26 07:50, Zach O'Keefe wrote:
>> Overall, are there any particular pain points you'd like to see fleshed
>> out first?
> 
> Handling exceptions in the kernel is hard. Period. That's the pain point.
> Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
> we've moved away from ever taking random page faults in the kernel. Or,
> heck, randomly taking faults at *all*. We've concentrated them in very
> specific places, not in general code.
> 
> Now you're arguing that the kernel can pretty much take a fault *AND*
> allocate memory reliably at any point*.
> 
> I just don't see the collateral in this series to justify that claim.
> 

That is most definitely the zeroth-order thing. Extraordinary claims require
extraordinary evidence, and this is certainly an extraordinary claim. In
addition to the *massive* maintainability issue, you also have to consider the
additional overheads you will now have to deal with in order to avoid deadlocks.

Almost every OS that has attempted to swap out kernel stacks has been known
to suffer from deadlocks under very high memory load.

> The NMI entry code is a disaster because NMIs can happen anywhere. The
> #VC code is a disaster because #VCs can happen anywhere. Once #PF can
> happen anywhere*, why won't #PF become a disaster?
> [...]
> * #PF on stack accesses isn't *quite* as bad as NMI or #VC, I'll give
>   you that. But it's still pretty darn bad.

In some ways, they are actually *worse*.

#PFs need to be able to sleep, because the common case for a #PF in the kernel
is that it touched user space. This means #PF needs to be using IST/SL 0.
However, this is obviously incompatible with handling #PFs on the kernel stack
itself, so now it needs a stack switch. In the common case, it will then need
to demote the #PF back onto the normal execution stack, which is complex in
its own right.

Now, if you are on a pre-FRED system, the IST entries don't nest, so you
absolutely have to make sure you can't get there again through any means
whatsoever. With FRED, it isn't quite so dire, but it will still give you lots
of fun if that interrupt is one which would like to be demoted off the IRQ stack.

> It would be a completely different story if there was a track record of
> finding and fixing bugs in the x86 entry code from the authors of this
> series. But I don't think I've ever seen a single email from your folks
> before this, much less a review tag or a patch. I'd be much happier if
> you got Andy L's blessing on this, for example.
> 
>> How would you like to proceed? Would explicitly marking this as an
>> experimental config, in the interim, be more attractive?
> No.
> 
> The enemy here is complexity. *Maintenance* complexity. Being able to
> compile out some of the complexity helps with debugging. But it doesn't
> help maintaining the code.

Indeed. Paravirtualization is a great example of how this works. The PV hooks
in the kernel are still a maintenance nightmare 20 years after they were
introduced, and mostly that cost is not borne by the people who introduced and
benefited from them.

	-hpa


* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-04-24 22:26     ` David Laight
  2026-04-24 23:06       ` Pasha Tatashin
@ 2026-06-19  0:29       ` Dave Hansen
  2026-06-19 19:56         ` Zach O'Keefe
  2026-06-20  5:25         ` David Stevens
  1 sibling, 2 replies; 41+ messages in thread
From: Dave Hansen @ 2026-06-19  0:29 UTC (permalink / raw)
  To: David Laight, Pasha Tatashin
  Cc: David Stevens, Linus Walleij, Will Deacon, Quentin Perret,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On 4/24/26 15:26, David Laight wrote:
>> This is true until, in a fleet of millions of machines, you encounter a
>> one-in-a-billion chance of a stack overflow. You are then forced to 
>> double the statically allocated kernel stacks on every machine, paying a 
>> memory tax even though 99.999..% of threads never exceed 4K. This 
>> overhead accumulates to petabytes of wasted capacity.
> And then you hit a stack fault in some path where you can't sleep and
> there isn't any available kernel memory.
> 
> An alternative idea is to arrange for some system calls to sleep in
> userspace, so when the thread is woken it re-executes the system call.
> It then makes sense to assign the kernel stack to the process when
> it enters the kernel.

There are probably other ways to do this without handling exceptions.

For instance, let's say you always *map* 16k of stack for each process.
But, after context switching out, you take a look at 4x8b pte_t's that
were mapping the kernel stack. If the _PAGE_ACCESSED bit is clear, you
can just clear _PAGE_PRESENT and reclaim the page.

If you don't want the overhead in the normal context switch path, you
reclaim in a shrinker, at the cost of needing locking to coordinate with
the scheduler.

A simple rule would be: a thread that ever accesses a page gets to keep
it forever. They're never reclaimed after being accessed, only before.

For that, the worst case is that you go to schedule a new thread and
can't allocate memory to fill in the 4 pte_t's. You can't run it until you
or some other CPU goes and does some reclaim.

Needing memory in the middle of schedule() is generally a no-go. But it's
a lot better than not being able to continue _execution_ of a kernel
thread at *ALL*, possibly in a non-preemptible context, like when you do
it in a #PF.
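
The accessed-bit scheme described above can be modelled in a few lines
of user-space C. This is only a sketch under stated assumptions, not
kernel code: struct sim_stack, sim_touch(), sim_reclaim_unused() and
sim_populate() are hypothetical names, the four-entry array stands in
for the 4 pte_t's, and the integer pool stands in for the page
allocator. The bit positions mirror the x86 present and accessed bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model, not kernel code. The bit positions
 * mirror the x86 PTE present (bit 0) and accessed (bit 5) bits. */
#define SIM_PAGE_PRESENT  (1ULL << 0)
#define SIM_PAGE_ACCESSED (1ULL << 5)
#define SIM_STACK_PAGES   4	/* 16k of stack mapped with 4k pages */

struct sim_stack {
	uint64_t pte[SIM_STACK_PAGES];
};

/* Model the CPU touching a stack page: hardware sets the accessed bit. */
static void sim_touch(struct sim_stack *s, int idx)
{
	assert(s->pte[idx] & SIM_PAGE_PRESENT);
	s->pte[idx] |= SIM_PAGE_ACCESSED;
}

/* After switching the thread out: unmap every page that was never
 * accessed. Accessed pages are kept forever. Returns pages reclaimed. */
static int sim_reclaim_unused(struct sim_stack *s)
{
	int freed = 0;

	for (int i = 0; i < SIM_STACK_PAGES; i++) {
		bool present = (s->pte[i] & SIM_PAGE_PRESENT) != 0;
		bool accessed = (s->pte[i] & SIM_PAGE_ACCESSED) != 0;

		if (present && !accessed) {
			s->pte[i] &= ~SIM_PAGE_PRESENT;
			freed++;
		}
	}
	return freed;
}

/* Before switching the thread in: refill all missing pages from the
 * pool, or refuse to run the thread if the pool is too small. */
static bool sim_populate(struct sim_stack *s, int *free_pages)
{
	int missing = 0;

	for (int i = 0; i < SIM_STACK_PAGES; i++)
		if (!(s->pte[i] & SIM_PAGE_PRESENT))
			missing++;

	if (missing > *free_pages)
		return false;	/* worst case: thread cannot be scheduled yet */

	for (int i = 0; i < SIM_STACK_PAGES; i++)
		s->pte[i] |= SIM_PAGE_PRESENT;

	*free_pages -= missing;
	return true;
}
```

In this model the only failure mode is sim_populate() returning false
at switch-in time, which matches the "can't run it until some reclaim
happens" worst case; there is no point at which a running thread is
left without a mapped stack page.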

Basically, I think there's a way to do this that limits the kernel blast
radius to _mostly_ being a core mm problem.

What else has been considered before the #PF-based mechanism?


* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-18 22:28           ` H. Peter Anvin
@ 2026-06-19  0:40             ` David Stevens
  2026-06-19  0:44               ` H. Peter Anvin
  0 siblings, 1 reply; 41+ messages in thread
From: David Stevens @ 2026-06-19  0:40 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Dave Hansen, Zach O'Keefe, Pasha Tatashin, Linus Walleij,
	Will Deacon, Quentin Perret, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On Thu, Jun 18, 2026 at 3:28 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 2026-06-18 11:53, Dave Hansen wrote:
> > On 6/18/26 07:50, Zach O'Keefe wrote:
> >> Overall, are there any particular pain points you'd like to see fleshed
> >> out first?
> >
> > Handling exceptions in the kernel is hard. Period. That's the pain point.
> > Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
> > we've moved away from ever taking random page faults in the kernel. Or,
> > heck, randomly taking faults at *all*. We've concentrated them in very
> > specific places, not in general code.
> >
> > Now you're arguing that the kernel can pretty much take a fault *AND*
> > allocate memory reliably at any point*.
> >
> > I just don't see the collateral in this series to justify that claim.
> >
>
> That is most definitely the zeroth-order thing. Extraordinary claims require
> extraordinary evidence, and this is certainly an extraordinary claim.

I do acknowledge that there is currently a lack of evidence - this is
an RFC after all. The question is whether it is possible in principle
to produce sufficient evidence. From the Android side of Google, we
are willing to carry the RFC patches downstream for a while to build a
case for merging them upstream. However, there needs to be at least a
possibility of success before we undertake that work. If upstream's
position is that dynamic stacks are no good, full stop, and will
absolutely never happen, then there's no point in us trying to pursue
this avenue further. And I assume those from the datacenter side of
the company are in a similar position.

-David


> In addition to the *massive* maintainability issue, you also have to consider the
> additional overheads you will now have to deal with in order to avoid deadlocks.
>
> Almost every OS that has attempted to swap out kernel stacks has been known
> to suffer from deadlocks under very high memory load.
>
>
> > The NMI entry code is a disaster because NMIs can happen anywhere. The
> > #VC code is a disaster because #VCs can happen anywhere. Once #PF can
> > happen anywhere*, why won't #PF become a disaster?
> > [...]
> > * #PF on stack accesses isn't *quite* as bad as NMI or #VC, I'll give
> >   you that. But it's still pretty darn bad.
>
> In some ways, they are actually *worse*.
>
> #PFs need to be able to sleep, because the common case for a #PF in the kernel
> is that it touched user space. This means #PF needs to be using IST/SL 0.
> However, this is obviously incompatible with handling #PFs on the kernel stack
> itself, so now it needs a stack switch. In the common case, it will then need
> to demote the #PF back onto the normal execution stack, which is complex in
> its own right.
>
> Now, if you are on a pre-FRED system, the IST entries don't nest, so you
> absolutely have to make sure you can't get there again through any means
> whatsoever. With FRED, it isn't quite so dire, but it will still give you lots
> of fun if that interrupt is one which would like to be demoted off the IRQ stack.
>
> > It would be a completely different story if there was a track record of
> > finding and fixing bugs in the x86 entry code from the authors of this
> > series. But I don't think I've ever seen a single email from your folks
> > before this, much less a review tag or a patch. I'd be much happier if
> > you got Andy L's blessing on this, for example.
> >
> >> How would you like to proceed? Would explicitly marking this as an
> >> experimental config, in the interim, be more attractive?
> > No.
> >
> > The enemy here is complexity. *Maintenance* complexity. Being able to
> > compile out some of the complexity helps with debugging. But it doesn't
> > help maintaining the code.
> Indeed. Paravirtualization is a great example of how this works. The PV hooks
> in the kernel are still a maintenance nightmare 20 years after they were
> introduced, and mostly that cost is not borne by the people who introduced and
> benefited from them.
>
>         -hpa
>



* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-19  0:40             ` David Stevens
@ 2026-06-19  0:44               ` H. Peter Anvin
  0 siblings, 0 replies; 41+ messages in thread
From: H. Peter Anvin @ 2026-06-19  0:44 UTC (permalink / raw)
  To: David Stevens
  Cc: Dave Hansen, Zach O'Keefe, Pasha Tatashin, Linus Walleij,
	Will Deacon, Quentin Perret, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On 2026-06-18 17:40, David Stevens wrote:
>>
>> That is most definitely the zeroth-order thing. Extraordinary claims require
>> extraordinary evidence, and this is certainly an extraordinary claim.
> 
> I do acknowledge that there is currently a lack of evidence - this is
> an RFC after all. The question is whether it is possible in principle
> to produce sufficient evidence. From the Android side of Google, we
> are willing to carry the RFC patches downstream for a while to build a
> case for merging them upstream. However, there needs to be at least a
> possibility of success before we undertake that work. If upstream's
> position is that dynamic stacks are no good, full stop, and will
> absolutely never happen, then there's no point in us trying to pursue
> this avenue further. And I assume those from the datacenter side of
> the company are in a similar position.
> 

The answer is pretty much that you would have to present *very* 
impressive-looking evidence. There is very little that's completely 
absolute, but you definitely have a tall hill to climb on this one.

Keep in mind that there are also people who claim that our current 
page sizes are much too small, and that the kernel should be doing 16K 
or 64K pages. At that point this more or less evaporates, too.

	-hpa




* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-18 18:53         ` Dave Hansen
  2026-06-18 22:28           ` H. Peter Anvin
@ 2026-06-19 12:45           ` Thomas Gleixner
  2026-06-19 19:20             ` Zach O'Keefe
  1 sibling, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2026-06-19 12:45 UTC (permalink / raw)
  To: Dave Hansen, Zach O'Keefe
  Cc: H. Peter Anvin, David Stevens, Pasha Tatashin, Linus Walleij,
	Will Deacon, Quentin Perret, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On Thu, Jun 18 2026 at 11:53, Dave Hansen wrote:
> On 6/18/26 07:50, Zach O'Keefe wrote:
>> Overall, are there any particular pain points you'd like to see fleshed
>> out first?
>
> Handling exceptions in the kernel is hard. Period. That's the pain point.
> Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
> we've moved away from ever taking random page faults in the kernel. Or,
> heck, randomly taking faults at *all*. We've concentrated them in very
> specific places, not in general code.
>
> Now you're arguing that the kernel can pretty much take a fault *AND*
> allocate memory reliably at any point*.
>
> I just don't see the collateral in this series to justify that claim.

There is none, because it's simply impossible to guarantee. Reading
through the series, even a CPU hotplug operation happily continues with
success when the stack page cache of the upcoming CPU can't be
filled....

> The NMI entry code is a disaster because NMIs can happen anywhere. The
> #VC code is a disaster because #VCs can happen anywhere. Once #PF can
> happen anywhere*, why won't #PF become a disaster?

It's already a disaster. See kvm_handle_async_pf() and the cute issues
vs. taking a #PF in NMI or some other IST handler.

> It would be a completely different story if there was a track record of
> finding and fixing bugs in the x86 entry code from the authors of this
> series. But I don't think I've ever seen a single email from your folks
> before this, much less a review tag or a patch. I'd be much happier if
> you got Andy L's blessing on this, for example.
>
>> How would you like to proceed? Would explicitly marking this as an
>> experimental config, in the interim, be more attractive?
> No.
>
> The enemy here is complexity. *Maintenance* complexity. Being able to
> compile out some of the complexity helps with debugging. But it doesn't
> help maintaining the code.

Correct.

Aside from that, the part which worries me most is the IDT hackery. That's
fragile as hell and full of unvalidated assumptions. Reading "should not
happen" several times in a changelog doesn't make me more confident.

  "It is possible for #MCE to occur on the #PF IST stack, but the #MCE
   handler shouldn't generate new #PFs. The reentrancy check on the #PF
   stack will trigger if any recoverable #MCEs do generate #PFs - if there
   are actually reports of it happening, we can address it then."

Seriously?

We don't wait until the report comes in because the report won't even
happen in the worst case:

       #PF on IST
         ...
         cmp    0, reentrance
         jne    abort

       #MC
          ...
          #PF rewinds #PF IST
          cmp   0, reentrance
          jne   abort           <- Not taken because #MC happened before
                                   it could be set.

IST is fundamentally not suitable for this and I'm sure there are more
holes in this.
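
The interleaving above can be replayed with a small user-space C model.
This is a sketch under stated assumptions, not the code from the
series: struct sim_ist and the sim_* helpers are hypothetical, and the
single frame_owner field stands in for the fixed top of the #PF IST
stack, which every IST entry starts from again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an IST-based #PF entry, not code from the series. */
enum { OWNER_NONE = 0, OWNER_OUTER = 1, OWNER_NESTED = 2 };

struct sim_ist {
	int  frame_owner;	/* whose frame sits at the fixed IST stack top */
	bool reentrance;	/* software reentrancy flag */
};

/* Hardware part of the entry: an IST entry always starts at the same
 * stack top, so a nested entry overwrites the previous frame. */
static void sim_hw_entry(struct sim_ist *ist, int owner)
{
	ist->frame_owner = owner;
}

/* Software check ("cmp 0, reentrance; jne abort"): true means abort. */
static bool sim_check(const struct sim_ist *ist)
{
	return ist->reentrance;
}

/* Replay: outer #PF, then #MC whose handler takes a nested #PF.
 * mc_before_flag_set selects whether #MC lands in the window between
 * the outer check and the outer store to the flag.
 * Returns true if the nested entry was detected. */
static bool sim_race(struct sim_ist *ist, bool mc_before_flag_set)
{
	bool detected;

	sim_hw_entry(ist, OWNER_OUTER);
	(void)sim_check(ist);		/* outer check passes, flag is clear */
	if (!mc_before_flag_set)
		ist->reentrance = true;	/* outer got to set the flag first */

	/* #MC arrives here; its handler takes a #PF on the same IST. */
	sim_hw_entry(ist, OWNER_NESTED);
	detected = sim_check(ist);

	if (mc_before_flag_set)
		ist->reentrance = true;	/* set too late to catch anything */
	return detected;
}
```

In both orderings the outer frame is already overwritten by the time
the nested check runs. When #MC lands inside the window, the check
does not fire either, so the corruption goes unreported.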

I haven't looked at the FRED side of affairs yet in detail, but the
handwavy explanation about external interrupts having to be moved to
stack level 1 and unconditionally bounced back does not really make it
appealing. I agree that chapter 8.3.4 in the SDM volume 3 is not really
helpful, but papering over the problem without understanding the root
cause is not cutting it. If it's a genuine FRED hardware issue, then
this needs to be understood and documented.

The x86 folks have spent a lot of time to make the horrific x86
interrupt and exception handling solid and therefore have zero interest
to deal with the fallout of something based on "shouldn't happen"
assumptions. Either it can prove correctness under all circumstances or
not.

I understand the "save tons of memory across a fleet" argument, but a
large fleet is also a guarantee to trigger all the "should not happen
and improbable" issues which are gracefully handwaved away. That's a
truly bad tradeoff as it ends up in non-decodable bug reports. What's
worse, they have to be handled by the maintainers and not necessarily by
those who implemented it.

Thanks,

        tglx


* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-19 12:45           ` Thomas Gleixner
@ 2026-06-19 19:20             ` Zach O'Keefe
  2026-06-19 21:59               ` Thomas Gleixner
  0 siblings, 1 reply; 41+ messages in thread
From: Zach O'Keefe @ 2026-06-19 19:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, H. Peter Anvin, David Stevens, Pasha Tatashin,
	Linus Walleij, Will Deacon, Quentin Perret, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

> Aside from that, the part which worries me most is the IDT hackery. That's
> fragile as hell and full of unvalidated assumptions. Reading "should not
> happen" several times in a changelog doesn't make me more confident.
>
>   "It is possible for #MCE to occur on the #PF IST stack, but the #MCE
>    handler shouldn't generate new #PFs. The reentrancy check on the #PF
>    stack will trigger if any recoverable #MCEs do generate #PFs - if there
>    are actually reports of it happening, we can address it then."
>
> Seriously?
>
> We don't wait until the report comes in because the report won't even
> happen in the worst case:
>
>        #PF on IST
>          ...
>          cmp    0, reentrance
>          jne    abort
>
>        #MC
>           ...
>           #PF rewinds #PF IST
>           cmp   0, reentrance
>           jne   abort           <- Not taken because #MC happened before
>                                    it could be set.
>
> IST is fundamentally not suitable for this and I'm sure there are more
> holes in this.
>
> I haven't looked at the FRED side of affairs yet in detail, but the
> handwavy explanation about external interrupts having to be moved to
> stack level 1 and unconditionally bounced back does not really make it
> appealing. I agree that chapter 8.3.4 in the SDM volume 3 is not really
> helpful, but papering over the problem without understanding the root
> cause is not cutting it. If it's a genuine FRED hardware issue, then
> this needs to be understood and documented.
>
> The x86 folks have spent a lot of time to make the horrific x86
> interrupt and exception handling solid and therefore have zero interest
> to deal with the fallout of something based on "shouldn't happen"
> assumptions. Either it can prove correctness under all circumstances or
> not.
>
> I understand the "save tons of memory across a fleet" argument, but a
> large fleet is also a guarantee to trigger all the "should not happen
> and improbable" issues which are gracefully handwaved away. That's a
> truly bad tradeoff as it ends up in non-decodable bug reports. What's
> worse, they have to be handled by the maintainers and not necessarily by
> those who implemented it.

Thanks Dave / Thomas / Hans ; I appreciate your time taking a look at this.

As Dave previously pointed out, I'll admit to some ignorance regarding
the subtle nuances of x86 interrupt / exception handling. Counter to
my goals here, that code has "just worked," so attention and time have
been spent elsewhere. We'll undoubtedly need help making things solid
and avoiding previous pitfalls. As David mentions, this is an RFC.

While it seems common opinion that the IST-based solution is fragile,
what of FRED? It seems like this is exactly the kind of support needed
to avoid some of the aforementioned sw "mess" in various x86 exception
handling paths. I agree that it's less-than-ideal that we are forced
to downgrade exception levels in the common #PF case, but is that an
insurmountable problem? Pardon my ignorance.

Lastly, I just want to clarify what folks have meant by "extraordinary
claims" or "evidence".  Aside from the above discussion on FRED
exception handling, the "only" other part of this is the allocation.
Are people concerned about memory unavailability, deadlocking-type
issues, or something else? We have considerable design freedom here to
avoid certain classes of unreliability, but—barring any clever
tricks—I don't know if the allocation can be guaranteed to succeed in
all conceivable circumstances. I want to ensure that reality does not
present a hard blocker.

Again, thanks everyone for the time and help,

Best,
Zach


* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-19  0:29       ` Dave Hansen
@ 2026-06-19 19:56         ` Zach O'Keefe
  2026-06-20  5:25         ` David Stevens
  1 sibling, 0 replies; 41+ messages in thread
From: Zach O'Keefe @ 2026-06-19 19:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Laight, Pasha Tatashin, David Stevens, Linus Walleij,
	Will Deacon, Quentin Perret, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Andy Lutomirski, Xin Li, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Uladzislau Rezki, Kees Cook, linux-kernel, linux-mm

On Thu, Jun 18, 2026 at 5:29 PM Dave Hansen <dave.hansen@intel.com> wrote:

Thanks for the thoughts, Dave

> On 4/24/26 15:26, David Laight wrote:
> >> This is true until, in a fleet of millions of machines, you encounter a
> >> one-in-a-billion chance of a stack overflow. You are then forced to
> >> double the statically allocated kernel stacks on every machine, paying a
> >> memory tax even though 99.999..% of threads never exceed 4K. This
> >> overhead accumulates to petabytes of wasted capacity.
> > And then you hit a stack fault in some path where you can't sleep and
> > there isn't any available kernel memory.
> >
> > An alternative idea is to arrange for some system calls to sleep in
> > userspace, so when the thread is woken it re-executes the system call.
> > It then makes sense to assign the kernel stack to the process when
> > it enters the kernel.
>
> There are probably other ways to do this without handling exceptions.
>
> For instance, let's say you always *map* 16k of stack for each process.
> But, after context switching out, you take a look at 4x8b pte_t's that
> were mapping the kernel stack. If the _PAGE_ACCESSED bit is clear, you
> can just clear _PAGE_PRESENT and reclaim the page.
>
> If you don't want the overhead in the normal context switch path, you
> reclaim in a shrinker, at the cost of needing locking to coordinate with
> the scheduler.
>
> A simple rule would be: a thread that ever accesses a page gets to keep
> it forever. They're never reclaimed after being accessed, only before.

That's an interesting take; but it's a one-way latch, right? How do we
know that task won't dive deeper, later?

> For that, the worst case is that you go to schedule a new thread and
> can't allocate memory to fill in the 4 pte_t's. You can't run it until you
> or some other CPU goes and does some reclaim.
>
> Needing memory in the middle of schedule() is generally a no-go. But it's
> a lot better than not being able to continue _execution_ of a kernel
> thread at *ALL*, possibly in a non-preemptible context, like when you do
> it in a #PF.
>
> Basically, I think there's a way to do this that limits the kernel blast
> radius to _mostly_ being a core mm problem.
>
> What else has been considered before the #PF-based mechanism?

The only other way to know on-demand when to increase the stack size
is through stack probing, which I've ruled out without further
consideration due to performance.

Then there is a class of solutions to explicitly grow / run certain
code paths on larger stacks. Though instrumentation may help, others
have described it as playing whack-a-mole.

Then there are solutions that use a shared pool of kernel stacks,
blocking userspace until one becomes available. Very disruptive.

I personally haven't explored any of these in great depth.

To me, handling this on-demand in #PF, though technically challenging,
offered (1) the most memory savings, (2) the least disruption to
userspace, and (3) (ironically, expected to be) the most maintainable,
general solution with the least perf impact.

Happy to consider other ideas, and again, I appreciate your time and thoughts.

Best,
Zach



* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-19 19:20             ` Zach O'Keefe
@ 2026-06-19 21:59               ` Thomas Gleixner
  2026-06-20  5:02                 ` David Stevens
  2026-06-20 19:33                 ` Zach O'Keefe
  0 siblings, 2 replies; 41+ messages in thread
From: Thomas Gleixner @ 2026-06-19 21:59 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Dave Hansen, H. Peter Anvin, David Stevens, Pasha Tatashin,
	Linus Walleij, Will Deacon, Quentin Perret, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

Zach!

On Fri, Jun 19 2026 at 12:20, Zach O'Keefe wrote:
> While it seems common opinion that the IST-based solution is fragile,
> what of FRED? It seems like this is exactly the kind of support needed
> to avoid some of the aforementioned sw "mess" in various x86 exception
> handling paths. I agree that it's less-than-ideal that we are forced
> to downgrade exception levels in the common #PF case, but is that an
> unsurmountable problem? Pardon my ignorance.

The #PF path is considered performance critical. But how much the
downgrade matters needs actual numbers to analyze under various workload
scenarios.

I've not seen numbers to that effect anywhere. The only numbers provided
are marketing material about the memory savings on a freshly booted idle
machine. There are _zero_ numbers about the actual real world savings,
but claims about the PETABYTE savings possible.

Seriously?

> Lastly, I just want to clarify what folks have meant by "extraordinary
> claims" or "evidence".  Aside from the above discussion on FRED
> exception handling, the "only" other part of this is the allocation.

Clearly anything which is explained with "shouldn't happen" and
"unlikely". At cloud scale nothing is unlikely anymore. That's simply the
reality of statistical math.

As I pointed out before the same applies to the unexplained
upgrade/downgrade game with external interrupts. Such issues cannot be
papered over without understanding the root cause as from decades long
experience they come inevitably back some time down the road. Cloud
scale even guarantees that.

> Are people concerned about memory unavailability, deadlocking-type
> issues, or something else? We have considerable design freedom here to
> avoid certain classes of unreliability, but—barring any clever
> tricks—I don't know if the allocation can be guaranteed to succeed in
> all conceivable circumstances. I want to ensure that reality does not
> present a hard blocker.

First of all the failure scenario has to be clearly defined.

Right now, if I'm reading the patches correctly this simply can end up
killing the wrong tasks/processes just because an OOM situation results
in a depletion of the per CPU cache and the very wrong task which runs
into the deep call stack situation ends up in the creek without a paddle.

The fact that you even fail to abort a CPU bringup when the allocation
of the per CPU stack page cache fails makes it pretty clear that exactly
zero thought has been spent on this problem.

Why the heck does this cache refill call have to be unconditionally in
__schedule() where preemption is disabled and therefore GFP_ATOMIC
is mandatory? I know "Works for me" (most of the time).

And just because I was looking at the patch in question I found this
other insanity:

> +	/*
> +	 * Most likely we faulted in the page right next to the last mapped
> +	 * page in the stack, however, it is possible (but very unlikely) that
> +	 * the faulted page is actually skips some pages in the stack. Make sure
> +	 * we do not create  more than one holes in the stack, and map every
> +	 * page between the current fault  address and the last page that is
> +	 * mapped in the stack.
> +	 */

Can anyone with a sane mind and the most minimal understanding of the
kernel's inner working explain to me how the kernel can skip "some
pages" on the stack?

If the kernel skips a whole page or more then there is a serious bug
somewhere. I might be missing something, but again the "very unlikely"
wording which handwaves about it is just disgustingly useless.
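
For reference, the behaviour the quoted comment describes can be modelled
in a few lines. This is only a toy sketch and not the code from the
patch: the struct, the function name and the 4-page stack are
assumptions made for illustration. Index 0 is the lowest address, and a
fault at some index maps every unmapped page from there up to the first
page that is already mapped.

```c
#include <assert.h>
#include <stdbool.h>

#define STACK_PAGES 4   /* assumed: 16 KiB stack made of 4 KiB pages */

/* Toy model of one kernel stack: mapped[i] says whether page i is backed. */
struct toy_stack {
        bool mapped[STACK_PAGES];
};

/*
 * Hypothetical helper: handle a fault on page fault_idx by mapping that
 * page and every unmapped page above it, stopping at the first page that
 * is already mapped. Returns the number of pages newly mapped, so a
 * return value above 1 means the access skipped at least one page.
 */
static int toy_stack_fill(struct toy_stack *s, int fault_idx)
{
        int mapped_now = 0;

        assert(fault_idx >= 0 && fault_idx < STACK_PAGES);

        for (int i = fault_idx; i < STACK_PAGES && !s->mapped[i]; i++) {
                s->mapped[i] = true;
                mapped_now++;
        }
        return mapped_now;
}
```

With only the top page mapped, a fault on page 1 maps pages 1 and 2 and
leaves page 0 alone, so no hole remains between the fault address and
the previously mapped region.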

I disagree with Dave on the RFC status of this series. It's not even
close to RFC, it's at PoC status.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-19 21:59               ` Thomas Gleixner
@ 2026-06-20  5:02                 ` David Stevens
  2026-06-20 21:59                   ` Thomas Gleixner
  2026-06-20 19:33                 ` Zach O'Keefe
  1 sibling, 1 reply; 41+ messages in thread
From: David Stevens @ 2026-06-20  5:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Zach O'Keefe, Dave Hansen, H. Peter Anvin, Pasha Tatashin,
	Linus Walleij, Will Deacon, Quentin Perret, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On Fri, Jun 19, 2026 at 2:59 PM Thomas Gleixner <tglx@kernel.org> wrote:
>
> Zach!
>
> On Fri, Jun 19 2026 at 12:20, Zach O'Keefe wrote:
> > While it seems common opinion that the IST-based solution is fragile,
> > what of FRED? It seems like this is exactly the kind of support needed
> > to avoid some of the aforementioned sw "mess" in various x86 exception
> > handling paths. I agree that it's less-than-ideal that we are forced
> > to downgrade exception levels in the common #PF case, but is that an
> > unsurmountable problem? Pardon my ignorance.
>
> The #PF path is considered performance critical. But how much the
> downgrade matters needs actual numbers to analyze under various workload
> scenarios.
>
> I've not seen numbers to that effect anywhere. The only numbers provided
> are marketing material about the memory savings on a freshly booted idle
> machine. There are _zero_ numbers about the actual real world savings,
> but claims about the PETABYTE savings possible.
>
> Seriously?
>
> > Lastly, I just want to clarify what folks have meant by "extraordinary
> > claims" or "evidence".  Aside from the above discussion on FRED
> > exception handling, the "only" other part of this is the allocation.
>
> Clearly anything which is explained with "shouldn't happen" and
> "unlikely". At cloud scale nothing is unlikely anymore. That's simply the
> reality of statistical math.
>
> As I pointed out before the same applies to the unexplained
> upgrade/downgrade game with external interrupts. Such issues cannot be
> papered over without understanding the root cause as from decades long
> experience they come inevitably back some time down the road. Cloud
> scale even guarantees that.
>
> > Are people concerned about memory unavailability, deadlocking-type
> > issues, or something else? We have considerable design freedom here to
> > avoid certain classes of unreliability, but—barring any clever
> > tricks—I don't know if the allocation can be guaranteed to succeed in
> > all conceivable circumstances. I want to ensure that reality does not
> > present a hard blocker.
>
> First of all the failure scenario has to be clearly defined.
>
> Right now, if I'm reading the patches correctly this simply can end up
> killing the wrong tasks/processes just because an OOM situation results
> in a depletion of the per CPU cache and the very wrong task which runs
> into the deep call stack situation ends up in the creek without a paddle.
>
> The fact that you even fail to abort a CPU bringup when the allocation
> of the per CPU stack page cache fails makes it pretty clear that exactly
> zero thought has been spent on this problem.
>
> Why the heck does this cache refill call have to be unconditionally in
> __schedule() where preemption is disabled and therefore GFP_ATOMIC
> is mandatory? I know "Works for me" (most of the time).
>
> And just because I was looking at the patch in question I found this
> other insanity:
>
> > +     /*
> > +      * Most likely we faulted in the page right next to the last mapped
> > +      * page in the stack, however, it is possible (but very unlikely) that
> > +      * the faulted page is actually skips some pages in the stack. Make sure
> > +      * we do not create  more than one holes in the stack, and map every
> > +      * page between the current fault  address and the last page that is
> > +      * mapped in the stack.
> > +      */
>
> Can anyone with a sane mind and the most minimal understanding of the
> kernel's inner working explain to me how the kernel can skip "some
> pages" on the stack?
>
> If the kernel skips a whole page or more then there is a serious bug
> somewhere. I might be missing something, but again the "very unlikely"
> wording which handwaves about it is just disgustingly useless.

FRAME_WARN accepts values up to 8192 bytes, and it can always be
ignored or simply disabled. If a stack frame is larger than 4k, then
it's entirely possible for the code and compiler to align in a way
where the first access in the frame skips a page in the stack. I think
we agree that such code would be highly suspect and (hopefully) would
only exist in out-of-tree drivers. But it's something the kernel build
system accepts today. Dynamic kernel stacks suddenly turning that into
a runtime kernel panic seems like exactly the sort of edge case that
we would get yelled at for not addressing.
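
To make the arithmetic concrete, here is a rough sketch of how a large
frame can step over a page. The helper name and the example addresses
are made up for illustration; the only facts assumed are 4 KiB pages and
a stack that grows down.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KiB pages, as on x86_64 */

/*
 * Hypothetical helper: number of stack pages lying strictly between the
 * page holding the lowest address touched so far and the page holding
 * the next access. The stack grows down, so next_access must not be
 * above lowest_touched.
 */
static inline unsigned long stack_pages_skipped(uintptr_t lowest_touched,
                                                uintptr_t next_access)
{
        uintptr_t touched_pfn = lowest_touched >> PAGE_SHIFT;
        uintptr_t access_pfn = next_access >> PAGE_SHIFT;

        assert(next_access <= lowest_touched);

        if (touched_pfn - access_pfn <= 1)
                return 0;       /* same page or the directly adjacent one */
        return touched_pfn - access_pfn - 1;
}
```

For example, if the lowest touched address is 0x10003f00 and a function
with an 8192 byte frame first writes to the bottom of its frame at
0x10001f00, the page at 0x10002000 is never touched in between, and the
helper returns 1. With a 512 byte frame the first write stays on the
same or the adjacent page and the result is 0.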


-David


> I disagree with Dave on the RFC status of this series. It's not even
> close to RFC, it's at PoC status.
>
> Thanks,
>
>         tglx


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-19  0:29       ` Dave Hansen
  2026-06-19 19:56         ` Zach O'Keefe
@ 2026-06-20  5:25         ` David Stevens
  2026-06-20 23:22           ` Dave Hansen
  1 sibling, 1 reply; 41+ messages in thread
From: David Stevens @ 2026-06-20  5:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Laight, Pasha Tatashin, Linus Walleij, Will Deacon,
	Quentin Perret, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On Thu, Jun 18, 2026 at 5:29 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/24/26 15:26, David Laight wrote:
> >> This is true until, in a fleet of millions of machines, you encounter a
> >> one-in-a-billion chance of a stack overflow. You are then forced to
> >> double the statically allocated kernel stacks on every machine, paying a
> >> memory tax even though 99.999..% of threads never exceed 4K. This
> >> overhead accumulates to petabytes of wasted capacity.
> > And then you hit a stack fault in some path where you can't sleep and
> > there isn't any available kernel memory.
> >
> > An alternative idea is to arrange for some system calls to sleep in
> > userspace, so when the thread is woken it re-executes the system call.
> > It then makes sense to assign the kernel stack to the process when
> > it enters the kernel.
>
> There are probably other ways to do this without handling exceptions.
>
> For instance, let's say you always *map* 16k of stack for each process.
> But, after context switching out, you take a look at 4x8b pte_t's that
> were mapping the kernel stack. If the _PAGE_ACCESSED bit is clear, you
> can just clear _PAGE_PRESENT and reclaim the page.
>
> If you don't want the overhead in the normal context switch path, you
> reclaim in a shrinker, at the cost of needing locking to coordinate with
> the scheduler.

My understanding is that speculative execution can fill the TLB, but
won't set access bits. Speculative execution of a function call could
definitely put an apparently unused stack page into the TLB. In
theory, I don't see anything preventing one CPU from speculatively
accessing memory from another CPU's current stack. You definitely
wouldn't want to do TLB shootdowns in the context switch path, so this
would require a shrinker. I guess if you're batching shootdowns in a
shrinker, it's probably not more expensive than swap on a
per-page-freed basis.
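
As a rough illustration of the reclaim rule quoted above, here is a toy
model of the four stack PTEs. The names are invented for this sketch;
the bit positions follow the x86 page table format (present is bit 0,
accessed is bit 5). It deliberately leaves out the TLB flush that would
be required before a page could actually be freed, which is exactly the
part that makes this a shrinker job rather than a context switch job.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_PRESENT   (UINT64_C(1) << 0)   /* x86 PTE bit 0 */
#define TOY_PAGE_ACCESSED  (UINT64_C(1) << 5)   /* x86 PTE bit 5 */
#define STACK_PTES         4                    /* 16 KiB stack, 4 KiB pages */

/*
 * Hypothetical helper: walk the PTEs mapping one thread's stack and
 * clear the present bit on every entry that is present but was never
 * accessed. Entries with the accessed bit set are kept forever, per the
 * "keep it once touched" rule. Returns the number of entries reclaimed.
 * A real implementation would have to flush the TLB before freeing the
 * backing pages; this sketch does not model that.
 */
static int toy_reclaim_unused(uint64_t pte[STACK_PTES])
{
        int reclaimed = 0;

        for (int i = 0; i < STACK_PTES; i++) {
                if ((pte[i] & TOY_PAGE_PRESENT) &&
                    !(pte[i] & TOY_PAGE_ACCESSED)) {
                        pte[i] &= ~TOY_PAGE_PRESENT;
                        reclaimed++;
                }
        }
        return reclaimed;
}
```

A thread that only ever touched its top two pages would give up the
other two on the first pass, and a second pass over the same PTEs would
reclaim nothing.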

> A simple rule would be: a thread that ever accesses a page gets to keep
> it forever. They're never reclaimed after being accessed, only before.
>
> For that, the worst case is that you go to schedule a new thread and
> can't allocate memory to fill in the 4 pte_t's. You can't run it until you
> or some other CPU goes and does some reclaim.
>
> Needing memory in the middle of schedule() is generally a no-go. But it's
> a lot better than not being able to continue _execution_ of a kernel
> thread at *ALL*, possibly in a non-preemptible context, like when you do
> it in a #PF.

I don't think this is different from the current proposal from a
memory allocation standpoint. Both proposals effectively maintain a
pool of preallocated pages used to fill the current thread's stack.
They vary substantially in when the pages are put into the page
tables, but both need to allocate during schedule().

-David

> Basically, I think there's a way to do this that limits the kernel blast
> radius to _mostly_ being a core mm problem.
>
> What else has been considered before the #PF-based mechanism?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-19 21:59               ` Thomas Gleixner
  2026-06-20  5:02                 ` David Stevens
@ 2026-06-20 19:33                 ` Zach O'Keefe
  2026-06-20 19:44                   ` H. Peter Anvin
  2026-06-20 23:34                   ` Thomas Gleixner
  1 sibling, 2 replies; 41+ messages in thread
From: Zach O'Keefe @ 2026-06-20 19:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, H. Peter Anvin, David Stevens, Pasha Tatashin,
	Linus Walleij, Will Deacon, Quentin Perret, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On Fri, Jun 19, 2026 at 2:59 PM Thomas Gleixner <tglx@kernel.org> wrote:

Thomas, thanks again for taking the time to look into this and help out.

> Zach!
>
> On Fri, Jun 19 2026 at 12:20, Zach O'Keefe wrote:
> > While it seems common opinion that the IST-based solution is fragile,
> > what of FRED? It seems like this is exactly the kind of support needed
> > to avoid some of the aforementioned sw "mess" in various x86 exception
> > handling paths. I agree that it's less-than-ideal that we are forced
> > to downgrade exception levels in the common #PF case, but is that an
> > unsurmountable problem? Pardon my ignorance.
>
> The #PF path is considered performance critical. But how much the
> downgrade matters needs actual numbers to analyze under various workload
> scenarios.

Ya, that's my concern as well, as I don't have a good intuition for
how perf critical kernel #PF is for real workloads. If this is your
primary concern, I'll take that as a _good_ thing; i.e. there's
nothing architecturally stopping us from doing this downgrade safely.
We'll still need the analysis, but that can be a later stage -- we're
more than happy to get this data for all.

> I've not seen numbers to that effect anywhere. The only numbers provided
> are marketing material about the memory savings on a freshly booted idle
> machine. There are _zero_ numbers about the actual real world savings,
> but claims about the PETABYTE savings possible.
>
> Seriously?

This is actually the best understood aspect. With O(100B) active tasks
fleetwide at any point, it only takes an average savings of O(10KiB)
per task to get to 1PiB. At least for our fleet, we know the % of
tasks that use only 4KiB, 8KiB, or require the full 16KiB, and the
math confirms that we expect O(PiB) aggregate savings. The % of stacks
requiring the full 16KiB is minuscule, but it still occurs at a rate
higher than what we can tolerate for stack overflow (SO) panics. Given
that the vast majority of stacks never exceed the first 4KiB, this is
where the significant opportunity lies.
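
The order-of-magnitude arithmetic can be checked with a few lines of C.
The task count and per-task savings below are just the round figures
from the paragraph above, not measured values, and the helper name is
made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define KIB (UINT64_C(1) << 10)
#define PIB (UINT64_C(1) << 50)

/*
 * Hypothetical helper: aggregate savings in bytes when `tasks` tasks
 * each save `kib_per_task` KiB of kernel stack. The caller must keep
 * the product below 2^64.
 */
static inline uint64_t fleet_savings_bytes(uint64_t tasks,
                                           uint64_t kib_per_task)
{
        return tasks * kib_per_task * KIB;
}
```

With 100 billion tasks saving 10 KiB each this yields 1.024e15 bytes,
or roughly 0.91 PiB. At 12 KiB per task (a 16 KiB stack shrunk to a
single 4 KiB page) the total crosses 1 PiB.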

> > Lastly, I just want to clarify what folks have meant by "extraordinary
> > claims" or "evidence".  Aside from the above discussion on FRED
> > exception handling, the "only" other part of this is the allocation.
>
> Clearly anything which is explained with "shouldn't happen" and
> "unlikely". At cloud scale nothing is unlikely anymore. That's simply the
> reality of statistical math.
>
> As I pointed out before the same applies to the unexplained
> upgrade/downgrade game with external interrupts. Such issues cannot be
> papered over without understanding the root cause as from decades long
> experience they come inevitably back some time down the road. Cloud
> scale even guarantees that.
>
> > Are people concerned about memory unavailability, deadlocking-type
> > issues, or something else? We have considerable design freedom here to
> > avoid certain classes of unreliability, but—barring any clever
> > tricks—I don't know if the allocation can be guaranteed to succeed in
> > all conceivable circumstances. I want to ensure that reality does not
> > present a hard blocker.
>
> First of all the failure scenario has to be clearly defined.
>
> Right now, if I'm reading the patches correctly this simply can end up
> killing the wrong tasks/processes just because an OOM situation results
> in a depletion of the per CPU cache and the very wrong task which runs
> into the deep call stack situation ends up in the creek without a paddle.
>
> The fact that you even fail to abort a CPU bringup when the allocation
> of the per CPU stack page cache fails makes it pretty clear that exactly
> zero thought has been spent on this problem.
>
> Why the heck does this cache refill call have to be unconditionally in
> __schedule() where preemption is disabled and therefore GFP_ATOMIC
> is mandatory? I know "Works for me" (most of the time).
>
> And just because I was looking at the patch in question I found this
> other insanity:
>
> > +     /*
> > +      * Most likely we faulted in the page right next to the last mapped
> > +      * page in the stack, however, it is possible (but very unlikely) that
> > +      * the faulted page is actually skips some pages in the stack. Make sure
> > +      * we do not create  more than one holes in the stack, and map every
> > +      * page between the current fault  address and the last page that is
> > +      * mapped in the stack.
> > +      */
>
> Can anyone with a sane mind and the most minimal understanding of the
> kernel's inner working explain to me how the kernel can skip "some
> pages" on the stack?
>
> If the kernel skips a whole page or more then there is a serious bug
> somewhere. I might be missing something, but again the "very unlikely"
> wording which handwaves about it is just disgustingly useless.
>
> I disagree with Dave on the RFC status of this series. It's not even
> close to RFC, it's at PoC status.

Absolutely understood. I'm more interested in constructively working
together (as we can see, we'll need your help) to figure out how the
x86 experts want to approach this vs discussing _this_ series. Perhaps
it was my mistake to necro this thread instead of starting a new,
general discussion. Apologies.

To that end, how would you like to proceed? You may understand the x86
complexities better than anyone, so hopefully you can guide this in
the right direction. How would you like us to approach this?

Thanks again for your time, help, and support,
Zach


> Thanks,
>
>         tglx


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-20 19:33                 ` Zach O'Keefe
@ 2026-06-20 19:44                   ` H. Peter Anvin
  2026-06-20 20:01                     ` Zach O'Keefe
  2026-06-20 23:34                   ` Thomas Gleixner
  1 sibling, 1 reply; 41+ messages in thread
From: H. Peter Anvin @ 2026-06-20 19:44 UTC (permalink / raw)
  To: Zach O'Keefe, Thomas Gleixner
  Cc: Dave Hansen, David Stevens, Pasha Tatashin, Linus Walleij,
	Will Deacon, Quentin Perret, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On June 20, 2026 12:33:35 PM PDT, Zach O'Keefe <zokeefe@google.com> wrote:
>On Fri, Jun 19, 2026 at 2:59 PM Thomas Gleixner <tglx@kernel.org> wrote:
>
>Thomas, thanks again for taking the time to look into this and help out.
>
>> Zach!
>>
>> On Fri, Jun 19 2026 at 12:20, Zach O'Keefe wrote:
>> > While it seems common opinion that the IST-based solution is fragile,
>> > what of FRED? It seems like this is exactly the kind of support needed
>> > to avoid some of the aforementioned sw "mess" in various x86 exception
>> > handling paths. I agree that it's less-than-ideal that we are forced
>> > to downgrade exception levels in the common #PF case, but is that an
>> > unsurmountable problem? Pardon my ignorance.
>>
>> The #PF path is considered performance critical. But how much the
>> downgrade matters needs actual numbers to analyze under various workload
>> scenarios.
>
>Ya, that's my concern as well, as I don't have a good intuition for
>how perf critical kernel #PF is for real workloads. If this is your
>primary concern, I'll take that as a _good_ thing; i.e. there's
>nothing architecturally stopping us from doing this downgrade safely.
>We'll still need the analysis, but that can be a later stage -- we're
>more than happy to get this data for all.
>
>> I've not seen numbers to that effect anywhere. The only numbers provided
>> are marketing material about the memory savings on a freshly booted idle
>> machine. There are _zero_ numbers about the actual real world savings,
>> but claims about the PETABYTE savings possible.
>>
>> Seriously?
>
>This is actually the best understood aspect. With O(100B) active tasks
>fleetwide at any point, it only takes an average savings of O(10KiB)
>per task to get to 1PiB. At least for our fleet, we know the % of
>tasks that use only 4KiB, 8KiB, or require the full 16KiB, and the
>math confirms that we expect O(PiB) aggregate savings. The % of stacks
>requiring the full 16KiB is minuscule, but it still occurs at a rate
>higher than what we can tolerate for stack overflow (SO) panics. Given
>that the vast majority of stacks never exceed the first 4KiB, this is
>where the significant opportunity lies.
>
>> > Lastly, I just want to clarify what folks have meant by "extraordinary
>> > claims" or "evidence".  Aside from the above discussion on FRED
>> > exception handling, the "only" other part of this is the allocation.
>>
>> Clearly anything which is explained with "shouldn't happen" and
>> "unlikely". At cloud scale nothing is unlikely anymore. That's simply the
>> reality of statistical math.
>>
>> As I pointed out before the same applies to the unexplained
>> upgrade/downgrade game with external interrupts. Such issues cannot be
>> papered over without understanding the root cause as from decades long
>> experience they come inevitably back some time down the road. Cloud
>> scale even guarantees that.
>>
>> > Are people concerned about memory unavailability, deadlocking-type
>> > issues, or something else? We have considerable design freedom here to
>> > avoid certain classes of unreliability, but—barring any clever
>> > tricks—I don't know if the allocation can be guaranteed to succeed in
>> > all conceivable circumstances. I want to ensure that reality does not
>> > present a hard blocker.
>>
>> First of all the failure scenario has to be clearly defined.
>>
>> Right now, if I'm reading the patches correctly this simply can end up
>> killing the wrong tasks/processes just because an OOM situation results
>> in a depletion of the per CPU cache and the very wrong task which runs
>> into the deep call stack situation ends up in the creek without a paddle.
>>
>> The fact that you even fail to abort a CPU bringup when the allocation
>> of the per CPU stack page cache fails makes it pretty clear that exactly
>> zero thought has been spent on this problem.
>>
>> Why the heck does this cache refill call have to be unconditionally in
>> __schedule() where preemption is disabled and therefore GFP_ATOMIC
>> is mandatory? I know "Works for me" (most of the time).
>>
>> And just because I was looking at the patch in question I found this
>> other insanity:
>>
>> > +     /*
>> > +      * Most likely we faulted in the page right next to the last mapped
>> > +      * page in the stack, however, it is possible (but very unlikely) that
>> > +      * the faulted page is actually skips some pages in the stack. Make sure
>> > +      * we do not create  more than one holes in the stack, and map every
>> > +      * page between the current fault  address and the last page that is
>> > +      * mapped in the stack.
>> > +      */
>>
>> Can anyone with a sane mind and the most minimal understanding of the
>> kernel's inner working explain to me how the kernel can skip "some
>> pages" on the stack?
>>
>> If the kernel skips a whole page or more then there is a serious bug
>> somewhere. I might be missing something, but again the "very unlikely"
>> wording which handwaves about it is just disgustingly useless.
>>
>> I disagree with Dave on the RFC status of this series. It's not even
>> close to RFC, it's at PoC status.
>
>Absolutely understood. I'm more interested in constructively working
>together (as we can see, we'll need your help) to figure out how the
>x86 experts want to approach this vs discussing _this_ series. Perhaps
>it was my mistake to necro this thread instead of starting a new,
>general discussion. Apologies.
>
>To that end, how would you like to proceed? You may understand the x86
>complexities better than anyone, so hopefully you can guide this in
>the right direction. How would you like us to approach this?
>
>Thanks again for your time, help, and support,
>Zach
>
>
>> Thanks,
>>
>>         tglx

1 PiB for a fleet only makes sense in the context of the size of that fleet.

But it's more than that.

You WILL slow down the general case by this stuff, and so how much actual
gain does this imply? What is the mark needed to even get to a break-even
point?

To be honest, this and the multikernel proposal are the worst motivated
massive changes for no demonstrated value I have seen in a very, very long
time.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-20 19:44                   ` H. Peter Anvin
@ 2026-06-20 20:01                     ` Zach O'Keefe
  0 siblings, 0 replies; 41+ messages in thread
From: Zach O'Keefe @ 2026-06-20 20:01 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Thomas Gleixner, Dave Hansen, David Stevens, Pasha Tatashin,
	Linus Walleij, Will Deacon, Quentin Perret, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

Hey Hans,

> 1 PiB for a fleet only makes sense in the context of the size of that fleet.

Of course! But that _is_ a real fleet, not a hypothetical one -- and I
don't see too many news articles about hyperscalers downsizing :)

> But it's more than that.
>
> You WILL slow down the general case by this stuff, and so how much actual gain does this imply? What is the mark needed to even get to a break-even point?

You're absolutely right! Per Thomas' earlier comment, performance
analysis is due. This is a chicken-and-egg problem: we can't
get the data until we have a PoC to work with. We are looking to
engage with upstream early in the project, to determine what a PoC
ought to look like. It benefits no one to invest time and money doing
the reverse. I'd be more than delighted to get these numbers, once we
have a feasible (if conditional) path forward.

> To be honest, this and the multikernel proposal are the worst motivated massive changes for no demonstrated value I have seen in a very, very long time.

;) Glad I get to be memorable!

Appreciate your thoughts, and have a great weekend,
Zach

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-20  5:02                 ` David Stevens
@ 2026-06-20 21:59                   ` Thomas Gleixner
  0 siblings, 0 replies; 41+ messages in thread
From: Thomas Gleixner @ 2026-06-20 21:59 UTC (permalink / raw)
  To: David Stevens
  Cc: Zach O'Keefe, Dave Hansen, H. Peter Anvin, Pasha Tatashin,
	Linus Walleij, Will Deacon, Quentin Perret, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On Fri, Jun 19 2026 at 22:02, David Stevens wrote:
> On Fri, Jun 19, 2026 at 2:59 PM Thomas Gleixner <tglx@kernel.org> wrote:
>> If the kernel skips a whole page or more then there is a serious bug
>> somewhere. I might be missing something, but again the "very unlikely"
>> wording which handwaves about it is just disgustingly useless.
>
> FRAME_WARN accepts values up to 8192 bytes, and it can always be
> ignored or simply disabled. If a stack frame is larger than 4k, then

We should limit that to something sane.

> it's entirely possible for the code and compiler to align in a way
> where the first access in the frame skips a page in the stack. I think
> we agree that such code would be highly suspect and (hopefully) would
> only exist in out-of-tree drivers.

We don't care about out of tree drivers.

> But it's something the kernel build system accepts today. Dynamic
> kernel stacks suddenly turning that into a runtime kernel panic seems
> like exactly the sort of edge case that we would get yelled at for not
> addressing.

That's a good thing. If it breaks in tree code then those people have
finally an incentive to fix the warnings sent by the robots which they
ignored for a long time. If it breaks out of tree code then *SHRUG*.

We guarantee not to break user space, but we don't guarantee anything
for out of tree kernel hacks.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-20  5:25         ` David Stevens
@ 2026-06-20 23:22           ` Dave Hansen
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2026-06-20 23:22 UTC (permalink / raw)
  To: David Stevens
  Cc: David Laight, Pasha Tatashin, Linus Walleij, Will Deacon,
	Quentin Perret, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On 6/19/26 22:25, David Stevens wrote:
>> Needing memory in the middle of schedule() is generally a no-go. But it's
>> a lot better than not being able to continue _execution_ of a kernel
>> thread at *ALL*, possibly in a non-preemptible context, like when you do
>> it in a #PF.
> I don't think this is different from the current proposal from a
> memory allocation standpoint. Both proposals effectively maintain a
> pool of preallocated pages used to fill the current thread's stack.
> They vary substantially in when the pages are put into the page
> tables, but both need to allocate during schedule().

I think you're saying: "Dave, you didn't solve all of our problems for
us." I'd definitely agree. ;)

I thought I wrote it somewhere, but I either deleted it or it got
ignored. I'll repeat: this PoC series has two big, big sticking points:

 1. It requires allocation in very sticky contexts. It's theoretically
    any code that pushes on the stack. That's a *LOT* of the kernel.
    An allocation failure pretty much means the CPU thread is stuck.
 2. Because those pushes happen almost anywhere, a #PF can happen almost
    anywhere, which widens the places #PF needs to be handled. Thus, the
    angst from the x86 maintainers.

I think I've at least hand-waved a potential path to getting rid of
sticking point #2 in its entirety, and reducing the x86 maintainer angst.

My hand waving also reduces the scope of #1. It removes the need to
allocate memory in some crazy interrupt-disabled region in the I/O
driver interrupt handler holding a bunch of locks when a #MC happens
during an NMI while kswapd was running.

So, yeah "both need to allocate during schedule()" is factually correct.
But this PoC needs to allocate successfully *EVERYWHERE*. Virtually all
kernel code paths, modulo some very very special areas.

Are you saying that as an engineering principle you see needing to
guarantee allocation success of 12k at "virtually all kernel code paths"
and "schedule()" as equivalent barriers to solving the problem at hand
because they're both non-zero in size?

I suspect not. But it's kinda coming off that way. A bit of coaching for
dealing with grumpy time-constrained maintainers: if they take their
time to help you solve their problem, don't spend undue effort pointing
out the engineering compromises in their proposals. Take more time to
consider the engineering tradeoffs as opposed to simply arguing a lack
of utter perfection.

But, really, my big takeaway from this thread is that the folks pushing
dynamic kernel stacks have a very limited understanding of upstream or
what its priorities are. Probably the single biggest obstacle here is
going to be proving to the long-term maintainers that this isn't another
dump and run operation. I suspect the x86 folks are going to be a bit
more amenable in that territory than our mm friends. <cough>MGLRU<cough>

Either way, welcome to the party! If you want to come help upstream,
there are always patches to review and always bugs to fix.


* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
  2026-06-20 19:33                 ` Zach O'Keefe
  2026-06-20 19:44                   ` H. Peter Anvin
@ 2026-06-20 23:34                   ` Thomas Gleixner
  1 sibling, 0 replies; 41+ messages in thread
From: Thomas Gleixner @ 2026-06-20 23:34 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Dave Hansen, H. Peter Anvin, David Stevens, Pasha Tatashin,
	Linus Walleij, Will Deacon, Quentin Perret, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
	linux-kernel, linux-mm

On Sat, Jun 20 2026 at 12:33, Zach O'Keefe wrote:
> On Fri, Jun 19, 2026 at 2:59 PM Thomas Gleixner <tglx@kernel.org> wrote:
>> The #PF path is considered performance critical. But how much the
>> downgrade matters needs actual numbers to analyze under various workload
>> scenarios.
>
> Ya, that's my concern as well, as I don't have a good intuition for
> how perf critical kernel #PF is for real workloads. If this is your
> primary concern, I'll take that as a _good_ thing; i.e. there's
> nothing architecturally stopping us from doing this downgrade safely.
> We'll still need the analysis, but that can be a later stage -- we're
> more than happy to get this data for all.

No. That's not a later stage optional requirement.

You have a PoC which works for you, otherwise you wouldn't have posted
it. So you can trivially microbenchmark the costs of the
up/downgrade. And that's critical information for us but also for
you. If the costs are significant then you really have to think about
the tradeoffs.

Care to read Documentation/process/* carefully? It applies to you as it
applies to anyone else.

>> I've not seen numbers to that effect anywhere. The only numbers provided
>> are marketing material about the memory savings on a freshly booted idle
>> machine. There are _zero_ numbers about the actual real world savings,
>> but claims about the PETABYTE savings possible.
>>
>> Seriously?
>
> This is actually the most understood aspect. With O(100B) active tasks
> fleetwide at any point, it only takes an average savings of O(10KiB)
> per task to get to 1PiB. At least for our fleet, we know the % of
> tasks that use only 4KiB, 8KiB, or require the full 16KiB, and the
> math confirms that we expect O(PiB) aggregate savings. The % of stacks
> requiring the full 16KiB is minuscule, but it still occurs at a rate
> higher than what we can tolerate for SO panics. Given the vast
> majority of stacks never exceed the first 4KiB, this enables the
> significant opportunity.

I know that the potential savings are well understood and my
understanding of math is sufficient to calculate how many tasks and
what average saving it takes to save 1PiB on a fleet.

That's a no-brainer, but this is an aggregate saving, which sounds WOW
but does not tell much about anything else.

 1) What's the actual percentage of savings in relation to the overall
    memory?

 2) Does the saving allow you to get more stuff done on a machine, pack
    more threads on it?

 3) Can you actually downsize the memory on the machines?

 4) What is the performance tradeoff for that?

IOW, you fail to tell what the actual benefit of such an intrusive
change is. Just boasting an aggregate Petabyte number does not tell
anything at all.

Let me give you a trivial example with a scenario which I have access
to:

    256  CPUs
    256  GiB Memory
    64k  Threads

Let's assume the full saving of 12k per thread. That sums up to

      64k * 12k = 768MB of memory

which is 0.29% of the total 256 GiB of memory. Not as impressive as the
petabyte aggregate number, right?
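
The arithmetic above can be reproduced with a short sketch. It assumes
that "64k" means 65536 threads and "12k" means 12 KiB of stack saved per
thread; the function name is made up for illustration.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def stack_saving(threads, saved_per_thread_bytes, total_memory_bytes):
    """Return (bytes saved, saving as a fraction of total memory)."""
    saved = threads * saved_per_thread_bytes
    return saved, saved / total_memory_bytes


# Assumed reading of the example: 64k threads, 12 KiB each, 256 GiB total.
saved, fraction = stack_saving(64 * 1024, 12 * KIB, 256 * GIB)
print(f"{saved // MIB} MiB saved, {fraction:.2%} of total memory")
```

Plugging in other thread counts or memory sizes shows how strongly the
percentage depends on the ratio of threads to memory per machine.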

The workload consumes about 80% of the overall memory and is already
constrained at close to 100% CPU utilization.

Now let's assume that the runtime overhead of this amounts to 1%. Then
this is a net loss.

Let me turn that around and use a made up example assuming the 1M
threads per compute unit taken from some reply in this thread.

Now the full saving of 12k per thread amounts to:

    1M * 12k = 12G

which is 4.7% of the overall available memory. Agreed that's a
substantial number.

That 12G saving does not do anything in terms of hardware downsizing.

The only way that has a benefit is when the system is constrained by
overall memory consumption, but has quite some compute capacity left.

IOW, if 1M threads hit the memory limit that means that the savings in
kernel stack consumed memory allows you to add about 4% (~40k) more
threads. If that ups the CPU utilization accordingly then yes, I can see
the benefit. But TBH, if that's the case then you are trying to fix a
user space implementation problem in the kernel.
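
As a rough sketch of the headroom estimate above: assume memory is the
only limit, that the 256 GiB machine from the first example is exactly
full at 1M (2^20) threads, and that every thread costs the same amount of
memory. These are simplifying assumptions, and the helper name is
hypothetical.

```python
KIB = 1024
GIB = 1024 ** 3


def extra_threads(total_memory_bytes, threads, saved_per_thread_bytes):
    """Estimate how many more equal-sized threads fit after the saving."""
    per_thread = total_memory_bytes // threads      # footprint at the limit
    slimmer = per_thread - saved_per_thread_bytes   # footprint after saving
    return total_memory_bytes // slimmer - threads


threads = 1024 * 1024          # assumed meaning of "1M threads"
total = 256 * GIB              # assumed machine size, from the first example
saved = threads * 12 * KIB     # full 12 KiB saving on every thread

print(f"saved: {saved // GIB} GiB ({saved / total:.1%} of memory)")
print(f"extra threads: {extra_threads(total, threads, 12 * KIB)}")
```

Under these assumptions the estimate lands at a few percent more
threads, the same order of magnitude as the rough figure above. Real
workloads rarely have uniform per-thread footprints, so this is only a
back-of-the-envelope bound.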

That said, you really have to describe the scenarios where there is a
benefit. I do not buy this "fleet level" argument at all, because
there is no single fleet which has a uniform workload distribution.

Aside from that, if your argument holds that there are only a few
scenarios which require a deep stack, then we are better off identifying
them and fixing them up rather than trying to hack around the occasional
insanity of deep stack usage by adding complexity for complexity's sake.

As you say that you have numbers from your fleet which confirm that the
vast majority of the stack depth is below 4k, you can surely figure out
which call chains are actually exceeding the limit.

I prefer to fix such shitty code and downgrade the stacksize in general
instead of papering over the underlying issues which probably have been
ignored for years if not decades.

Have you ever thought about that instead of adding complexity of
dubious value?

Thanks,

        tglx

