* [PATCH v2 01/13] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 02/13] fork: Don't assume fully populated stack during reuse David Stevens
` (12 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
From: Pasha Tatashin <pasha.tatashin@soleen.com>
In many places, the number of pages in the stack is determined via
(THREAD_SIZE / PAGE_SIZE). There is also a BUG_ON() that ensures that
(THREAD_SIZE / PAGE_SIZE) indeed equals vm_area->nr_pages.
However, with dynamic stacks the number of pages in the vm_area will grow
with the stack, so use vm_area->nr_pages to determine the actual number of
pages allocated for the stack.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Rebased, also skipped intermediary helper variable nr_pages]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David Stevens <stevensd@google.com>
---
kernel/fork.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index bc2bf58b93b6..8961b895bf05 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -312,9 +312,7 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm_area)
int ret;
int nr_charged = 0;
- BUG_ON(vm_area->nr_pages != THREAD_SIZE / PAGE_SIZE);
-
- for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+ for (i = 0; i < vm_area->nr_pages; i++) {
ret = memcg_kmem_charge_page(vm_area->pages[i], GFP_KERNEL, 0);
if (ret)
goto err;
@@ -484,7 +482,7 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
struct vm_struct *vm_area = task_stack_vm_area(tsk);
int i;
- for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+ for (i = 0; i < vm_area->nr_pages; i++)
mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
account * (PAGE_SIZE / 1024));
} else {
@@ -505,7 +503,7 @@ void exit_task_stack_account(struct task_struct *tsk)
int i;
vm_area = task_stack_vm_area(tsk);
- for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+ for (i = 0; i < vm_area->nr_pages; i++)
memcg_kmem_uncharge_page(vm_area->pages[i], 0);
}
}
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 02/13] fork: Don't assume fully populated stack during reuse
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
2026-04-24 19:14 ` [PATCH v2 01/13] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 03/13] fork: Move vm_stack to the beginning of the stack David Stevens
` (11 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
In preparation for dynamic kernel stacks, don't assume that the stack is
fully populated (i.e. that vm_area->nr_pages * PAGE_SIZE equals THREAD_SIZE)
when clearing a stack for reuse.
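For illustration, with a 16K THREAD_SIZE, 4K pages, and a grows-down stack
that has a single populated page at the top, the clearing below works out to
(sketch, not part of the diff):
    memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;  /* 16K - 4K = 12K */
    memset(stack + memset_offset, 0,
           vm_area->nr_pages * PAGE_SIZE);  /* clears only [stack + 12K, stack + 16K) */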
Signed-off-by: David Stevens <stevensd@google.com>
---
kernel/fork.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 8961b895bf05..50772c0cc5da 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -332,6 +332,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
vm_area = alloc_thread_stack_node_from_cache(tsk, node);
if (vm_area) {
+ unsigned long memset_offset = 0;
+
if (memcg_charge_kernel_stack(vm_area)) {
vfree(vm_area->addr);
return -ENOMEM;
@@ -343,7 +345,9 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
stack = kasan_reset_tag(vm_area->addr);
/* Clear stale pointers from reused stack. */
- memset(stack, 0, THREAD_SIZE);
+ if (!IS_ENABLED(CONFIG_STACK_GROWSUP))
+ memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;
+ memset(stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
tsk->stack_vm_area = vm_area;
tsk->stack = stack;
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 03/13] fork: Move vm_stack to the beginning of the stack
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
2026-04-24 19:14 ` [PATCH v2 01/13] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE David Stevens
2026-04-24 19:14 ` [PATCH v2 02/13] fork: Don't assume fully populated stack during reuse David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 04/13] fork: separate vmap stack allocation and free calls David Stevens
` (10 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
The vm_stack struct used to free stacks via an RCU callback is stored
directly in the stack being freed. Make sure it is stored at the in-use
end of the stack regardless of the stack growth direction (i.e. at the
highest address for a grows-down stack), since the unused end of a
partially allocated dynamic stack may not be populated and writing there
would fault.
Signed-off-by: David Stevens <stevensd@google.com>
---
kernel/fork.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 50772c0cc5da..72c081db492c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -282,7 +282,12 @@ static void thread_stack_free_rcu(struct rcu_head *rh)
static void thread_stack_delayed_free(struct task_struct *tsk)
{
- struct vm_stack *vm_stack = tsk->stack;
+ struct vm_stack *vm_stack;
+
+ if (IS_ENABLED(CONFIG_STACK_GROWSUP))
+ vm_stack = tsk->stack;
+ else
+ vm_stack = tsk->stack + THREAD_SIZE - sizeof(*vm_stack);
vm_stack->stack_vm_area = tsk->stack_vm_area;
call_rcu(&vm_stack->rcu, thread_stack_free_rcu);
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 04/13] fork: separate vmap stack allocation and free calls
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (2 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 03/13] fork: Move vm_stack to the beginning of the stack David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 05/13] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public functions David Stevens
` (9 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
From: Pasha Tatashin <pasha.tatashin@soleen.com>
In preparation for dynamic stacks, separate out the __vmalloc_node()
and vfree() calls from the vmap-based stack allocations. The dynamic
stacks will use their own variants of these functions.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Fix a bug in original patch: free_vmap_stack(vm_area->addr)]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[Add missing free_vmap_stack conversion, fix typos, rebase]
Signed-off-by: David Stevens <stevensd@google.com>
---
kernel/fork.c | 40 ++++++++++++++++++++++++----------------
1 file changed, 24 insertions(+), 16 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 72c081db492c..8bf32815f422 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -269,6 +269,21 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
return false;
}
+static inline struct vm_struct *alloc_vmap_stack(int node)
+{
+ void *stack;
+
+ stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, GFP_VMAP_STACK,
+ node, __builtin_return_address(0));
+
+ return stack ? find_vm_area(stack) : NULL;
+}
+
+static inline void free_vmap_stack(struct vm_struct *vm_area)
+{
+ vfree(vm_area->addr);
+}
+
static void thread_stack_free_rcu(struct rcu_head *rh)
{
struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
@@ -277,7 +292,7 @@ static void thread_stack_free_rcu(struct rcu_head *rh)
if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
return;
- vfree(vm_area->addr);
+ free_vmap_stack(vm_area);
}
static void thread_stack_delayed_free(struct task_struct *tsk)
@@ -304,7 +319,7 @@ static int free_vm_stack_cache(unsigned int cpu)
if (!vm_area)
continue;
- vfree(vm_area->addr);
+ free_vmap_stack(vm_area);
cached_vm_stack_areas[i] = NULL;
}
@@ -333,41 +348,35 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm_area)
static int alloc_thread_stack_node(struct task_struct *tsk, int node)
{
struct vm_struct *vm_area;
- void *stack;
vm_area = alloc_thread_stack_node_from_cache(tsk, node);
if (vm_area) {
unsigned long memset_offset = 0;
if (memcg_charge_kernel_stack(vm_area)) {
- vfree(vm_area->addr);
+ free_vmap_stack(vm_area);
return -ENOMEM;
}
/* Reset stack metadata. */
kasan_unpoison_range(vm_area->addr, THREAD_SIZE);
-
- stack = kasan_reset_tag(vm_area->addr);
+ tsk->stack = kasan_reset_tag(vm_area->addr);
/* Clear stale pointers from reused stack. */
if (!IS_ENABLED(CONFIG_STACK_GROWSUP))
memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;
- memset(stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
+ memset(tsk->stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
tsk->stack_vm_area = vm_area;
- tsk->stack = stack;
return 0;
}
- stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN,
- GFP_VMAP_STACK,
- node, __builtin_return_address(0));
- if (!stack)
+ vm_area = alloc_vmap_stack(node);
+ if (!vm_area)
return -ENOMEM;
- vm_area = find_vm_area(stack);
if (memcg_charge_kernel_stack(vm_area)) {
- vfree(stack);
+ free_vmap_stack(vm_area);
return -ENOMEM;
}
/*
@@ -376,8 +385,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
* so cache the vm_struct.
*/
tsk->stack_vm_area = vm_area;
- stack = kasan_reset_tag(stack);
- tsk->stack = stack;
+ tsk->stack = kasan_reset_tag(vm_area->addr);
return 0;
}
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 05/13] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public functions
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (3 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 04/13] fork: separate vmap stack allocation and free calls David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 06/13] fork: Move vmap stack freeing to work queue David Stevens
` (8 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
From: Pasha Tatashin <pasha.tatashin@soleen.com>
get_vm_area_node()
Unlike the other public get_vm_area_* variants, this one accepts the node
from which to allocate the data structure, as well as the alignment, which
allows creating a vm area with a specific alignment.
This call is going to be used by dynamic stacks in order to ensure that
the stack VM area has a specific alignment, so that even if only one page
is mapped, no page table allocations are needed to map the remaining
stack pages.
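For reference, this is roughly how a later patch in this series reserves a
dynamic stack area with it (sketch):
    vm_area = get_vm_area_node(THREAD_SIZE, THREAD_ALIGN, VM_MAP, node,
                               GFP_KERNEL | __GFP_ZERO,
                               __builtin_return_address(0));
    if (!vm_area)
            return NULL;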
vmap_pages_range()
We will need it from kernel/fork.c in order to map the initial stack
pages, so export the function and add a forward declaration of this
function to the linux/vmalloc.h header.
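A sketch of the intended use from fork.c, mapping the pre-allocated pages at
the top of a grows-down stack (THREAD_DYNAMIC_PAGES is defined in a later
patch of this series):
    addr = (unsigned long)vm_area->addr + (THREAD_DYNAMIC_PAGES << PAGE_SHIFT);
    end  = (unsigned long)vm_area->addr + THREAD_SIZE;
    err  = vmap_pages_range(addr, end, PAGE_KERNEL, vm_area->pages, PAGE_SHIFT);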
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[Switched to vmap_pages_range instead of noflush variant, fix typos]
Signed-off-by: David Stevens <stevensd@google.com>
---
include/linux/vmalloc.h | 14 ++++++++++++++
mm/vmalloc.c | 25 +++++++++++++++++++++++++
2 files changed, 39 insertions(+)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index e8e94f90d686..7b56a0b998ab 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -250,6 +250,9 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size,
unsigned long flags,
unsigned long start, unsigned long end,
const void *caller);
+struct vm_struct *get_vm_area_node(unsigned long size, unsigned long align,
+ unsigned long flags, int node, gfp_t gfp,
+ const void *caller);
void free_vm_area(struct vm_struct *area);
extern struct vm_struct *remove_vm_area(const void *addr);
extern struct vm_struct *find_vm_area(const void *addr);
@@ -301,11 +304,22 @@ static inline void set_vm_flush_reset_perms(void *addr)
if (vm)
vm->flags |= VM_FLUSH_RESET_PERMS;
}
+
+int __must_check vmap_pages_range(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, unsigned int page_shift);
+
#else /* !CONFIG_MMU */
#define VMALLOC_TOTAL 0UL
static inline unsigned long vmalloc_nr_pages(void) { return 0; }
static inline void set_vm_flush_reset_perms(void *addr) {}
+static inline
+int __must_check vmap_pages_range(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+ return -EINVAL;
+}
+
#endif /* CONFIG_MMU */
#if defined(CONFIG_MMU) && defined(CONFIG_SMP)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 61caa55a4402..39b7e118cbce 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -722,6 +722,7 @@ int vmap_pages_range(unsigned long addr, unsigned long end,
{
return __vmap_pages_range(addr, end, prot, pages, page_shift, GFP_KERNEL);
}
+EXPORT_SYMBOL_GPL(vmap_pages_range);
static int check_sparse_vm_area(struct vm_struct *area, unsigned long start,
unsigned long end)
@@ -3285,6 +3286,30 @@ struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags,
NUMA_NO_NODE, GFP_KERNEL, caller);
}
+/**
+ * get_vm_area_node - reserve a contiguous and aligned kernel virtual area
+ * @size: size of the area
+ * @align: alignment of the start address of the area
+ * @flags: %VM_IOREMAP for I/O mappings
+ * @node: NUMA node from which to allocate the area data structure
+ * @gfp: Flags to pass to the allocator
+ * @caller: Caller to be stored in the vm area data structure
+ *
+ * Search for an area of @size/@align in the kernel virtual mapping area,
+ * allocating the area descriptor on @node, and reserve it for our
+ * purposes.
+ *
+ * Return: the area descriptor on success or %NULL on failure.
+ */
+struct vm_struct *get_vm_area_node(unsigned long size, unsigned long align,
+ unsigned long flags, int node, gfp_t gfp,
+ const void *caller)
+{
+ return __get_vm_area_node(size, align, PAGE_SHIFT, flags,
+ VMALLOC_START, VMALLOC_END,
+ node, gfp, caller);
+}
+
/**
* find_vm_area - find a continuous kernel virtual area
* @addr: base address
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 06/13] fork: Move vmap stack freeing to work queue
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (4 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 05/13] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public functions David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 07/13] fork: Dynamic Kernel Stacks David Stevens
` (7 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
For vmap stacks that are not immediately released into the stack cache,
free them from a workqueue instead of via call_rcu(). In an RCU callback
context, vfree() already schedules the actual freeing on the per-cpu system
workqueue, so this change only affects exactly when the second attempt to
put the stack into the stack cache occurs.
Moving the freeing to a workqueue will allow dynamic stacks to be freed in
a sleepable context (needed for remove_vm_area()), rather than relying on
vfree() dispatching to a workqueue via vfree_atomic().
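For reference, the before/after pattern (sketch; names as in the hunk below):
    /* Before: the callback runs from RCU (softirq) context after a grace period. */
    call_rcu(&vm_stack->rcu, thread_stack_free_rcu);

    /* After: the work item still runs only after an RCU grace period has
     * elapsed, but in process context on the system workqueue, where it
     * may sleep. */
    INIT_RCU_WORK(&vm_stack->work, thread_stack_free_work);
    queue_rcu_work(system_wq, &vm_stack->work);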
Signed-off-by: David Stevens <stevensd@google.com>
---
kernel/fork.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 8bf32815f422..01e0bf4f4b02 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -205,7 +205,7 @@ static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
#define GFP_VMAP_STACK (GFP_KERNEL | __GFP_ZERO)
struct vm_stack {
- struct rcu_head rcu;
+ struct rcu_work work;
struct vm_struct *stack_vm_area;
};
@@ -284,9 +284,9 @@ static inline void free_vmap_stack(struct vm_struct *vm_area)
vfree(vm_area->addr);
}
-static void thread_stack_free_rcu(struct rcu_head *rh)
+static void thread_stack_free_work(struct work_struct *work)
{
- struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
+ struct vm_stack *vm_stack = container_of(to_rcu_work(work), struct vm_stack, work);
struct vm_struct *vm_area = vm_stack->stack_vm_area;
if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
@@ -305,7 +305,8 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
vm_stack = tsk->stack + THREAD_SIZE - sizeof(*vm_stack);
vm_stack->stack_vm_area = tsk->stack_vm_area;
- call_rcu(&vm_stack->rcu, thread_stack_free_rcu);
+ INIT_RCU_WORK(&vm_stack->work, thread_stack_free_work);
+ queue_rcu_work(system_wq, &vm_stack->work);
}
static int free_vm_stack_cache(unsigned int cpu)
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 07/13] fork: Dynamic Kernel Stacks
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (5 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 06/13] fork: Move vmap stack freeing to work queue David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 08/13] task_stack.h: Add stack_not_used() support for dynamic stack David Stevens
` (6 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
From: Pasha Tatashin <pasha.tatashin@soleen.com>
The core implementation of dynamic kernel stacks.
Unlike traditional kernel stacks, these stacks grow automatically as they
are used. This saves a significant amount of memory in fleet environments.
It also potentially allows the default kernel stack size to be increased
to prevent stack overflows, without increasing the overall memory
overhead.
The dynamic kernel stack interface provides two global functions:
1. dynamic_stack_fault()
Architectures that support dynamic kernel stacks must call this function
to handle a fault on the stack. It allocates and maps new pages into the
stack. The pages are maintained in a per-cpu data structure.
2. dynamic_stack()
Must be called as a thread leaves the CPU, to check whether the thread has
allocated dynamic stack pages (i.e. PF_DYNAMIC_STACK is set in tsk->flags).
If so, two things need to be done (a rough sketch of both call sites
follows below):
a. Charge the thread for the allocated stack pages.
b. Refill the per-cpu array so the next thread can also fault.
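A rough sketch of the two call sites (illustrative only;
handle_kernel_stack_fault() is a hypothetical name, the actual x86 wiring
is added later in this series, and the scheduler hook matches the
kernel/sched/core.c hunk below):
    /* Arch fault-handler side (hypothetical helper name): */
    static bool handle_kernel_stack_fault(unsigned long address)
    {
            bool on_stack;

            /* Maps the missing stack page(s) from the per-cpu pool. */
            if (dynamic_stack_fault(current, address, &on_stack))
                    return true;    /* handled, retry the faulting access */

            /* on_stack reports whether the address was on this task's stack. */
            return false;
    }

    /* Scheduler side, as the previous task leaves the CPU: */
    prev = rq->curr;
    dynamic_stack(prev);    /* charge faulted pages and refill the pool */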
Dynamic kernel stacks do not support STACK_END_MAGIC, as the last page
does not have to be faulted in. However, since they are based on vmap
stacks, the guard pages still protect the dynamic kernel stacks from
overflow.
The average stack depth of a kernel thread depends on the workload,
profiling, virtualization, compiler optimizations, and driver
implementations. Therefore, the numbers should be measured for each
specific workload. From my tests, I found the following values on freshly
booted, idle machines:
CPU            #Cores  #Stacks  Regular(kb)  Dynamic(kb)
AMD Genoa         384     5786        92576        23388
Intel Skylake     112     3182        50912        12860
AMD Rome          128     3401        54416        14784
AMD Rome          256     4908        78528        20876
Intel Haswell      72     2644        42304        10624
On all machines, dynamic kernel stacks take about 25% of the original
stack memory. Only 5% of active tasks performed a stack page fault during
their lifetime.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Rebased, used vm_area->nr_pages directly in one instance]
[Depends on !PREEMPT_RT]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[Fix races around accounting]
[Use GFP_ATOMIC when executing in the scheduler]
[Depend on INIT_STACK_ALL_* config]
[Fix bugs in some error paths and edge cases]
[Don't cache partially faulted stacks]
[Added out-var to tell if address is on target stack]
Signed-off-by: David Stevens <stevensd@google.com>
---
arch/Kconfig | 39 ++++
include/linux/sched.h | 11 +-
include/linux/sched/task_stack.h | 47 +++-
init/init_task.c | 4 +
kernel/fork.c | 357 +++++++++++++++++++++++++++++--
kernel/sched/core.c | 1 +
6 files changed, 439 insertions(+), 20 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 102ddbd4298e..95ded79f0825 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1515,6 +1515,45 @@ config VMAP_STACK
backing virtual mappings with real shadow memory, and KASAN_VMALLOC
must be enabled.
+config HAVE_ARCH_DYNAMIC_STACK
+ def_bool n
+ help
+ An arch should select this symbol if it can support kernel stacks
+ that grow dynamically.
+
+ - Arch must have support for HAVE_ARCH_VMAP_STACK, in order to handle
+ stack related page faults.
+
+ - Arch must be able to fault from interrupt context.
+
+ - Arch must allow the kernel to handle stack faults gracefully, even
+ during interrupt handling.
+
+ - Exceptions such as no pages being available must be handled in a
+ consistent and predictable way, i.e. the same way as a stack
+ overflow into the guard pages, but with extra information about
+ the allocation error.
+
+config DYNAMIC_STACK
+ default y
+ bool "Dynamically grow kernel stacks"
+ depends on THREAD_INFO_IN_TASK
+ depends on HAVE_ARCH_DYNAMIC_STACK
+ depends on VMAP_STACK
+ depends on INIT_STACK_ALL_ZERO || INIT_STACK_ALL_PATTERN
+ depends on !KASAN
+ depends on !DEBUG_STACK_USAGE
+ depends on !STACK_GROWSUP
+ depends on !PREEMPT_RT
+ help
+ Dynamic kernel stacks save memory on machines with many threads by
+ starting with small stacks and growing them only when needed. On
+ workloads where most stacks never grow beyond one page, the memory
+ savings can be substantial. The feature requires virtually mapped
+ kernel stacks in order to handle page faults. It also requires stack
+ initialization to preclude one thread from faulting on another
+ thread's stack.
+
config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
def_bool n
help
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5a5d3dbc9cdf..7aa06233afd5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -836,7 +836,11 @@ struct task_struct {
*/
randomized_struct_fields_start
+#ifdef CONFIG_DYNAMIC_STACK
+ unsigned long packed_stack;
+#else
void *stack;
+#endif
refcount_t usage;
/* Per task flags (PF_*), defined further below: */
unsigned int flags;
@@ -1563,6 +1567,11 @@ struct task_struct {
struct timer_list oom_reaper_timer;
#endif
#ifdef CONFIG_VMAP_STACK
+ /*
+ * We can't call find_vm_area() in interrupt context, and
+ * free_thread_stack() can be called in interrupt context,
+ * so cache the vm_struct.
+ */
struct vm_struct *stack_vm_area;
#endif
#ifdef CONFIG_THREAD_INFO_IN_TASK
@@ -1773,7 +1782,7 @@ extern struct pid *cad_pid;
* I am cleaning dirty pages from some other bdi. */
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
-#define PF__HOLE__00800000 0x00800000
+#define PF_DYNAMIC_STACK 0x00800000 /* This thread allocated dynamic stack pages */
#define PF__HOLE__01000000 0x01000000
#define PF__HOLE__02000000 0x02000000
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 1fab7e9043a3..7dcff2836d7e 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -13,6 +13,10 @@
#ifdef CONFIG_THREAD_INFO_IN_TASK
+#ifdef CONFIG_DYNAMIC_STACK
+#define DYNAMIC_STACK_MAX_ACCOUNT_MASK ((1 << (THREAD_SIZE_ORDER + 1)) - 1)
+#endif
+
/*
* When accessing the stack of a non-current task that might exit, use
* try_get_task_stack() instead. task_stack_page will return a pointer
@@ -20,7 +24,11 @@
*/
static __always_inline void *task_stack_page(const struct task_struct *task)
{
+#ifdef CONFIG_DYNAMIC_STACK
+ return (void *)(task->packed_stack & ~DYNAMIC_STACK_MAX_ACCOUNT_MASK);
+#else
return task->stack;
+#endif
}
#define setup_thread_stack(new,old) do { } while(0)
@@ -30,7 +38,7 @@ static __always_inline unsigned long *end_of_stack(const struct task_struct *tas
#ifdef CONFIG_STACK_GROWSUP
return (unsigned long *)((unsigned long)task->stack + THREAD_SIZE) - 1;
#else
- return task->stack;
+ return task_stack_page(task);
#endif
}
@@ -83,9 +91,45 @@ static inline void put_task_stack(struct task_struct *tsk) {}
void exit_task_stack_account(struct task_struct *tsk);
+#ifdef CONFIG_DYNAMIC_STACK
+
+#define task_stack_end_corrupted(task) 0
+
+#ifndef THREAD_PREALLOC_PAGES
+#define THREAD_PREALLOC_PAGES 1
+#endif
+
+#define THREAD_DYNAMIC_PAGES \
+ ((THREAD_SIZE >> PAGE_SHIFT) - THREAD_PREALLOC_PAGES)
+
+void dynamic_stack_refill_pages(void);
+unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize);
+bool dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack);
+
+/*
+ * Refill and charge for the used pages.
+ */
+static inline void dynamic_stack(struct task_struct *tsk)
+{
+ if (unlikely(tsk->flags & PF_DYNAMIC_STACK)) {
+ dynamic_stack_refill_pages();
+ dynamic_stack_accounting(tsk, false);
+ tsk->flags &= ~PF_DYNAMIC_STACK;
+ }
+}
+
+static inline void set_task_stack_end_magic(struct task_struct *tsk) {}
+
+#else /* !CONFIG_DYNAMIC_STACK */
+
#define task_stack_end_corrupted(task) \
(*(end_of_stack(task)) != STACK_END_MAGIC)
+void set_task_stack_end_magic(struct task_struct *tsk);
+static inline void dynamic_stack(struct task_struct *tsk) {}
+
+#endif /* CONFIG_DYNAMIC_STACK */
+
static inline int object_is_on_stack(const void *obj)
{
void *stack = task_stack_page(current);
@@ -104,7 +148,6 @@ static inline unsigned long stack_not_used(struct task_struct *p)
return 0;
}
#endif
-extern void set_task_stack_end_magic(struct task_struct *tsk);
static inline int kstack_end(void *addr)
{
diff --git a/init/init_task.c b/init/init_task.c
index 5c838757fc10..e3645ec4ab02 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -99,7 +99,11 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.stack_refcount = REFCOUNT_INIT(1),
#endif
.__state = 0,
+#ifdef CONFIG_DYNAMIC_STACK
+ .packed_stack = (unsigned long)init_stack,
+#else
.stack = init_stack,
+#endif
.usage = REFCOUNT_INIT(2),
.flags = PF_KTHREAD,
.prio = MAX_PRIO - 20,
diff --git a/kernel/fork.c b/kernel/fork.c
index 01e0bf4f4b02..e615ef736dc0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -202,7 +202,10 @@ static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
* accounting is performed by the code assigning/releasing stacks to tasks.
* We need a zeroed memory without __GFP_ACCOUNT.
*/
-#define GFP_VMAP_STACK (GFP_KERNEL | __GFP_ZERO)
+static gfp_t vmap_stack_gfp(bool is_atomic)
+{
+ return (is_atomic ? GFP_ATOMIC : GFP_KERNEL) | __GFP_ZERO;
+}
struct vm_stack {
struct rcu_work work;
@@ -241,6 +244,18 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
unsigned int i;
int nid;
+#ifdef CONFIG_DYNAMIC_STACK
+ /*
+ * Skip the cache for populated dynamic stacks to avoid punishing a
+ * memcg with a larger charge just because it happened to pick up a
+ * dynamic stack that's been partially faulted in. We may get a lower
+ * number of cache hits, but stacks with dynamically faulted pages
+ * should be fairly uncommon.
+ */
+ if (vm_area->nr_pages != THREAD_PREALLOC_PAGES)
+ return false;
+#endif /* CONFIG_DYNAMIC_STACK */
+
/*
* Don't cache stacks if any of the pages don't match the local domain, unless
* there is no local memory to begin with.
@@ -269,11 +284,285 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
return false;
}
+#ifdef CONFIG_DYNAMIC_STACK
+
+/*
+ * There is a window between when a thread refills the page pool and when it
+ * actually gets scheduled out where it can still consume pages from the pool.
+ * To guarantee the next thread has enough pages to fully populate its stack,
+ * double the size of the page pool.
+ */
+#define DYNSTK_PAGE_POOL_NR (THREAD_DYNAMIC_PAGES * 2)
+
+static DEFINE_PER_CPU(struct page *, dynamic_stack_pages[DYNSTK_PAGE_POOL_NR]);
+
+static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_struct *vm_area)
+{
+ tsk->stack_vm_area = vm_area;
+ tsk->packed_stack = (unsigned long)kasan_reset_tag(vm_area->addr);
+}
+
+static void free_vmap_stack(struct vm_struct *vm_area)
+{
+ int i;
+
+ remove_vm_area(vm_area->addr);
+
+ for (i = 0; i < vm_area->nr_pages; i++)
+ __free_page(vm_area->pages[i]);
+
+ kfree(vm_area->pages);
+ kfree(vm_area);
+}
+
+static struct vm_struct *alloc_vmap_stack(int node)
+{
+ gfp_t gfp = vmap_stack_gfp(false);
+ unsigned long addr, end;
+ struct vm_struct *vm_area;
+ int err, i;
+
+ /*
+ * Paranoid check to guarantee we never straddle a page table, so
+ * that virt_to_kpte() is always valid in dynamic_stack_fault().
+ */
+ BUILD_BUG_ON((PMD_SIZE % THREAD_SIZE) || (THREAD_ALIGN % THREAD_SIZE));
+
+ vm_area = get_vm_area_node(THREAD_SIZE, THREAD_ALIGN, VM_MAP, node,
+ gfp, __builtin_return_address(0));
+ if (!vm_area)
+ return NULL;
+
+ vm_area->pages = kmalloc_node(sizeof(void *) *
+ (THREAD_SIZE >> PAGE_SHIFT), gfp, node);
+ if (!vm_area->pages)
+ goto cleanup_err;
+
+ for (i = 0; i < THREAD_PREALLOC_PAGES; i++) {
+ vm_area->pages[i] = alloc_pages(gfp, 0);
+ if (!vm_area->pages[i])
+ goto cleanup_err;
+ vm_area->nr_pages++;
+ }
+
+ addr = (unsigned long)vm_area->addr +
+ (THREAD_DYNAMIC_PAGES << PAGE_SHIFT);
+ end = (unsigned long)vm_area->addr + THREAD_SIZE;
+ err = vmap_pages_range(addr, end, PAGE_KERNEL, vm_area->pages, PAGE_SHIFT);
+ if (err)
+ goto cleanup_err;
+
+ return vm_area;
+cleanup_err:
+ free_vmap_stack(vm_area);
+ return NULL;
+}
+
+static struct page *noinstr dynamic_stack_get_page(void)
+{
+ struct page **pages = this_cpu_ptr(dynamic_stack_pages);
+ int i;
+
+ for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+ struct page *page = pages[i];
+
+ if (!page)
+ continue;
+ pages[i] = NULL;
+ return page;
+ }
+
+ return NULL;
+}
+
+static int dynamic_stack_refill_pages_cpu(unsigned int cpu)
+{
+ struct page **pages = per_cpu_ptr(dynamic_stack_pages, cpu);
+ int i;
+
+ for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+ if (pages[i])
+ continue;
+ pages[i] = alloc_pages(vmap_stack_gfp(false), 0);
+ if (unlikely(!pages[i])) {
+ pr_err("failed to allocate dynamic stack page for cpu[%d]\n",
+ cpu);
+ break;
+ }
+ }
+
+ return 0;
+}
+
+static int dynamic_stack_free_pages_cpu(unsigned int cpu)
+{
+ struct page **pages = per_cpu_ptr(dynamic_stack_pages, cpu);
+ int i;
+
+ for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+ if (!pages[i])
+ continue;
+ __free_page(pages[i]);
+ pages[i] = NULL;
+ }
+
+ return 0;
+}
+
+void dynamic_stack_refill_pages(void)
+{
+ struct page **pages = this_cpu_ptr(dynamic_stack_pages);
+ int i;
+
+ for (i = 0; i < DYNSTK_PAGE_POOL_NR; i++) {
+ struct page *page = pages[i];
+
+ if (page)
+ continue;
+
+ /*
+ * This is called during context switch, so we can't take any
+ * sleeping locks. As such, we need to use GFP_ATOMIC.
+ */
+ page = alloc_pages(vmap_stack_gfp(true), 0);
+ if (unlikely(!page))
+ pr_err_ratelimited("failed to refill per-cpu dynamic stack\n");
+ pages[i] = page;
+ }
+}
+
+unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize)
+{
+ struct vm_struct *vm_area = tsk->stack_vm_area;
+ unsigned long nr_accounted, i;
+
+ cant_sleep();
+
+ /* Verify enough low order bits in the page-aligned stack pointer. */
+ BUILD_BUG_ON(THREAD_PREALLOC_PAGES == 0 ||
+ PAGE_SIZE - 1 <= DYNAMIC_STACK_MAX_ACCOUNT_MASK);
+
+ nr_accounted = tsk->packed_stack & DYNAMIC_STACK_MAX_ACCOUNT_MASK;
+
+ if (nr_accounted == DYNAMIC_STACK_MAX_ACCOUNT_MASK) {
+ WARN_ON_ONCE(finalize);
+ return 0;
+ }
+
+ for (i = THREAD_PREALLOC_PAGES + nr_accounted; i < vm_area->nr_pages; i++) {
+ struct page *page = vm_area->pages[i];
+
+ int ret = memcg_kmem_charge_page(page, GFP_ATOMIC, 0);
+ /*
+ * XXX Since stack pages were already allocated, we should never
+ * fail charging. Therefore, we should probably induce force
+ * charge and oom killing if charge fails.
+ */
+ if (unlikely(ret))
+ pr_warn_ratelimited("dynamic stack: charge for allocated page failed\n");
+
+ mod_lruvec_page_state(page, NR_KERNEL_STACK_KB,
+ PAGE_SIZE / 1024);
+ }
+
+ if (finalize) {
+ tsk->packed_stack |= DYNAMIC_STACK_MAX_ACCOUNT_MASK;
+ } else {
+ tsk->packed_stack &= ~DYNAMIC_STACK_MAX_ACCOUNT_MASK;
+ tsk->packed_stack |= (i - THREAD_PREALLOC_PAGES);
+ }
+
+ return i;
+}
+
+bool noinstr dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack)
+{
+ unsigned long stack, hole_end, addr;
+ struct vm_struct *vm_area;
+ struct page *page;
+ int nr_pages;
+ pte_t *pte;
+
+ cant_sleep();
+
+ if (WARN_ON(in_nmi())) {
+ *on_stack = false;
+ return false;
+ }
+
+ /* check if address is inside the kernel stack area */
+ stack = (unsigned long)task_stack_page(tsk);
+ if (address < stack || address >= stack + THREAD_SIZE) {
+ *on_stack = false;
+ return false;
+ }
+ *on_stack = true;
+
+ vm_area = tsk->stack_vm_area;
+ if (WARN_ON_ONCE(!vm_area))
+ return false;
+
+ nr_pages = vm_area->nr_pages;
+
+ /* Check if fault address is within the stack hole */
+ hole_end = stack + THREAD_SIZE - (nr_pages << PAGE_SHIFT);
+ if (address >= hole_end)
+ return false;
+
+ /*
+ * Most likely we faulted in the page right next to the last mapped
+ * page in the stack; however, it is possible (but very unlikely) that
+ * the faulting address actually skips some pages in the stack. Make sure
+ * we do not create more than one hole in the stack, and map every
+ * page between the current fault address and the last page that is
+ * mapped in the stack.
+ */
+ address = PAGE_ALIGN_DOWN(address);
+ for (addr = hole_end - PAGE_SIZE; addr >= address; addr -= PAGE_SIZE) {
+ /* Take the next page from the per-cpu list */
+ page = dynamic_stack_get_page();
+ if (!page) {
+ instrumentation_begin();
+ pr_emerg("Failed to allocate a page during kernel_stack_fault\n");
+ instrumentation_end();
+ return false;
+ }
+
+ /* Add the new page entry to the page table */
+ pte = virt_to_kpte(addr);
+ if (!pte) {
+ instrumentation_begin();
+ pr_emerg("The PTE page table for a kernel stack is not found\n");
+ instrumentation_end();
+ return false;
+ }
+
+ /* Make sure there are no existing mappings at this address */
+ if (pte_present(*pte)) {
+ instrumentation_begin();
+ pr_emerg("The PTE contains a mapping\n");
+ instrumentation_end();
+ return false;
+ }
+ set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+
+ /* Store the new page in the stack's vm_area */
+ vm_area->pages[nr_pages] = page;
+ vm_area->nr_pages = ++nr_pages;
+ }
+
+ /* Refill the pcp stack pages during context switch */
+ tsk->flags |= PF_DYNAMIC_STACK;
+
+ return true;
+}
+
+#else /* !CONFIG_DYNAMIC_STACK */
static inline struct vm_struct *alloc_vmap_stack(int node)
{
void *stack;
- stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, GFP_VMAP_STACK,
+ stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, vmap_stack_gfp(false),
node, __builtin_return_address(0));
return stack ? find_vm_area(stack) : NULL;
@@ -284,6 +573,13 @@ static inline void free_vmap_stack(struct vm_struct *vm_area)
vfree(vm_area->addr);
}
+static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_struct *vm_area)
+{
+ tsk->stack_vm_area = vm_area;
+ tsk->stack = kasan_reset_tag(vm_area->addr);
+}
+#endif /* CONFIG_DYNAMIC_STACK */
+
static void thread_stack_free_work(struct work_struct *work)
{
struct vm_stack *vm_stack = container_of(to_rcu_work(work), struct vm_stack, work);
@@ -300,9 +596,9 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
struct vm_stack *vm_stack;
if (IS_ENABLED(CONFIG_STACK_GROWSUP))
- vm_stack = tsk->stack;
+ vm_stack = task_stack_page(tsk);
else
- vm_stack = tsk->stack + THREAD_SIZE - sizeof(*vm_stack);
+ vm_stack = task_stack_page(tsk) + THREAD_SIZE - sizeof(*vm_stack);
vm_stack->stack_vm_area = tsk->stack_vm_area;
INIT_RCU_WORK(&vm_stack->work, thread_stack_free_work);
@@ -361,14 +657,13 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
/* Reset stack metadata. */
kasan_unpoison_range(vm_area->addr, THREAD_SIZE);
- tsk->stack = kasan_reset_tag(vm_area->addr);
+ link_vmap_stack_to_task(tsk, vm_area);
/* Clear stale pointers from reused stack. */
if (!IS_ENABLED(CONFIG_STACK_GROWSUP))
memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;
- memset(tsk->stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
+ memset(task_stack_page(tsk) + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
- tsk->stack_vm_area = vm_area;
return 0;
}
@@ -380,22 +675,20 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
free_vmap_stack(vm_area);
return -ENOMEM;
}
- /*
- * We can't call find_vm_area() in interrupt context, and
- * free_thread_stack() can be called in interrupt context,
- * so cache the vm_struct.
- */
- tsk->stack_vm_area = vm_area;
- tsk->stack = kasan_reset_tag(vm_area->addr);
+ link_vmap_stack_to_task(tsk, vm_area);
return 0;
}
static void free_thread_stack(struct task_struct *tsk)
{
- if (!try_release_thread_stack_to_cache(tsk->stack_vm_area))
+ if (!try_release_thread_stack_to_cache(task_stack_vm_area(tsk)))
thread_stack_delayed_free(tsk);
+#ifdef CONFIG_DYNAMIC_STACK
+ tsk->packed_stack = 0;
+#else
tsk->stack = NULL;
+#endif
tsk->stack_vm_area = NULL;
}
@@ -498,9 +791,27 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
{
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
struct vm_struct *vm_area = task_stack_vm_area(tsk);
- int i;
+ int i, nr_accounted;
- for (i = 0; i < vm_area->nr_pages; i++)
+#ifdef CONFIG_DYNAMIC_STACK
+ /*
+ * For the exit path, resolve any pending accounting to avoid
+ * underflow. Finalize to skip accounting for any faults that
+ * happen between here and this thread's final __schedule()
+ * call in do_task_dead().
+ */
+ if (account < 0) {
+ preempt_disable();
+ nr_accounted = dynamic_stack_accounting(tsk, true);
+ preempt_enable();
+ } else {
+ nr_accounted = THREAD_PREALLOC_PAGES;
+ }
+#else
+ nr_accounted = vm_area->nr_pages;
+#endif
+
+ for (i = 0; i < nr_accounted; i++)
mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
account * (PAGE_SIZE / 1024));
} else {
@@ -901,6 +1212,16 @@ void __init fork_init(void)
NULL, free_vm_stack_cache);
#endif
+#ifdef CONFIG_DYNAMIC_STACK
+ cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:dynamic_stack",
+ dynamic_stack_refill_pages_cpu,
+ dynamic_stack_free_pages_cpu);
+ /*
+ * Fill the dynamic stack pages for the boot CPU, others will be filled
+ * as CPUs are onlined.
+ */
+ dynamic_stack_refill_pages_cpu(smp_processor_id());
+#endif
scs_init();
lockdep_init_task(&init_task);
@@ -914,6 +1235,7 @@ int __weak arch_dup_task_struct(struct task_struct *dst,
return 0;
}
+#ifndef CONFIG_DYNAMIC_STACK
void set_task_stack_end_magic(struct task_struct *tsk)
{
unsigned long *stackend;
@@ -921,6 +1243,7 @@ void set_task_stack_end_magic(struct task_struct *tsk)
stackend = end_of_stack(tsk);
*stackend = STACK_END_MAGIC; /* for overflow detection */
}
+#endif
static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
{
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 496dff740dca..417269a86973 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6783,6 +6783,7 @@ static void __sched notrace __schedule(int sched_mode)
rq = cpu_rq(cpu);
prev = rq->curr;
+ dynamic_stack(prev);
schedule_debug(prev, preempt);
if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 08/13] task_stack.h: Add stack_not_used() support for dynamic stack
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (6 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 07/13] fork: Dynamic Kernel Stacks David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 09/13] fork: Dynamic Kernel Stack accounting David Stevens
` (5 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
From: Pasha Tatashin <pasha.tatashin@soleen.com>
CONFIG_DEBUG_STACK_USAGE is enabled by default on most architectures.
Its purpose is to determine and print the maximum stack depth on thread
exit.
It works by starting from the bottom of the stack and searching for the
first non-zero word. With dynamic stacks this does not work well, as it
would fault in every page of every stack.
Instead, add a specific version of stack_not_used() for dynamic stacks
which, instead of starting from the bottom of the stack, starts from the
lowest page that is actually mapped.
In addition to avoiding unnecessary page faults, this also optimizes the
search by skipping the pages that were never faulted in.
Also, because a dynamic stack does not end with STACK_END_MAGIC, there is
no need to skip the bottom-most word of the stack.
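For example, assuming 4K pages and a 16K THREAD_SIZE, a dynamic stack with a
single mapped page starts the search at the last page instead of at the stack
base (sketch of the hunk below):
    alloc_size = vm_area->nr_pages << PAGE_SHIFT;             /* 1 << 12 = 4K */
    n = (unsigned long *)(stack + THREAD_SIZE - alloc_size);  /* stack + 12K  */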
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Rebased, Kasan oneliner needed preserving, rewrote a bit due to bugs]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[Handle init_task's use of init_stack, fix typos]
Signed-off-by: David Stevens <stevensd@google.com>
---
arch/Kconfig | 1 -
kernel/exit.c | 22 ++++++++++++++++++++++
2 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 95ded79f0825..beffe7e01296 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1542,7 +1542,6 @@ config DYNAMIC_STACK
depends on VMAP_STACK
depends on INIT_STACK_ALL_ZERO || INIT_STACK_ALL_PATTERN
depends on !KASAN
- depends on !DEBUG_STACK_USAGE
depends on !STACK_GROWSUP
depends on !PREEMPT_RT
help
diff --git a/kernel/exit.c b/kernel/exit.c
index ede3117fa7d4..6caf4030e8f4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -71,6 +71,7 @@
#include <linux/unwind_deferred.h>
#include <linux/uaccess.h>
#include <linux/pidfs.h>
+#include <linux/vmalloc.h>
#include <uapi/linux/wait.h>
@@ -791,6 +792,26 @@ unsigned long stack_not_used(struct task_struct *p)
return (unsigned long)end_of_stack(p) - (unsigned long)n;
}
#else /* !CONFIG_STACK_GROWSUP */
+#ifdef CONFIG_DYNAMIC_STACK
+unsigned long stack_not_used(struct task_struct *p)
+{
+ struct vm_struct *vm_area = task_stack_vm_area(p);
+ unsigned long stack = (unsigned long)task_stack_page(p);
+ unsigned long alloc_size, *n;
+
+ /* This is NULL only for init_task, where init_stack is fully allocated. */
+ if (likely(vm_area))
+ alloc_size = vm_area->nr_pages << PAGE_SHIFT;
+ else
+ alloc_size = THREAD_SIZE;
+ n = (unsigned long *)(stack + THREAD_SIZE - alloc_size);
+
+ while (!*n)
+ n++;
+
+ return (unsigned long)n - stack;
+}
+#else
unsigned long stack_not_used(struct task_struct *p)
{
unsigned long *n = end_of_stack(p);
@@ -801,6 +822,7 @@ unsigned long stack_not_used(struct task_struct *p)
return (unsigned long)n - (unsigned long)end_of_stack(p);
}
+#endif /* CONFIG_DYNAMIC_STACK */
#endif /* CONFIG_STACK_GROWSUP */
/* Count the maximum pages reached in kernel stacks */
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 09/13] fork: Dynamic Kernel Stack accounting
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (7 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 08/13] task_stack.h: Add stack_not_used() support for dynamic stack David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes David Stevens
` (4 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
From: Pasha Tatashin <pasha.tatashin@soleen.com>
Add accounting of the number of stack pages that have been faulted in
and are currently in use.
Example use case:
$ cat /proc/vmstat | grep stack
nr_kernel_stack 18684
nr_dynamic_stacks_faults 156
The above shows that kernel stacks use a total of 18684 KiB, of which
156 KiB were faulted in.
Given that the pre-allocated part of each stack is 4 KiB, we can determine
the total number of tasks:
tasks = (nr_kernel_stack - nr_dynamic_stacks_faults) / 4 = 4632
The amount of kernel stack memory without dynamic stacks on this machine
would be:
4632 * 16 KiB = 74,112 KiB
Therefore, in this example dynamic stacks save 74,112 KiB - 18,684 KiB =
55,428 KiB.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[Rebased]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[add to memcg stats, fix typos]
Signed-off-by: David Stevens <stevensd@google.com>
---
include/linux/mmzone.h | 3 +++
kernel/fork.c | 12 +++++++++++-
mm/memcontrol.c | 10 ++++++++++
mm/vmstat.c | 3 +++
4 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4..4458fa7016a1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -221,6 +221,9 @@ enum node_stat_item {
NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */
NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */
NR_KERNEL_STACK_KB, /* measured in KiB */
+#ifdef CONFIG_DYNAMIC_STACK
+ NR_DYNAMIC_STACKS_FAULTS_KB, /* KiB of faulted kernel stack memory */
+#endif
#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
NR_KERNEL_SCS_KB, /* measured in KiB */
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index e615ef736dc0..9ac9d23f5f4b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -463,6 +463,8 @@ unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize)
mod_lruvec_page_state(page, NR_KERNEL_STACK_KB,
PAGE_SIZE / 1024);
+ mod_lruvec_page_state(page, NR_DYNAMIC_STACKS_FAULTS_KB,
+ PAGE_SIZE / 1024);
}
if (finalize) {
@@ -811,9 +813,17 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
nr_accounted = vm_area->nr_pages;
#endif
- for (i = 0; i < nr_accounted; i++)
+ for (i = 0; i < nr_accounted; i++) {
mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
account * (PAGE_SIZE / 1024));
+#ifdef CONFIG_DYNAMIC_STACK
+ if (i >= THREAD_PREALLOC_PAGES) {
+ mod_lruvec_page_state(vm_area->pages[i],
+ NR_DYNAMIC_STACKS_FAULTS_KB,
+ account * (PAGE_SIZE / 1024));
+ }
+#endif
+ }
} else {
void *stack = task_stack_page(tsk);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 772bac21d155..cd2195a735ab 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -318,6 +318,9 @@ static const unsigned int memcg_node_stat_items[] = {
NR_FILE_THPS,
NR_ANON_THPS,
NR_KERNEL_STACK_KB,
+#ifdef CONFIG_DYNAMIC_STACK
+ NR_DYNAMIC_STACKS_FAULTS_KB,
+#endif
NR_PAGETABLE,
NR_SECONDARY_PAGETABLE,
#ifdef CONFIG_SWAP
@@ -1403,6 +1406,10 @@ static const struct memory_stat memory_stats[] = {
#ifdef CONFIG_NUMA_BALANCING
{ "pgpromote_success", PGPROMOTE_SUCCESS },
#endif
+
+#ifdef CONFIG_DYNAMIC_STACK
+ { "dynamic_stack_faults", NR_DYNAMIC_STACKS_FAULTS_KB },
+#endif
};
/* The actual unit of the state item, not the same as the output unit */
@@ -1415,6 +1422,9 @@ static int memcg_page_state_unit(int item)
case NR_SLAB_UNRECLAIMABLE_B:
return 1;
case NR_KERNEL_STACK_KB:
+#ifdef CONFIG_DYNAMIC_STACK
+ case NR_DYNAMIC_STACKS_FAULTS_KB:
+#endif
return SZ_1K;
default:
return PAGE_SIZE;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 86b14b0f77b5..8fa1c7bcbaea 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1256,6 +1256,9 @@ const char * const vmstat_text[] = {
[I(NR_FOLL_PIN_ACQUIRED)] = "nr_foll_pin_acquired",
[I(NR_FOLL_PIN_RELEASED)] = "nr_foll_pin_released",
[I(NR_KERNEL_STACK_KB)] = "nr_kernel_stack",
+#ifdef CONFIG_DYNAMIC_STACK
+ [I(NR_DYNAMIC_STACKS_FAULTS_KB)] = "nr_dynamic_stacks_faults",
+#endif
#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
[I(NR_KERNEL_SCS_KB)] = "nr_shadow_call_stack",
#endif
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
* [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (8 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 09/13] fork: Dynamic Kernel Stack accounting David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 11/13] x86/entry/fred: encode frame pointer on entry David Stevens
` (3 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
Store the task pointer in the ptes of the unpopulated pages of dynamic
stacks, to allow the vm_struct pointer to be retrieved without relying
on any locks or on current.
This relies on being able to pack the struct task_struct pointer into a
pte. Since the struct is 64-byte aligned, that gives 5 bits of leeway,
which should be viable on most architectures. Any architecture which
enables dynamic kernel stacks must provide make_data_kpte() and
unpack_data_kpte(), which pack/unpack a right-shifted pointer value
into/from a pte.
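The round trip looks as follows (sketch; make_data_kpte()/unpack_data_kpte()
are the arch-provided helpers described above, used as in the hunks below):
    /* Stash the task pointer in a not-present pte of an unpopulated stack page: */
    pte = make_data_kpte(((unsigned long)tsk) >> TASK_PTR_SHIFT);
    set_pte_at(&init_mm, addr, ptep, pte);

    /* Later, recover it from a pte that is neither present nor none: */
    tsk = (struct task_struct *)(unpack_data_kpte(*ptep) << TASK_PTR_SHIFT);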
Signed-off-by: David Stevens <stevensd@google.com>
---
include/linux/sched/task_stack.h | 1 +
kernel/fork.c | 74 +++++++++++++++++++++++++++++---
mm/vmalloc.c | 2 +-
3 files changed, 69 insertions(+), 8 deletions(-)
diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 7dcff2836d7e..7cf00ce97f7c 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -105,6 +105,7 @@ void exit_task_stack_account(struct task_struct *tsk);
void dynamic_stack_refill_pages(void);
unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize);
bool dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack);
+struct task_struct *task_from_stack_address(unsigned long address);
/*
* Refill and charge for the used pages.
diff --git a/kernel/fork.c b/kernel/fork.c
index 9ac9d23f5f4b..733fc1f58b8b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -296,16 +296,40 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
static DEFINE_PER_CPU(struct page *, dynamic_stack_pages[DYNSTK_PAGE_POOL_NR]);
+#define TASK_PTR_SHIFT (ilog2(__alignof__(struct task_struct)))
+
static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_struct *vm_area)
{
+ int i;
+ unsigned long addr;
+ pte_t *ptep, pte;
+
+ pte = make_data_kpte(((unsigned long)tsk) >> TASK_PTR_SHIFT);
+
tsk->stack_vm_area = vm_area;
tsk->packed_stack = (unsigned long)kasan_reset_tag(vm_area->addr);
+
+ addr = (unsigned long)vm_area->addr;
+ ptep = virt_to_kpte(addr);
+ for (i = vm_area->nr_pages; i < THREAD_SIZE >> PAGE_SHIFT;
+ i++, addr += PAGE_SIZE, ptep++)
+ set_pte_at(&init_mm, addr, ptep, pte);
}
-static void free_vmap_stack(struct vm_struct *vm_area)
+static void free_vmap_stack(struct vm_struct *vm_area, bool was_mapped)
{
int i;
+ /* Clear data kptes since vunmap expects present or none. */
+ if (was_mapped) {
+ unsigned long addr = (unsigned long)vm_area->addr;
+ pte_t *ptep = virt_to_kpte(addr);
+ unsigned int nr_to_clear = (THREAD_SIZE >> PAGE_SHIFT) - vm_area->nr_pages;
+
+ if (nr_to_clear)
+ clear_ptes(&init_mm, addr, ptep, nr_to_clear);
+ }
+
remove_vm_area(vm_area->addr);
for (i = 0; i < vm_area->nr_pages; i++)
@@ -354,7 +378,7 @@ static struct vm_struct *alloc_vmap_stack(int node)
return vm_area;
cleanup_err:
- free_vmap_stack(vm_area);
+ free_vmap_stack(vm_area, false);
return NULL;
}
@@ -477,6 +501,42 @@ unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool finalize)
return i;
}
+noinstr struct task_struct *task_from_stack_address(unsigned long address)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ BUILD_BUG_ON((BITS_PER_LONG - TASK_PTR_SHIFT) > KPTE_AVAILABLE_DATA_BITS);
+
+ if (!is_vmalloc_addr((void *)address))
+ return NULL;
+
+ pgd = pgd_offset_k(address);
+ if (pgd_none(*pgd) || pgd_leaf(*pgd))
+ return NULL;
+
+ p4d = p4d_offset(pgd, address);
+ if (p4d_none(*p4d) || p4d_leaf(*p4d))
+ return NULL;
+
+ pud = pud_offset(p4d, address);
+ if (pud_none(*pud) || pud_leaf(*pud))
+ return NULL;
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd) || pmd_leaf(*pmd))
+ return NULL;
+
+ pte = pte_offset_kernel(pmd, address);
+ if (pte_present(*pte) || pte_none(*pte))
+ return NULL;
+
+ return (struct task_struct *)(unpack_data_kpte(*pte) << TASK_PTR_SHIFT);
+}
+
bool noinstr dynamic_stack_fault(struct task_struct *tsk, unsigned long address, bool *on_stack)
{
unsigned long stack, hole_end, addr;
@@ -570,7 +630,7 @@ static inline struct vm_struct *alloc_vmap_stack(int node)
return stack ? find_vm_area(stack) : NULL;
}
-static inline void free_vmap_stack(struct vm_struct *vm_area)
+static inline void free_vmap_stack(struct vm_struct *vm_area, bool was_mapped)
{
vfree(vm_area->addr);
}
@@ -590,7 +650,7 @@ static void thread_stack_free_work(struct work_struct *work)
if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
return;
- free_vmap_stack(vm_area);
+ free_vmap_stack(vm_area, true);
}
static void thread_stack_delayed_free(struct task_struct *tsk)
@@ -618,7 +678,7 @@ static int free_vm_stack_cache(unsigned int cpu)
if (!vm_area)
continue;
- free_vmap_stack(vm_area);
+ free_vmap_stack(vm_area, true);
cached_vm_stack_areas[i] = NULL;
}
@@ -653,7 +713,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
unsigned long memset_offset = 0;
if (memcg_charge_kernel_stack(vm_area)) {
- free_vmap_stack(vm_area);
+ free_vmap_stack(vm_area, true);
return -ENOMEM;
}
@@ -674,7 +734,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
return -ENOMEM;
if (memcg_charge_kernel_stack(vm_area)) {
- free_vmap_stack(vm_area);
+ free_vmap_stack(vm_area, true);
return -ENOMEM;
}
link_vmap_stack_to_task(tsk, vm_area);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 39b7e118cbce..76955c101180 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -76,7 +76,7 @@ early_param("nohugevmalloc", set_nohugevmalloc);
static const bool vmap_allow_huge = false;
#endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
-bool is_vmalloc_addr(const void *x)
+noinstr bool is_vmalloc_addr(const void *x)
{
unsigned long addr = (unsigned long)kasan_reset_tag(x);
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
^ permalink raw reply related [flat|nested] 21+ messages in thread* [PATCH v2 11/13] x86/entry/fred: encode frame pointer on entry
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (9 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 12/13] x86: Add support for dynamic kernel stacks via FRED David Stevens
` (2 subsequent siblings)
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
Add the missing ENCODE_FRAME_POINTER macro invocation to the FRED_ENTER
macro, to prevent the unwinder from encountering a NULL stack frame
pointer when CONFIG_UNWINDER_FRAME_POINTER is enabled.
Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code")
Signed-off-by: David Stevens <stevensd@google.com>
---
arch/x86/entry/entry_64_fred.S | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 894f7f16eb80..119b8214748e 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -7,6 +7,7 @@
#include <linux/kvm_types.h>
#include <asm/asm.h>
+#include <asm/frame.h>
#include <asm/fred.h>
#include <asm/segment.h>
@@ -19,6 +20,7 @@
UNWIND_HINT_END_OF_STACK
ANNOTATE_NOENDBR
PUSH_AND_CLEAR_REGS
+ ENCODE_FRAME_POINTER
movq %rsp, %rdi /* %rdi -> pt_regs */
.endm
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
^ permalink raw reply related [flat|nested] 21+ messages in thread* [PATCH v2 12/13] x86: Add support for dynamic kernel stacks via FRED
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (10 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 11/13] x86/entry/fred: encode frame pointer on entry David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:14 ` [PATCH v2 13/13] x86: Add support for dynamic kernel stacks via IST David Stevens
2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
Add support for dynamic kernel stack faults by handling #PFs from CPL 0
on stack level 1. Since we can't sleep while on a per-CPU stack, any
page faults that didn't originate in an atomic context need to be
bounced back to the originating stack.
With dynamic kernel stacks, the processor pushing data onto the kernel
thread stack can cause a page fault. The SDM says in the #DF section
that the processor should be able to handle these exceptions serially.
However, this does not seem to actually be handled reliably.
With KVM, I've observed timer interrupts being dropped. The corresponding bit
in VIRR is cleared and the ISR bit in the APIC is set before the #PF is
delivered, but the interrupt handler is not invoked after the kernel
stack fault is resolved. On bare metal, I've observed frequent hangs due
to threads getting stuck on folio_wait_bit_common. I haven't traced this
to an exact interrupt being lost, but moving interrupts to stack level 1
reduces boot failures from >10% to 0 in 1000s of attempts.
To work around this, external interrupts are also moved to stack level
1, and unconditionally bounced back to the originating stack.
Bouncing page faults and external interrupts through stack level 1 while
in CPL 0 adds a small but non-trivial overhead to those paths. The
shared entry point for events received in CPL 0 also becomes slightly
more expensive, due to the need to detect page faults and external
interrupts.
Since enabling HAVE_ARCH_DYNAMIC_STACK requires unconditional support,
the config is enabled in the next patch, which adds dynamic stack
support for traditional interrupt delivery.
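
For orientation, here is a rough, standalone C sketch of the decision the
kernel-mode entry path below ends up making (illustrative only: the enum
names, the function name and the main() driver are invented for this
sketch; the boolean parameters stand in for the checks performed in
entry_64_fred.S and handle_dynamic_stack_kernel_faults()):

#include <stdbool.h>
#include <stdio.h>

/* Where a CPL-0 event ends up being handled (names invented for this sketch). */
enum dispatch {
	RESUME_ONLY,		/* dynamic stack fault fixed up, just ERETS back */
	HANDLE_IN_PLACE,	/* run the handler on the stack we arrived on */
	HANDLE_ON_TASK_STACK,	/* copy the frame and switch to the task stack */
};

static enum dispatch kernel_event_dispatch(bool is_page_fault,
					   bool is_external_irq,
					   bool resolved_stack_fault,
					   bool atomic_context)
{
	if (is_page_fault) {
		if (resolved_stack_fault)
			return RESUME_ONLY;
		if (atomic_context)
			return HANDLE_IN_PLACE;		/* the fault handler won't sleep */
		return HANDLE_ON_TASK_STACK;		/* bounce back before it may sleep */
	}
	if (is_external_irq)
		return HANDLE_ON_TASK_STACK;		/* always bounced, as described above */
	return HANDLE_IN_PLACE;				/* everything else: fred_entry_from_kernel() */
}

int main(void)
{
	/* A kernel #PF that is not a stack fault, taken from a sleepable context: */
	printf("%d\n", kernel_event_dispatch(true, false, false, false));
	return 0;
}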
Signed-off-by: David Stevens <stevensd@google.com>
---
arch/x86/entry/entry_64_fred.S | 55 +++++++++++++++++++++++++++++++
arch/x86/include/asm/pgtable_64.h | 36 ++++++++++++++++++++
arch/x86/include/asm/traps.h | 5 +++
arch/x86/kernel/fred.c | 20 ++++++++---
arch/x86/mm/dump_pagetables.c | 14 +++++---
arch/x86/mm/fault.c | 53 +++++++++++++++++++++++++++++
6 files changed, 174 insertions(+), 9 deletions(-)
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 119b8214748e..7202655ef662 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -54,7 +54,62 @@ SYM_CODE_END(asm_fred_entrypoint_user)
.org asm_fred_entrypoint_user + 256, 0xcc
SYM_CODE_START_NOALIGN(asm_fred_entrypoint_kernel)
FRED_ENTER
+
+#ifdef CONFIG_DYNAMIC_STACK
+ /* Extract event type and vector from augmented SS. */
+ movl (SS + 4)(%rsp), %esi
+ andl $0x000f00ff, %esi
+
+ /* Check if event type is hardware exception and vector is #PF. */
+ cmpl $0x0003000e, %esi
+ jne .Lcheck_for_extint
+
+ call handle_dynamic_stack_kernel_faults
+ testq %rax, %rax
+ jz .Lentrypoint_done
+ cmpq %rax, %rsp
+ je .Lskip_stack_switch
+ jmp .Ldo_stack_switch
+
+.Lcheck_for_extint:
+ /* Check if event type is external interrupt. */
+ andl $0xf0000, %esi
+ testl %esi, %esi
+ jne .Lcall_primary_entry
+ call switch_to_kstack
+
+.Ldo_stack_switch:
+#ifdef CONFIG_DEBUG_ENTRY
+ /*
+ * We should only do a stack switch for an external interrupt or a page
+ * fault in a non-atomic context. These should only ever happen in user
+ * space or from a regular kernel stack (i.e. CSL == 0).
+ */
+ movw (CS + 2)(%rsp), %si
+ testw $0x3, %si
+ jz .Lcsl_ok
+ ud2
+.Lcsl_ok:
+#endif
+ movq %rax, %rsp
+
+ UNWIND_HINT_REGS
+ ENCODE_FRAME_POINTER
+
+ mov $MSR_IA32_FRED_CONFIG, %ecx
+ rdmsr
+ andl $~0x3, %eax
+ wrmsr
+
+ movq %rsp, %rdi
+#endif
+
+.Lskip_stack_switch:
+ movq %rsp, %rdi
+.Lcall_primary_entry:
call fred_entry_from_kernel
+
+.Lentrypoint_done:
FRED_EXIT
ERETS
SYM_CODE_END(asm_fred_entrypoint_kernel)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index ce45882ccd07..fbb042c89d13 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -237,6 +237,42 @@ static inline void native_pgd_clear(pgd_t *pgd)
#define __swp_entry_to_pte(x) (__pte((x).val))
#define __swp_entry_to_pmd(x) (__pmd((x).val))
+#ifdef CONFIG_DYNAMIC_STACK
+
+/*
+ * Skip the present bit. And skip dirty and accessed bits due to
+ * an erratum where they can be incorrectly set on non-present ptes.
+ *
+ * Also skip bit 8, which is used for pte_present for PROT_NONE. This
+ * isn't necessary in the strictest sense since PROT_NONE doesn't apply
+ * to kernel PTEs, but it's easier to let pte_present just continue
+ * to work.
+ */
+#define KPTE_AVAILABLE_DATA_BITS 58
+
+static inline pte_t make_data_kpte(unsigned long val)
+{
+ unsigned long low_part, mid_part, high_part;
+
+ low_part = (val & 0xf) << 1;
+ mid_part = (val & 0x10) << 3;
+ high_part = (val & ~0x1f) << 4;
+
+ return __pte(low_part | mid_part | high_part);
+}
+
+static inline unsigned long unpack_data_kpte(pte_t pte)
+{
+ unsigned long val = pte_val(pte), high_part, mid_part, low_part;
+
+ low_part = (val >> 1) & 0xf;
+ mid_part = (val >> 3) & 0x10;
+ high_part = (val >> 4) & ~0x1f;
+
+ return low_part | mid_part | high_part;
+}
+#endif /* CONFIG_DYNAMIC_STACK */
+
extern void cleanup_highmap(void);
#define HAVE_ARCH_UNMAPPED_AREA
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 3f24cc472ce9..6b55eb91aea6 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -15,6 +15,11 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
asmlinkage __visible notrace
struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs);
asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
+
+#ifdef CONFIG_DYNAMIC_STACK
+asmlinkage __visible noinstr unsigned long switch_to_kstack(struct pt_regs *regs);
+asmlinkage __visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_regs *regs);
+#endif
#endif
extern int ibt_selftest(void);
diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
index e736b19e18de..01d727420d1f 100644
--- a/arch/x86/kernel/fred.c
+++ b/arch/x86/kernel/fred.c
@@ -9,6 +9,8 @@
/* #DB in the kernel would imply the use of a kernel debugger. */
#define FRED_DB_STACK_LEVEL 1UL
+#define FRED_PF_STACK_LEVEL 1UL
+#define FRED_INT_STACK_LEVEL 1UL
#define FRED_NMI_STACK_LEVEL 2UL
#define FRED_MC_STACK_LEVEL 2UL
/*
@@ -25,6 +27,11 @@
DEFINE_PER_CPU(unsigned long, fred_rsp0);
EXPORT_PER_CPU_SYMBOL(fred_rsp0);
+#define FRED_CONFIG_VAL(int_stklvl) \
+ (FRED_CONFIG_REDZONE /* Reserve for CALL emulation */ | \
+ FRED_CONFIG_INT_STKLVL(int_stklvl) | \
+ FRED_CONFIG_ENTRYPOINT(asm_fred_entrypoint_user))
+
void cpu_init_fred_exceptions(void)
{
/* When FRED is enabled by default, remove this log message */
@@ -44,11 +51,7 @@ void cpu_init_fred_exceptions(void)
*/
loadsegment(ss, __KERNEL_DS);
- wrmsrq(MSR_IA32_FRED_CONFIG,
- /* Reserve for CALL emulation */
- FRED_CONFIG_REDZONE |
- FRED_CONFIG_INT_STKLVL(0) |
- FRED_CONFIG_ENTRYPOINT(asm_fred_entrypoint_user));
+ wrmsrq(MSR_IA32_FRED_CONFIG, FRED_CONFIG_VAL(0));
wrmsrq(MSR_IA32_FRED_STKLVLS, 0);
@@ -84,8 +87,15 @@ void cpu_init_fred_rsps(void)
FRED_STKLVL(X86_TRAP_DB, FRED_DB_STACK_LEVEL) |
FRED_STKLVL(X86_TRAP_NMI, FRED_NMI_STACK_LEVEL) |
FRED_STKLVL(X86_TRAP_MC, FRED_MC_STACK_LEVEL) |
+#ifdef CONFIG_DYNAMIC_STACK
+ FRED_STKLVL(X86_TRAP_PF, FRED_PF_STACK_LEVEL) |
+#endif
FRED_STKLVL(X86_TRAP_DF, FRED_DF_STACK_LEVEL));
+#ifdef CONFIG_DYNAMIC_STACK
+ wrmsrq(MSR_IA32_FRED_CONFIG, FRED_CONFIG_VAL(FRED_INT_STACK_LEVEL));
+#endif
+
/* The FRED equivalents to IST stacks... */
wrmsrq(MSR_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
wrmsrq(MSR_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 2afa7a23340e..5c33c33e93fe 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -306,11 +306,17 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
static const char units[] = "BKMGTPE";
struct seq_file *m = st->seq;
- new_prot = val & PTE_FLAGS_MASK;
- if (!val)
+ /* Ignore prot/eff from data kptes. */
+ if (val & _PAGE_PRESENT || addr < address_markers[KERNEL_SPACE_NR].start_address) {
+ new_prot = val & PTE_FLAGS_MASK;
+ if (!val)
+ new_eff = 0;
+ else
+ new_eff = st->prot_levels[level];
+ } else {
+ new_prot = 0;
new_eff = 0;
- else
- new_eff = st->prot_levels[level];
+ }
/*
* If we have a "break" in the series, we need to flush the state that
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b83a06739b51..40d518d9f562 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1480,6 +1480,59 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
local_irq_disable();
}
+#ifdef CONFIG_DYNAMIC_STACK
+
+static noinstr unsigned long copy_stack_data(struct pt_regs *regs)
+{
+ unsigned long new_sp;
+ unsigned long data_len;
+
+ new_sp = regs->sp - (FRED_CONFIG_REDZONE_AMOUNT << 6);
+ new_sp &= FRED_STACK_FRAME_RSP_MASK;
+ data_len = sizeof(struct fred_frame);
+ new_sp -= data_len;
+
+ memcpy((void *)new_sp, regs, data_len);
+
+ return new_sp;
+}
+
+__visible noinstr unsigned long switch_to_kstack(struct pt_regs *regs)
+{
+ return copy_stack_data(regs);
+}
+
+#define ALIGN_TO_STACK(addr) ((addr) & ~(THREAD_ALIGN - 1))
+
+__visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_regs *regs)
+{
+ unsigned long address;
+ struct task_struct *tsk;
+ bool on_stack;
+
+ address = fred_event_data(regs);
+ if (fault_in_kernel_space(address) && !in_nmi()) {
+ tsk = task_from_stack_address(address);
+
+ if (tsk && dynamic_stack_fault(tsk, address, &on_stack)) {
+ WARN_ON_ONCE(tsk != current &&
+ ALIGN_TO_STACK(regs->sp) != ALIGN_TO_STACK(address));
+ return 0;
+ }
+ }
+
+ /*
+ * The regular fault handler won't sleep when executing in an
+ * atomic context, so we can complete the #PF directly on the
+ * #PF stack.
+ */
+ if (in_atomic())
+ return (unsigned long)regs;
+ else
+ return copy_stack_data(regs);
+}
+#endif
+
DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
irqentry_state_t state;
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
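
As a side note on the make_data_kpte()/unpack_data_kpte() helpers added
above: the shifts place the payload around PTE bit 0 (present), bits 5/6
(accessed/dirty) and bit 8. A minimal standalone userspace round-trip
check of that packing (illustrative only; it re-implements the same
shifts on plain unsigned long values instead of pte_t):

#include <assert.h>
#include <stdio.h>

static unsigned long pack(unsigned long val)
{
	unsigned long low_part  = (val & 0xf) << 1;
	unsigned long mid_part  = (val & 0x10) << 3;
	unsigned long high_part = (val & ~0x1fUL) << 4;

	return low_part | mid_part | high_part;
}

static unsigned long unpack(unsigned long pte)
{
	unsigned long low_part  = (pte >> 1) & 0xf;
	unsigned long mid_part  = (pte >> 3) & 0x10;
	unsigned long high_part = (pte >> 4) & ~0x1fUL;

	return low_part | mid_part | high_part;
}

int main(void)
{
	/* 0x3ffffffffffffff is the largest 58-bit payload (KPTE_AVAILABLE_DATA_BITS) */
	unsigned long vals[] = { 1, 0x1f, 0xdeadbeefUL, 0x3ffffffffffffffUL };
	/* present, accessed, dirty and bit 8 must never be set by the packing */
	unsigned long skipped = (1UL << 0) | (1UL << 5) | (1UL << 6) | (1UL << 8);

	for (unsigned int i = 0; i < sizeof(vals) / sizeof(vals[0]); i++) {
		unsigned long pte = pack(vals[i]);

		assert(!(pte & skipped));
		assert(unpack(pte) == vals[i]);
	}
	printf("round trip ok\n");
	return 0;
}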
^ permalink raw reply related [flat|nested] 21+ messages in thread* [PATCH v2 13/13] x86: Add support for dynamic kernel stacks via IST
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (11 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 12/13] x86: Add support for dynamic kernel stacks via FRED David Stevens
@ 2026-04-24 19:14 ` David Stevens
2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
13 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel, linux-mm
On hardware that doesn't support FRED, use ISTs to support dynamic
kernel stacks. In the same way as we do when using FRED, any regular #PF
gets manually moved back onto the original stack. Additionally, we take
a similar approach to FRED to avoid issues with interrupt re-delivery and
handle external interrupts on an IST stack.
The fact that IST stacks aren't reentrant means we have to be very
careful to avoid triggering a #PF while the #PF IST is being used. Since
NMIs can trigger #PFs, we have the NMI handler temporarily install a
secondary #PF IST stack if it detects it came from the #PF IST stack, to
avoid clobbering that stack. Note that although iret unmasking of NMIs
can cause us to get a second NMI while an NMI is on the #PF IST stack,
the actual handling of that secondary NMI will be delayed until after
the original NMI (and thus the #PF) is resolved. As such, one extra #PF
IST stack is sufficient to resolve reentrancy issues with respect to
NMIs.
For #DB exceptions, we make sure that all code that executes on the #PF
IST stack is noinstr. Unfortunately this is not 100% bulletproof, since
the handler needs to access data outside of cpu_entry_area (e.g.
current, current's stack, vmap stack page tables), and the user could
have set hardware breakpoints on accesses to those addresses. Rather
than handle this edge case, which should only occur during manual
debugging, we just detect reentrancy on the #PF IST stack and abort.
It is possible for #MCE to occur on the #PF IST stack, but the #MCE
handler shouldn't generate new #PFs. The reentrancy check on the #PF
stack will trigger if any recoverable #MCEs do generate #PFs - if there
are actually reports of it happening, we can address it then.
Bouncing all #PF and external interrupts through IST stacks adds some
overhead. However, such events from userspace already had to bounce
through the CPU entry stack, so introducing ISTs only adds notable
overhead for #PFs and external interrupts that occur while in CPL 0.
Signed-off-by: David Stevens <stevensd@google.com>
---
arch/x86/Kconfig | 1 +
arch/x86/entry/entry_64.S | 49 +++++++++++++++++--
arch/x86/include/asm/cpu_entry_area.h | 18 +++++++
arch/x86/include/asm/idtentry.h | 38 ++++++++++++++-
arch/x86/include/asm/page_64_types.h | 10 +++-
arch/x86/include/asm/processor.h | 6 +++
arch/x86/kernel/cpu/common.c | 11 +++++
arch/x86/kernel/dumpstack_64.c | 10 +++-
arch/x86/kernel/idt.c | 57 +++++++++++++---------
arch/x86/kernel/nmi.c | 9 ++++
arch/x86/lib/usercopy.c | 9 ++++
arch/x86/mm/cpu_entry_area.c | 17 +++++++
arch/x86/mm/fault.c | 70 ++++++++++++++++++++++-----
13 files changed, 262 insertions(+), 43 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e2df1b147184..182fda721b0d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -212,6 +212,7 @@ config X86
select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD
select HAVE_ARCH_USERFAULTFD_MINOR if X86_64 && USERFAULTFD
select HAVE_ARCH_VMAP_STACK if X86_64
+ select HAVE_ARCH_DYNAMIC_STACK if X86_64 && !XEN_PV
select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
select HAVE_ARCH_WITHIN_STACK_FRAMES
select HAVE_ASM_MODVERSIONS
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 42447b1e1dff..02dbd00cc4bb 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -286,7 +286,7 @@ SYM_CODE_END(xen_error_entry)
* @cfunc: C function to be called
* @has_error_code: Hardware pushed error code on stack
*/
-.macro idtentry_body cfunc has_error_code:req
+.macro idtentry_body cfunc has_error_code:req kernel_reentry_fn=
/*
* Call error_entry() and switch to the task stack if from userspace.
@@ -302,6 +302,38 @@ SYM_CODE_END(xen_error_entry)
ENCODE_FRAME_POINTER
UNWIND_HINT_REGS
+#ifdef CONFIG_DYNAMIC_STACK
+.ifnb \kernel_reentry_fn
+ /*
+ * For entry from userspace, we've also already moved off of
+ * the IST after calling error_entry above.
+ */
+ testb $3, CS(%rsp)
+ jnz .Lregular_fault_\cfunc
+
+ /* Check and set the reentry canary reserved by IST_ENTRY_OFFSET. */
+ cmpq $0, (SS + 8)(%rsp)
+ jne .List_reentry_abort_\cfunc
+ movq $1, (SS + 8)(%rsp)
+
+ movq %rsp, %rdi
+ call \kernel_reentry_fn
+
+ movq $0, (SS + 8)(%rsp)
+
+ testq %rax, %rax
+ jnz .Lchange_stack_\cfunc
+ jmp error_return
+
+.Lchange_stack_\cfunc:
+ movq %rax, %rsp
+
+ ENCODE_FRAME_POINTER
+ UNWIND_HINT_REGS
+.Lregular_fault_\cfunc:
+.endif
+#endif
+
movq %rsp, %rdi /* pt_regs pointer into 1st argument*/
.if \has_error_code == 1
@@ -314,6 +346,13 @@ SYM_CODE_END(xen_error_entry)
call \cfunc
jmp error_return
+
+#ifdef CONFIG_DYNAMIC_STACK
+.ifnb \kernel_reentry_fn
+.List_reentry_abort_\cfunc:
+ ud2
+.endif
+#endif
.endm
/**
@@ -322,11 +361,13 @@ SYM_CODE_END(xen_error_entry)
* @asmsym: ASM symbol for the entry point
* @cfunc: C function to be called
* @has_error_code: Hardware pushed error code on stack
+ * @kernel_reentry_fn: If set, C function to be called on re-entry from
+ * kernel space before the main handler is invoked.
*
* The macro emits code to set up the kernel context for straight forward
* and simple IDT entries. No IST stack, no paranoid entry checks.
*/
-.macro idtentry vector asmsym cfunc has_error_code:req
+.macro idtentry vector asmsym cfunc has_error_code:req kernel_reentry_fn=
SYM_CODE_START(\asmsym)
.if \vector == X86_TRAP_BP
@@ -358,7 +399,7 @@ SYM_CODE_START(\asmsym)
.Lfrom_usermode_no_gap_\@:
.endif
- idtentry_body \cfunc \has_error_code
+ idtentry_body \cfunc \has_error_code \kernel_reentry_fn
_ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
@@ -375,7 +416,7 @@ SYM_CODE_END(\asmsym)
*/
.macro idtentry_irq vector cfunc
.p2align CONFIG_X86_L1_CACHE_SHIFT
- idtentry \vector asm_\cfunc \cfunc has_error_code=1
+ idtentry \vector asm_\cfunc \cfunc has_error_code=1 kernel_reentry_fn=switch_to_kstack
.endm
/**
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 462fc34f1317..5bce3259edee 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -26,6 +26,12 @@
char DB_stack[EXCEPTION_STKSZ]; \
char MCE_stack_guard[guardsize]; \
char MCE_stack[EXCEPTION_STKSZ]; \
+ char PF_stack_guard[guardsize]; \
+ char PF_stack[EXCEPTION_STKSZ]; \
+ char PF2_stack_guard[guardsize]; \
+ char PF2_stack[EXCEPTION_STKSZ]; \
+ char UDI_stack_guard[guardsize]; \
+ char UDI_stack[EXCEPTION_STKSZ]; \
char VC_stack_guard[guardsize]; \
char VC_stack[optional_stack_size]; \
char VC2_stack_guard[guardsize]; \
@@ -50,6 +56,9 @@ enum exception_stack_ordering {
ESTACK_NMI,
ESTACK_DB,
ESTACK_MCE,
+ ESTACK_PF,
+ ESTACK_PF2,
+ ESTACK_UDI,
ESTACK_VC,
ESTACK_VC2,
N_EXCEPTION_STACKS
@@ -144,6 +153,15 @@ static __always_inline struct entry_stack *cpu_entry_stack(int cpu)
return &get_cpu_entry_area(cpu)->entry_stack_page.stack;
}
+#ifdef CONFIG_DYNAMIC_STACK
+bool is_pf_ist_stack(unsigned long addr);
+#else
+static inline bool is_pf_ist_stack(unsigned long addr)
+{
+ return false;
+}
+#endif
+
#define __this_cpu_ist_top_va(name) \
CEA_ESTACK_TOP(__this_cpu_read(cea_exception_stacks), name)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 42bf6a58ec36..d8c846d28a1d 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -163,6 +163,16 @@ noinstr void fred_##func(struct pt_regs *regs)
#define DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func) \
DECLARE_IDTENTRY_ERRORCODE(vector, func)
+/**
+ * DECLARE_IDTENTRY_PF - Declare functions for page fault entry point
+ * @vector: Vector number (ignored for C)
+ * @func: Function name of the entry point
+ *
+ * Maps to @DECLARE_IDTENTRY_ERRORCODE().
+ */
+#define DECLARE_IDTENTRY_PF(vector, func) \
+ DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func)
+
/**
* DEFINE_IDTENTRY_RAW_ERRORCODE - Emit code for raw IDT entry points
* @func: Function name of the entry point
@@ -391,6 +401,15 @@ static __always_inline void __##func(struct pt_regs *regs)
#define DEFINE_IDTENTRY_DF(func) \
DEFINE_IDTENTRY_RAW_ERRORCODE(func)
+/**
+ * DEFINE_IDTENTRY_PF - Emit code for page fault
+ * @func: Function name of the entry point
+ *
+ * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE
+ */
+#define DEFINE_IDTENTRY_PF(func) \
+ DEFINE_IDTENTRY_RAW_ERRORCODE(func)
+
/**
* DEFINE_IDTENTRY_VC_KERNEL - Emit code for VMM communication handler
* when raised from kernel mode
@@ -480,6 +499,15 @@ void fred_install_sysvec(unsigned int vector, const idtentry_t function);
#define DECLARE_IDTENTRY_ERRORCODE(vector, func) \
idtentry vector asm_##func func has_error_code=1
+#ifdef CONFIG_DYNAMIC_STACK
+#define DECLARE_IDTENTRY_PF(vector, func) \
+ idtentry vector asm_##func func has_error_code=1 \
+ kernel_reentry_fn=handle_dynamic_stack_kernel_faults
+#else
+#define DECLARE_IDTENTRY_PF(vector, func) \
+ DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func)
+#endif
+
/* Special case for 32bit IRET 'trap'. Do not emit ASM code */
#define DECLARE_IDTENTRY_SW(vector, func)
@@ -494,8 +522,14 @@ void fred_install_sysvec(unsigned int vector, const idtentry_t function);
idtentry_irq vector func
/* System vector entries */
+#ifdef CONFIG_DYNAMIC_STACK
+#define DECLARE_IDTENTRY_SYSVEC(vector, func) \
+ idtentry vector asm_##func func has_error_code=0 \
+ kernel_reentry_fn=switch_to_kstack
+#else
#define DECLARE_IDTENTRY_SYSVEC(vector, func) \
DECLARE_IDTENTRY(vector, func)
+#endif
#ifdef CONFIG_X86_64
# define DECLARE_IDTENTRY_MCE(vector, func) \
@@ -615,7 +649,7 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC, exc_alignment_check);
/* Raw exception entries which need extra work */
DECLARE_IDTENTRY_RAW(X86_TRAP_UD, exc_invalid_op);
DECLARE_IDTENTRY_RAW(X86_TRAP_BP, exc_int3);
-DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_PF, exc_page_fault);
+DECLARE_IDTENTRY_PF(X86_TRAP_PF, exc_page_fault);
#if defined(CONFIG_IA32_EMULATION)
DECLARE_IDTENTRY_RAW(IA32_SYSCALL_VECTOR, int80_emulation);
@@ -699,7 +733,7 @@ DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR, sysvec_x86_platform_ipi);
#endif
#ifdef CONFIG_SMP
-DECLARE_IDTENTRY(RESCHEDULE_VECTOR, sysvec_reschedule_ipi);
+DECLARE_IDTENTRY_SYSVEC(RESCHEDULE_VECTOR, sysvec_reschedule_ipi);
DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR, sysvec_reboot);
DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR, sysvec_call_function_single);
DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_VECTOR, sysvec_call_function);
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 7400dab373fe..b0b60f83a531 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -28,7 +28,15 @@
#define IST_INDEX_NMI 1
#define IST_INDEX_DB 2
#define IST_INDEX_MCE 3
-#define IST_INDEX_VC 4
+#define IST_INDEX_PF 4
+#define IST_INDEX_UDI 5
+#define IST_INDEX_VC 6
+
+/*
+ * Offset used for some IST stacks to reserve a slot for a re-entry
+ * canary. At the very top of the stack for cache friendliness.
+ */
+#define IST_ENTRY_OFFSET 8
/*
* Set __PAGE_OFFSET to the most negative possible address +
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a24c7805acdb..fa790731dea0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -573,6 +573,12 @@ static inline void load_sp0(unsigned long sp0)
#endif /* CONFIG_PARAVIRT_XXL */
+#ifdef CONFIG_DYNAMIC_STACK
+void install_nmi_pf_stack(bool use_nmi_pf_stack);
+#else
+static inline void install_nmi_pf_stack(bool use_nmi_pf_stack) {}
+#endif
+
unsigned long __get_wchan(struct task_struct *p);
extern void select_idle_routine(void);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ec0670114efa..d90a01e2fdd2 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2377,6 +2377,8 @@ static inline void tss_setup_ist(struct tss_struct *tss)
tss->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(NMI);
tss->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(DB);
tss->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(MCE);
+ tss->x86_tss.ist[IST_INDEX_PF] = __this_cpu_ist_top_va(PF) - IST_ENTRY_OFFSET;
+ tss->x86_tss.ist[IST_INDEX_UDI] = __this_cpu_ist_top_va(UDI) - IST_ENTRY_OFFSET;
/* Only mapped when SEV-ES is active */
tss->x86_tss.ist[IST_INDEX_VC] = __this_cpu_ist_top_va(VC);
}
@@ -2665,3 +2667,12 @@ void __init arch_cpu_finalize_init(void)
*/
mem_encrypt_init();
}
+
+#ifdef CONFIG_DYNAMIC_STACK
+noinstr void install_nmi_pf_stack(bool use_nmi_pf_stack)
+{
+ unsigned long stack = use_nmi_pf_stack ? __this_cpu_ist_top_va(PF2)
+ : __this_cpu_ist_top_va(PF);
+ this_cpu_write(cpu_tss_rw.x86_tss.ist[IST_INDEX_PF], stack - IST_ENTRY_OFFSET);
+}
+#endif
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 6c5defd6569a..6784d31d3eb3 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -24,13 +24,16 @@ static const char * const exception_stack_names[] = {
[ ESTACK_NMI ] = "NMI",
[ ESTACK_DB ] = "#DB",
[ ESTACK_MCE ] = "#MC",
+ [ ESTACK_PF ] = "#PF",
+ [ ESTACK_PF2 ] = "#PF2",
+ [ ESTACK_UDI ] = "#UDI",
[ ESTACK_VC ] = "#VC",
[ ESTACK_VC2 ] = "#VC2",
};
const char *stack_type_name(enum stack_type type)
{
- BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+ BUILD_BUG_ON(N_EXCEPTION_STACKS != 9);
if (type == STACK_TYPE_TASK)
return "TASK";
@@ -87,6 +90,9 @@ struct estack_pages estack_pages[CEA_ESTACK_PAGES] ____cacheline_aligned = {
EPAGERANGE(NMI),
EPAGERANGE(DB),
EPAGERANGE(MCE),
+ EPAGERANGE(PF),
+ EPAGERANGE(PF2),
+ EPAGERANGE(UDI),
EPAGERANGE(VC),
EPAGERANGE(VC2),
};
@@ -98,7 +104,7 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
struct pt_regs *regs;
unsigned int k;
- BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+ BUILD_BUG_ON(N_EXCEPTION_STACKS != 9);
begin = (unsigned long)__this_cpu_read(cea_exception_stacks);
/*
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 260456588756..7626fa7adfb3 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -116,6 +116,10 @@ static const __initconst struct idt_data def_idts[] = {
ISTG(X86_TRAP_VC, asm_exc_vmm_communication, IST_INDEX_VC),
#endif
+#ifdef CONFIG_DYNAMIC_STACK
+ ISTG(X86_TRAP_PF, asm_exc_page_fault, IST_INDEX_PF),
+#endif
+
SYSG(X86_TRAP_OF, asm_exc_overflow),
};
@@ -127,47 +131,55 @@ static const struct idt_data ia32_idt[] __initconst = {
#endif
};
+#ifdef CONFIG_DYNAMIC_STACK
+#define EXTERNAL_INTR(_vector, _addr) ISTG(_vector, _addr, IST_INDEX_UDI)
+#define EXTERNAL_INTR_IST_VALUE (IST_INDEX_UDI + 1)
+#else
+#define EXTERNAL_INTR(_vector, _addr) INTG(_vector, _addr)
+#define EXTERNAL_INTR_IST_VALUE 0
+#endif
+
/*
* The APIC and SMP idt entries
*/
static const __initconst struct idt_data apic_idts[] = {
#ifdef CONFIG_SMP
- INTG(RESCHEDULE_VECTOR, asm_sysvec_reschedule_ipi),
- INTG(CALL_FUNCTION_VECTOR, asm_sysvec_call_function),
- INTG(CALL_FUNCTION_SINGLE_VECTOR, asm_sysvec_call_function_single),
- INTG(REBOOT_VECTOR, asm_sysvec_reboot),
+ EXTERNAL_INTR(RESCHEDULE_VECTOR, asm_sysvec_reschedule_ipi),
+ EXTERNAL_INTR(CALL_FUNCTION_VECTOR, asm_sysvec_call_function),
+ EXTERNAL_INTR(CALL_FUNCTION_SINGLE_VECTOR, asm_sysvec_call_function_single),
+ EXTERNAL_INTR(REBOOT_VECTOR, asm_sysvec_reboot),
#endif
#ifdef CONFIG_X86_THERMAL_VECTOR
- INTG(THERMAL_APIC_VECTOR, asm_sysvec_thermal),
+ EXTERNAL_INTR(THERMAL_APIC_VECTOR, asm_sysvec_thermal),
#endif
#ifdef CONFIG_X86_MCE_THRESHOLD
- INTG(THRESHOLD_APIC_VECTOR, asm_sysvec_threshold),
+ EXTERNAL_INTR(THRESHOLD_APIC_VECTOR, asm_sysvec_threshold),
#endif
#ifdef CONFIG_X86_MCE_AMD
- INTG(DEFERRED_ERROR_VECTOR, asm_sysvec_deferred_error),
+ EXTERNAL_INTR(DEFERRED_ERROR_VECTOR, asm_sysvec_deferred_error),
#endif
#ifdef CONFIG_X86_LOCAL_APIC
- INTG(LOCAL_TIMER_VECTOR, asm_sysvec_apic_timer_interrupt),
- INTG(X86_PLATFORM_IPI_VECTOR, asm_sysvec_x86_platform_ipi),
+ EXTERNAL_INTR(LOCAL_TIMER_VECTOR, asm_sysvec_apic_timer_interrupt),
+ EXTERNAL_INTR(X86_PLATFORM_IPI_VECTOR, asm_sysvec_x86_platform_ipi),
# if IS_ENABLED(CONFIG_KVM)
- INTG(POSTED_INTR_VECTOR, asm_sysvec_kvm_posted_intr_ipi),
- INTG(POSTED_INTR_WAKEUP_VECTOR, asm_sysvec_kvm_posted_intr_wakeup_ipi),
- INTG(POSTED_INTR_NESTED_VECTOR, asm_sysvec_kvm_posted_intr_nested_ipi),
+ EXTERNAL_INTR(POSTED_INTR_VECTOR, asm_sysvec_kvm_posted_intr_ipi),
+ EXTERNAL_INTR(POSTED_INTR_WAKEUP_VECTOR, asm_sysvec_kvm_posted_intr_wakeup_ipi),
+ EXTERNAL_INTR(POSTED_INTR_NESTED_VECTOR, asm_sysvec_kvm_posted_intr_nested_ipi),
# endif
#ifdef CONFIG_GUEST_PERF_EVENTS
INTG(PERF_GUEST_MEDIATED_PMI_VECTOR, asm_sysvec_perf_guest_mediated_pmi_handler),
#endif
# ifdef CONFIG_IRQ_WORK
- INTG(IRQ_WORK_VECTOR, asm_sysvec_irq_work),
+ EXTERNAL_INTR(IRQ_WORK_VECTOR, asm_sysvec_irq_work),
# endif
- INTG(SPURIOUS_APIC_VECTOR, asm_sysvec_spurious_apic_interrupt),
- INTG(ERROR_APIC_VECTOR, asm_sysvec_error_interrupt),
+ EXTERNAL_INTR(SPURIOUS_APIC_VECTOR, asm_sysvec_spurious_apic_interrupt),
+ EXTERNAL_INTR(ERROR_APIC_VECTOR, asm_sysvec_error_interrupt),
# ifdef CONFIG_X86_POSTED_MSI
- INTG(POSTED_MSI_NOTIFICATION_VECTOR, asm_sysvec_posted_msi_notification),
+ EXTERNAL_INTR(POSTED_MSI_NOTIFICATION_VECTOR, asm_sysvec_posted_msi_notification),
# endif
#endif
};
@@ -206,11 +218,12 @@ idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sy
}
}
-static __init void set_intr_gate(unsigned int n, const void *addr)
+static __init void set_intr_gate(unsigned int n, const void *addr, int ist)
{
struct idt_data data;
init_idt_data(&data, n, addr);
+ data.bits.ist = ist;
idt_setup_from_table(idt_table, &data, 1, false);
}
@@ -293,7 +306,7 @@ void __init idt_setup_apic_and_irq_gates(void)
for_each_clear_bit_from(i, system_vectors, FIRST_SYSTEM_VECTOR) {
entry = irq_entries_start + IDT_ALIGN * (i - FIRST_EXTERNAL_VECTOR);
- set_intr_gate(i, entry);
+ set_intr_gate(i, entry, EXTERNAL_INTR_IST_VALUE);
}
#ifdef CONFIG_X86_LOCAL_APIC
@@ -304,7 +317,7 @@ void __init idt_setup_apic_and_irq_gates(void)
* /proc/interrupts.
*/
entry = spurious_entries_start + IDT_ALIGN * (i - FIRST_SYSTEM_VECTOR);
- set_intr_gate(i, entry);
+ set_intr_gate(i, entry, EXTERNAL_INTR_IST_VALUE);
}
#endif
/* Map IDT into CPU entry area and reload it. */
@@ -325,10 +338,10 @@ void __init idt_setup_early_handler(void)
int i;
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
- set_intr_gate(i, early_idt_handler_array[i]);
+ set_intr_gate(i, early_idt_handler_array[i], DEFAULT_STACK);
#ifdef CONFIG_X86_32
for ( ; i < NR_VECTORS; i++)
- set_intr_gate(i, early_ignore_irq);
+ set_intr_gate(i, early_ignore_irq, DEFAULT_STACK);
#endif
load_idt(&idt_descr);
}
@@ -352,5 +365,5 @@ void __init idt_install_sysvec(unsigned int n, const void *function)
return;
if (!WARN_ON(test_and_set_bit(n, system_vectors)))
- set_intr_gate(n, function);
+ set_intr_gate(n, function, EXTERNAL_INTR_IST_VALUE);
}
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 3d239ed12744..a2444b9d5b71 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -37,6 +37,7 @@
#include <asm/microcode.h>
#include <asm/sev.h>
#include <asm/fred.h>
+#include <asm/cpu_entry_area.h>
#define CREATE_TRACE_POINTS
#include <trace/events/nmi.h>
@@ -581,6 +582,11 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
if (IS_ENABLED(CONFIG_NMI_CHECK_CPU) && ignore_nmis) {
WRITE_ONCE(nsp->idt_ignored, nsp->idt_ignored + 1);
} else if (!ignore_nmis) {
+ bool protect_pf_ist_stack = is_pf_ist_stack(regs->sp);
+
+ if (protect_pf_ist_stack)
+ install_nmi_pf_stack(true);
+
if (IS_ENABLED(CONFIG_NMI_CHECK_CPU)) {
WRITE_ONCE(nsp->idt_nmi_seq, nsp->idt_nmi_seq + 1);
WARN_ON_ONCE(!(nsp->idt_nmi_seq & 0x1));
@@ -590,6 +596,9 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
WRITE_ONCE(nsp->idt_nmi_seq, nsp->idt_nmi_seq + 1);
WARN_ON_ONCE(nsp->idt_nmi_seq & 0x1);
}
+
+ if (protect_pf_ist_stack)
+ install_nmi_pf_stack(false);
}
irqentry_nmi_exit(regs, irq_state);
diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c
index 24b48af27417..75b9f851f428 100644
--- a/arch/x86/lib/usercopy.c
+++ b/arch/x86/lib/usercopy.c
@@ -9,6 +9,7 @@
#include <linux/instrumented.h>
#include <asm/tlbflush.h>
+#include <asm/cpu_entry_area.h>
/**
* copy_from_user_nmi - NMI safe copy from user
@@ -39,6 +40,14 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
if (!nmi_uaccess_okay())
return n;
+ /*
+ * IST stacks aren't reentrant, so bail before the possibility of
+ * a #PF. While on the #PF IST stack, we should only need this
+ * function for stack dumps (WARN/panic/etc).
+ */
+ if (is_pf_ist_stack(current_stack_pointer))
+ return n;
+
/*
* Even though this function is typically called from NMI/IRQ context
* disable pagefaults so that its behaviour is consistent even when
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 575f863f3c75..97ac91c109ed 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -156,6 +156,12 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
cea_map_stack(DB);
cea_map_stack(MCE);
+ if (IS_ENABLED(CONFIG_DYNAMIC_STACK)) {
+ cea_map_stack(PF);
+ cea_map_stack(PF2);
+ cea_map_stack(UDI);
+ }
+
if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
cea_map_stack(VC);
@@ -173,6 +179,17 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
}
#endif
+#ifdef CONFIG_DYNAMIC_STACK
+bool noinstr is_pf_ist_stack(unsigned long addr)
+{
+ struct cea_exception_stacks *cs = __this_cpu_read(cea_exception_stacks);
+ unsigned long top = CEA_ESTACK_TOP(cs, PF2);
+ unsigned long bot = CEA_ESTACK_BOT(cs, PF);
+
+ return addr >= bot && addr < top;
+}
+#endif
+
/* Setup the fixmap mappings only once per-processor */
static void __init setup_cpu_entry_area(unsigned int cpu)
{
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 40d518d9f562..48ef50982c06 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1482,16 +1482,61 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
#ifdef CONFIG_DYNAMIC_STACK
-static noinstr unsigned long copy_stack_data(struct pt_regs *regs)
+static noinstr unsigned long copy_stack_data(struct pt_regs *regs, bool is_dynamic_stack_fault)
{
unsigned long new_sp;
unsigned long data_len;
+ bool must_avoid_dynamic_stack_fault;
- new_sp = regs->sp - (FRED_CONFIG_REDZONE_AMOUNT << 6);
- new_sp &= FRED_STACK_FRAME_RSP_MASK;
- data_len = sizeof(struct fred_frame);
+ if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+ new_sp = regs->sp - (FRED_CONFIG_REDZONE_AMOUNT << 6);
+ new_sp &= FRED_STACK_FRAME_RSP_MASK;
+ data_len = sizeof(struct fred_frame);
+ must_avoid_dynamic_stack_fault = false;
+ } else {
+ // Hardware aligns sp to a 16 byte boundary when going through the IDT.
+ new_sp = ALIGN_DOWN(regs->sp, 16);
+ data_len = sizeof(struct pt_regs);
+ must_avoid_dynamic_stack_fault = is_dynamic_stack_fault;
+ }
new_sp -= data_len;
+ if (must_avoid_dynamic_stack_fault) {
+ bool new_sp_on_stack;
+
+ /*
+ * We don't have to worry about the window where current_task
+ * is inconsistent during a context switch because interrupts
+ * are disabled during that window and the only #PF that can
+ * happen there is a dynamic stack fault, in which case we
+ * return directly from handle_dynamic_stack_kernel_faults().
+ */
+ if (!in_nmi())
+ dynamic_stack_fault(current, new_sp, &new_sp_on_stack);
+ else
+ new_sp_on_stack = false;
+
+ /*
+ * If new_sp isn't on the current task's stack, verify that it's
+ * on an exception/irq/entry stack. This is a little expensive,
+ * but #PFs in those contexts should be rare.
+ */
+ if (!new_sp_on_stack) {
+ struct stack_info info, info2;
+
+ if (!get_stack_info_noinstr((void *)new_sp, current, &info)) {
+ instrumentation_begin();
+ if (get_stack_info_noinstr((void *)(new_sp - PAGE_SIZE),
+ current, &info2)) {
+ pr_emerg("Stack overflow during stack switch\n");
+ handle_stack_overflow(regs, new_sp, &info2);
+ } else {
+ die("Stack switch back to unknown stack", regs, 0);
+ }
+ }
+ }
+ }
+
memcpy((void *)new_sp, regs, data_len);
return new_sp;
@@ -1499,7 +1544,7 @@ static noinstr unsigned long copy_stack_data(struct pt_regs *regs)
__visible noinstr unsigned long switch_to_kstack(struct pt_regs *regs)
{
- return copy_stack_data(regs);
+ return copy_stack_data(regs, false);
}
#define ALIGN_TO_STACK(addr) ((addr) & ~(THREAD_ALIGN - 1))
@@ -1510,7 +1555,7 @@ __visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_reg
struct task_struct *tsk;
bool on_stack;
- address = fred_event_data(regs);
+ address = cpu_feature_enabled(X86_FEATURE_FRED) ? fred_event_data(regs) : read_cr2();
if (fault_in_kernel_space(address) && !in_nmi()) {
tsk = task_from_stack_address(address);
@@ -1522,18 +1567,19 @@ __visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_reg
}
/*
- * The regular fault handler won't sleep when executing in an
- * atomic context, so we can complete the #PF directly on the
- * #PF stack.
+ * The regular fault handler won't sleep when executing in an atomic
+ * context, so we can complete the #PF directly on the #PF stack.
+ * However, IST doesn't support nested exceptions, so we need to avoid
+ * running any non-noinstr code on the IST #PF stack.
*/
- if (in_atomic())
+ if (in_atomic() && cpu_feature_enabled(X86_FEATURE_FRED))
return (unsigned long)regs;
else
- return copy_stack_data(regs);
+ return copy_stack_data(regs, true);
}
#endif
-DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
+DEFINE_IDTENTRY_PF(exc_page_fault)
{
irqentry_state_t state;
unsigned long address;
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
^ permalink raw reply related [flat|nested] 21+ messages in thread* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
` (12 preceding siblings ...)
2026-04-24 19:14 ` [PATCH v2 13/13] x86: Add support for dynamic kernel stacks via IST David Stevens
@ 2026-04-24 19:41 ` Dave Hansen
2026-04-24 21:35 ` Pasha Tatashin
2026-04-25 9:19 ` H. Peter Anvin
13 siblings, 2 replies; 21+ messages in thread
From: Dave Hansen @ 2026-04-24 19:41 UTC (permalink / raw)
To: David Stevens, Pasha Tatashin, Linus Walleij, Will Deacon,
Quentin Perret, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski, Xin Li,
Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: linux-kernel, linux-mm
On 4/24/26 12:14, David Stevens wrote:
> The question is then: is this approach something that is fundamentally
> untenable in the kernel
Yes. Fundamentally untenable.
Not allowing stack faults has been a wonderful simplification. It's one
of those things that just plain makes the kernel easier to maintain.
Saving low single digits of system memory is not exactly making me eager
to go back to the harder-to-maintain days.
I seriously doubt that this 1% is the lowest hanging fruit for memory
bloat on these systems. ;)
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
@ 2026-04-24 21:35 ` Pasha Tatashin
2026-04-24 22:21 ` Dave Hansen
2026-04-24 22:26 ` David Laight
2026-04-25 9:19 ` H. Peter Anvin
1 sibling, 2 replies; 21+ messages in thread
From: Pasha Tatashin @ 2026-04-24 21:35 UTC (permalink / raw)
To: Dave Hansen
Cc: David Stevens, Pasha Tatashin, Linus Walleij, Will Deacon,
Quentin Perret, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski, Xin Li,
Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
linux-kernel, linux-mm
On 04-24 12:41, Dave Hansen wrote:
> On 4/24/26 12:14, David Stevens wrote:
> > The question is then: is this approach something that is fundamentally
> > untenable in the kernel
>
> Yes. Fundamentally untenable.
>
> Not allowing stack faults has been a wonderful simplification. It's one
> of those things that just plain makes the kernel easier to maintain.
> Saving low single digits of system memory is not exactly making me eager
> to go back to the harder-to-maintain days.
>
> I seriously doubt that this 1% is the lowest hanging fruit for memory
> bloat on these systems. ;)
This is true until, in a fleet of millions of machines, you encounter a
one-in-a-billion chance of a stack overflow. You are then forced to
double the statically allocated kernel stacks on every machine, paying a
memory tax even though 99.999..% of threads never exceed 4K. This
overhead accumulates to petabytes of wasted capacity.
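
(Back-of-envelope, with round numbers purely for illustration: doubling a
16K stack to 32K costs 16K of extra populated memory per thread; at 10^6
threads per machine that is ~16 GB per machine, and across 10^6 machines
roughly 16 PB.)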
Pasha
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 21:35 ` Pasha Tatashin
@ 2026-04-24 22:21 ` Dave Hansen
2026-04-24 22:49 ` David Stevens
2026-04-24 22:26 ` David Laight
1 sibling, 1 reply; 21+ messages in thread
From: Dave Hansen @ 2026-04-24 22:21 UTC (permalink / raw)
To: Pasha Tatashin
Cc: David Stevens, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
linux-kernel, linux-mm
On 4/24/26 14:35, Pasha Tatashin wrote:
> On 04-24 12:41, Dave Hansen wrote:
>> On 4/24/26 12:14, David Stevens wrote:
>>> The question is then: is this approach something that is fundamentally
>>> untenable in the kernel
>> Yes. Fundamentally untenable.
>>
>> Not allowing stack faults has been a wonderful simplification. It's one
>> of those things that just plain makes the kernel easier to maintain.
>> Saving low single digits of system memory is not exactly making me eager
>> to go back to the harder-to-maintain days.
>>
>> I seriously doubt that this 1% is the lowest hanging fruit for memory
>> bloat on these systems. 😉
> This is true until, in a fleet of millions of machines, you encounter a
> one-in-a-billion chance of a stack overflow. You are then forced to
> double the statically allocated kernel stacks on every machine, paying a
> memory tax even though 99.999..% of threads never exceed 4K. This
> overhead accumulates to petabytes of wasted capacity.
I don't disagree with you. But, at that point, you're picking your
poison: bugs from dynamic kernel stacks versus crashes from stack overflows.
At some point, I might be able to be talked into dynamic stack as a
FRED-only feature. But FRED isn't widespread enough to go to the trouble
today. I'm sure the folks who want this also don't want to wait until
all the devices in the field have FRED because that is even *longer* off.
So maybe this is one of those things that folks just need to deploy
out-of-tree for a couple of years, come back with some data to show us
that we were just paranoid, and we'll look at it again.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 22:21 ` Dave Hansen
@ 2026-04-24 22:49 ` David Stevens
0 siblings, 0 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 22:49 UTC (permalink / raw)
To: Dave Hansen
Cc: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
linux-kernel, linux-mm
On Fri, Apr 24, 2026 at 3:21 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 4/24/26 14:35, Pasha Tatashin wrote:
> > On 04-24 12:41, Dave Hansen wrote:
> >> On 4/24/26 12:14, David Stevens wrote:
> >>> The question is then: is this approach something that is fundamentally
> >>> untenable in the kernel
> >> Yes. Fundamentally untenable.
> >>
> >> Not allowing stack faults has been a wonderful simplification. It's one
> >> of those things that just plain makes the kernel easier to maintain.
> >> Saving low single digits of system memory is not exactly making me eager
> >> to go back to the harder-to-maintain days.
> >>
> >> I seriously doubt that this 1% is the lowest hanging fruit for memory
> >> bloat on these systems. 😉
> > This is true until, in a fleet of millions of machines, you encounter a
> > one-in-a-billion chance of a stack overflow. You are then forced to
> > double the statically allocated kernel stacks on every machine, paying a
> > memory tax even though 99.999..% of threads never exceed 4K. This
> > overhead accumulates to petabytes of wasted capacity.
>
> I don't disagree with you. But, at that point, you're picking your
> poison: bugs from dynamic kernel stacks versus crashes from stack overflows.
>
> At some point, I might be able to be talked into dynamic stack as a
> FRED-only feature. But FRED isn't widespread enough to go to the trouble
> today. I'm sure the folks who want this also don't want to wait until
> all the devices in the field have FRED because that is even *longer* off.
Why does this need to be FRED only? True, the lack of reentrancy with
IST stacks complicates a few situations. That adds some complexity
beyond what's needed for FRED-only support, but the additional
complexity doesn't really seem like a hard blocker, at least if we
accept the complexity of kernel stack faults for FRED.
-David
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 21:35 ` Pasha Tatashin
2026-04-24 22:21 ` Dave Hansen
@ 2026-04-24 22:26 ` David Laight
2026-04-24 23:06 ` Pasha Tatashin
1 sibling, 1 reply; 21+ messages in thread
From: David Laight @ 2026-04-24 22:26 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Dave Hansen, David Stevens, Linus Walleij, Will Deacon,
Quentin Perret, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski, Xin Li,
Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook,
linux-kernel, linux-mm
On Fri, 24 Apr 2026 21:35:20 +0000
Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> On 04-24 12:41, Dave Hansen wrote:
> > On 4/24/26 12:14, David Stevens wrote:
> > > The question is then: is this approach something that is fundamentally
> > > untenable in the kernel
> >
> > Yes. Fundamentally untenable.
> >
> > Not allowing stack faults has been a wonderful simplification. It's one
> > of those things that just plain makes the kernel easier to maintain.
> > Saving low single digits of system memory is not exactly making me eager
> > to go back to the harder-to-maintain days.
> >
> > I seriously doubt that this 1% is the lowest hanging fruit for memory
> > bloat on these systems. ;)
>
> This is true until, in a fleet of millions of machines, you encounter a
> one-in-a-billion chance of a stack overflow. You are then forced to
> double the statically allocated kernel stacks on every machine, paying a
> memory tax even though 99.999..% of threads never exceed 4K. This
> overhead accumulates to petabytes of wasted capacity.
And then you hit a stack fault in some path where you can't sleep and
there isn't any available kernel memory.
An alternative idea is to arrange for some system calls to sleep in
userspace, so when the thread is woken it re-executes the system call.
It then makes sense to assign the kernel stack to the process when
it enters the kernel.
That might mean that you don't need a kernel stack for all the threads
sleeping in futex() - it might even be possible to do the retry in
userspace saving the second kernel entry most of the time.
It is all 'hard and difficult' though.
The easier solution is to rewrite the system code so it doesn't have
1000s of threads :-)
David
>
> Pasha
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 22:26 ` David Laight
@ 2026-04-24 23:06 ` Pasha Tatashin
0 siblings, 0 replies; 21+ messages in thread
From: Pasha Tatashin @ 2026-04-24 23:06 UTC (permalink / raw)
To: David Laight
Cc: Pasha Tatashin, Dave Hansen, David Stevens, Linus Walleij,
Will Deacon, Quentin Perret, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Andy Lutomirski, Xin Li, Peter Zijlstra, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Uladzislau Rezki, Kees Cook, linux-kernel, linux-mm, willy
On 04-24 23:26, David Laight wrote:
> On Fri, 24 Apr 2026 21:35:20 +0000
> Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> > On 04-24 12:41, Dave Hansen wrote:
> > > On 4/24/26 12:14, David Stevens wrote:
> > > > The question is then: is this approach something that is fundamentally
> > > > untenable in the kernel
> > >
> > > Yes. Fundamentally untenable.
> > >
> > > Not allowing stack faults has been a wonderful simplification. It's one
> > > of those things that just plain makes the kernel easier to maintain.
> > > Saving low single digits of system memory is not exactly making me eager
> > > to go back to the harder-to-maintain days.
> > >
> > > I seriously doubt that this 1% is the lowest hanging fruit for memory
> > > bloat on these systems. ;)
> >
> > This is true until, in a fleet of millions of machines, you encounter a
> > one-in-a-billion chance of a stack overflow. You are then forced to
> > double the statically allocated kernel stacks on every machine, paying a
> > memory tax even though 99.999..% of threads never exceed 4K. This
> > overhead accumulates to petabytes of wasted capacity.
>
> And then you hit a stack fault in some path where you can't sleep and
> there isn't any available kernel memory.
Well, at least if we hit this rare case, we can simply double a buffer
of pre-reserved stack memory per CPU. This still saves significant
memory compared to wasting it on every single thread.
> An alternative idea is to arrange for some system calls to sleep in
> userspace, so when the thread is woken it re-executes the system call.
> It then makes sense to assign the kernel stack to the process when
> it enters the kernel.
> That might mean that you don't need a kernel stack for all the threads
> sleeping in futex() - it might even be possible to do the retry in
> userspace saving the second kernel entry most of the time.
> It is all 'hard and difficult' though.
I was thinking about a similar approach as well—sort of multiplexing the
kernel stacks. But honestly, when trying to cover all the edge cases, I
didn't find it to be any better or easier than just using dynamic kernel
stacks.
An alternative approach, which was proposed at LSFMM by Willy, is to add
an explicit deep stack calls. When we enter a path that we know is
exceptionally deep, only then do we extend the stack, keeping the
default (say, 8K) everywhere else.
> The easier solution is to rewrite the system code so it doesn't have
> 1000s of threads :-)
That ship sailed in the early 90s of the previous millennium. Nowadays,
we have high end workstations with almost 200 hardware threads.
Rewriting system code to reduce thread counts simply isn't an option for
our storage machines, which have millions of threads per unit.
+CC Matthew Wilcox
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 00/13] Dynamic Kernel Stacks
2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
2026-04-24 21:35 ` Pasha Tatashin
@ 2026-04-25 9:19 ` H. Peter Anvin
1 sibling, 0 replies; 21+ messages in thread
From: H. Peter Anvin @ 2026-04-25 9:19 UTC (permalink / raw)
To: Dave Hansen, David Stevens, Pasha Tatashin, Linus Walleij,
Will Deacon, Quentin Perret, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Andy Lutomirski, Xin Li,
Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: linux-kernel, linux-mm
On 2026-04-24 12:41, Dave Hansen wrote:
> On 4/24/26 12:14, David Stevens wrote:
>> The question is then: is this approach something that is fundamentally
>> untenable in the kernel
>
> Yes. Fundamentally untenable.
>
> Not allowing stack faults has been a wonderful simplification. It's one
> of those things that just plain makes the kernel easier to maintain.
> Saving low single digits of system memory is not exactly making me eager
> to go back to the harder-to-maintain days.
>
> I seriously doubt that this 1% is the lowest hanging fruit for memory
> bloat on these systems. ;)
It is worth noting that this was one of the VERY early design decisions that
has shaped Linux from the beginning:
- No swapping of kernel memory
- Kernel stacks are statically allocated
- Physical RAM is mapped into the kernel at all times
- A "monolithic" kernel using function calls, not message passing
- A kernel interface that closely maps to the low-level application API
(e.g. each user space thread is a kernel thread.)
- Kernel ABIs and APIs are subject to evolution; stability is only guaranteed
in user space.
Those design decisions are, by and large, what has made Linux Linux: a
relatively simple, highly performant, and reliable system.
-hpa
^ permalink raw reply [flat|nested] 21+ messages in thread