public inbox for linux-mm@kvack.org
* [PATCH v2 00/13] Dynamic Kernel Stacks
@ 2026-04-24 19:14 David Stevens
  2026-04-24 19:14 ` [PATCH v2 01/13] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE David Stevens
                   ` (13 more replies)
  0 siblings, 14 replies; 21+ messages in thread
From: David Stevens @ 2026-04-24 19:14 UTC (permalink / raw)
  To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Xin Li, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
  Cc: David Stevens, linux-kernel, linux-mm

This RFC is a continuation of Pasha Tatashin's original RFC [1], and is
based on Linus Walleij's rebased version of the patches [2]. My focus
was on x86_64 devices, so I didn't include his arm64 WIP patches.

The impetus for reviving this RFC is kernel stack usage on Android. On
regular Android (i.e. non-wear/automotive), system processes typically
have 2000-3000 threads. Once threads from app processes are included,
systems with 4GB of memory end up using 1-2% of total memory for kernel
thread stacks. Dynamic kernel stacks reduce this usage by 65%-70%.

The main change compared to Pasha's v1 RFC is how x86_64 handles kernel
stack faults. On systems where FRED is available, kernel page faults
are handled on stack level 1. When FRED isn't available, they are taken
on a dedicated IST stack. In both cases, page faults which aren't
dynamic stack faults are moved back onto the regular kernel stack. This
does introduce some overhead for page faults on user memory that
originate in the kernel (note that non-FRED systems already needed to
bounce userspace page faults through the entry stack), but such faults
aren't as hot a path as regular user page faults. There are certainly
systems where the memory savings are worth the overhead. That said, the
config could be made optional to give systems the option to pay the
memory cost to avoid the CPU overhead.

The biggest open issue is how to deal with reliability. This series uses
GFP_ATOMIC when refilling the per-CPU magazines during context switch,
which is necessary to avoid deadlock. This of course raises concerns
about allocation failure. If a magazine were depleted, refilling it
then failed because the atomic reserves were exhausted, and another
thread subsequently took a dynamic stack fault, the result would be a
fatal page fault.
There is also a secondary concern about additional pressure on the
memory reserves causing allocation failures at other atomic call sites.

The question is then: is this approach something that is fundamentally
untenable in the kernel, or are there compromises that would allow it to
be merged? One obvious compromise is to make the feature optional. Both
kernel stack faults and running out of memory reserves are rare events.
I've never seen this failure in my testing, although I don't have field
data to back that up at this point. Some sysadmins may view it as low
enough risk to be worth the memory savings. There are also additional
measures that could be taken to reduce the likelihood of failure (e.g.
magazine management on kernel entry/exit, tunable magazine sizes, adding
best-effort trylock reclaim or oom kill).

This series was developed and tested on devices running 6.18 kernels. It
has been rebased onto 7.0, with minimal smoke testing after rebasing.

[1] https://lore.kernel.org/all/20240311164638.2015063-1-pasha.tatashin@soleen.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-integrator.git/log/?h=b4/aarch64-dynamic-kernel-stacks-v6.18-rc1

David Stevens (7):
  fork: Don't assume fully populated stack during reuse
  fork: Move vm_stack to the beginning of the stack
  fork: Move vmap stack freeing to work queue
  fork: Store task pointer in unpopulated stack ptes
  x86/entry/fred: encode frame pointer on entry
  x86: Add support for dynamic kernel stacks via FRED
  x86: Add support for dynamic kernel stacks via IST

Pasha Tatashin (6):
  fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
  fork: separate vmap stack allocation and free calls
  mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public
    functions
  fork: Dynamic Kernel Stacks
  task_stack.h: Add stack_not_used() support for dynamic stack
  fork: Dynamic Kernel Stack accounting

 arch/Kconfig                          |  38 ++
 arch/x86/Kconfig                      |   1 +
 arch/x86/entry/entry_64.S             |  49 ++-
 arch/x86/entry/entry_64_fred.S        |  57 +++
 arch/x86/include/asm/cpu_entry_area.h |  18 +
 arch/x86/include/asm/idtentry.h       |  38 +-
 arch/x86/include/asm/page_64_types.h  |  10 +-
 arch/x86/include/asm/pgtable_64.h     |  36 ++
 arch/x86/include/asm/processor.h      |   6 +
 arch/x86/include/asm/traps.h          |   5 +
 arch/x86/kernel/cpu/common.c          |  11 +
 arch/x86/kernel/dumpstack_64.c        |  10 +-
 arch/x86/kernel/fred.c                |  20 +-
 arch/x86/kernel/idt.c                 |  57 +--
 arch/x86/kernel/nmi.c                 |   9 +
 arch/x86/lib/usercopy.c               |   9 +
 arch/x86/mm/cpu_entry_area.c          |  17 +
 arch/x86/mm/dump_pagetables.c         |  14 +-
 arch/x86/mm/fault.c                   | 101 +++++-
 include/linux/mmzone.h                |   3 +
 include/linux/sched.h                 |  11 +-
 include/linux/sched/task_stack.h      |  48 ++-
 include/linux/vmalloc.h               |  14 +
 init/init_task.c                      |   4 +
 kernel/exit.c                         |  22 ++
 kernel/fork.c                         | 481 ++++++++++++++++++++++++--
 kernel/sched/core.c                   |   1 +
 mm/memcontrol.c                       |  10 +
 mm/vmalloc.c                          |  27 +-
 mm/vmstat.c                           |   3 +
 30 files changed, 1049 insertions(+), 81 deletions(-)


base-commit: 028ef9c96e96197026887c0f092424679298aae8
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog





Thread overview: 21+ messages
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
2026-04-24 19:14 ` [PATCH v2 01/13] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE David Stevens
2026-04-24 19:14 ` [PATCH v2 02/13] fork: Don't assume fully populated stack during reuse David Stevens
2026-04-24 19:14 ` [PATCH v2 03/13] fork: Move vm_stack to the beginning of the stack David Stevens
2026-04-24 19:14 ` [PATCH v2 04/13] fork: separate vmap stack allocation and free calls David Stevens
2026-04-24 19:14 ` [PATCH v2 05/13] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public functions David Stevens
2026-04-24 19:14 ` [PATCH v2 06/13] fork: Move vmap stack freeing to work queue David Stevens
2026-04-24 19:14 ` [PATCH v2 07/13] fork: Dynamic Kernel Stacks David Stevens
2026-04-24 19:14 ` [PATCH v2 08/13] task_stack.h: Add stack_not_used() support for dynamic stack David Stevens
2026-04-24 19:14 ` [PATCH v2 09/13] fork: Dynamic Kernel Stack accounting David Stevens
2026-04-24 19:14 ` [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes David Stevens
2026-04-24 19:14 ` [PATCH v2 11/13] x86/entry/fred: encode frame pointer on entry David Stevens
2026-04-24 19:14 ` [PATCH v2 12/13] x86: Add support for dynamic kernel stacks via FRED David Stevens
2026-04-24 19:14 ` [PATCH v2 13/13] x86: Add support for dynamic kernel stacks via IST David Stevens
2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
2026-04-24 21:35   ` Pasha Tatashin
2026-04-24 22:21     ` Dave Hansen
2026-04-24 22:49       ` David Stevens
2026-04-24 22:26     ` David Laight
2026-04-24 23:06       ` Pasha Tatashin
2026-04-25  9:19   ` H. Peter Anvin
