From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 24 Apr 2026 12:14:43 -0700
X-Mailing-List: linux-kernel@vger.kernel.org
X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog
Message-ID: <20260424191456.2679717-1-stevensd@google.com>
Subject: [PATCH v2 00/13] Dynamic Kernel Stacks
From: David Stevens
To: Pasha Tatashin, Linus Walleij, Will Deacon, Quentin Perret,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, "H. Peter Anvin", Andy Lutomirski, Xin Li,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	"Liam R. Howlett", Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, Kees Cook
Cc: David Stevens, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Content-Type: text/plain; charset="UTF-8"

This RFC is a continuation of Pasha Tatashin's original RFC [1], and is
based on Linus Walleij's rebased version of the patches [2]. My focus
was x86_64 devices, so I didn't include his arm64 WIP patches.

The impetus for reviving this RFC is kernel stack usage on Android. On
regular Android (i.e. non-wear/automotive), system processes typically
have 2000-3000 threads. When adding threads from app processes, this
means that systems with 4GB of memory are using 1-2% of total memory for
kernel thread stacks. Dynamic kernel stacks reduce this by 65%-70%.

The main change compared to Pasha's v1 RFC is how x86_64 handles kernel
stack faults. On systems where FRED is available, it handles kernel page
faults on stack level 1. When FRED isn't available, it uses a dedicated
IST stack for page faults. In both cases, page faults which aren't
dynamic stack faults are moved back onto the regular kernel stack. This
does introduce some overhead for page faults on user memory that
originate in the kernel (note that non-FRED systems already needed to
bounce userspace page faults through the entry stack), but such faults
aren't as hot a path as regular user page faults. There are certainly
systems where the memory savings are worth the overhead. That said, the
config could be made optional to give systems the option to pay the
memory cost to avoid the CPU overhead.

The biggest open issue is how to deal with reliability.
This series uses GFP_ATOMIC when refilling the per-CPU magazines during
context switch, which is necessary to avoid deadlock. This of course
raises concerns about allocation failure. If a magazine were depleted,
the refill then failed because the atomic reserves were exhausted, and
another thread then triggered a dynamic stack fault, the result would be
a fatal page fault. There is also a secondary concern that additional
pressure on the memory reserves could cause allocation failures at other
atomic call sites.

The question is then: is this approach something that is fundamentally
untenable in the kernel, or are there compromises that would allow it to
be merged? One obvious compromise is to make the feature optional. Both
kernel stack faults and running out of memory reserves are rare events.
I've never seen this failure in my testing, although I don't have field
data to back that up at this point. Some sysadmins may view it as low
enough risk to be worth the memory savings. There are also additional
measures that could be taken to reduce the likelihood of failure (e.g.
magazine management on kernel entry/exit, tunable magazine sizes, adding
best-effort trylock reclaim or OOM kill).

This series was developed and tested on devices running 6.18 kernels. It
has been rebased onto 7.0, with minimal smoke testing after rebasing.
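
To make the failure mode described above concrete, here is a minimal
userspace sketch of the magazine behavior. It is not code from this
series: struct reserve, struct magazine, MAG_SIZE, alloc_page_atomic(),
magazine_refill(), magazine_take() and faults_served() are all
hypothetical names, the reserve is a plain counter standing in for the
atomic reserves, and the page frame numbers are fake. It only
illustrates that the fault path consumes pre-filled pages without
allocating, and that a short refill is survivable until the magazine
actually runs dry.

```c
#include <assert.h>
#include <stdbool.h>

#define MAG_SIZE 3                /* hypothetical magazine capacity */

/* Toy stand-in for the atomic reserves: a fixed budget of pages. */
struct reserve {
	unsigned long next_pfn;   /* next fake page frame number to hand out */
	unsigned int remaining;   /* pages left before allocations fail */
};

/* Toy per-CPU magazine of pre-allocated stack pages. */
struct magazine {
	unsigned long pfn[MAG_SIZE];
	unsigned int count;
};

/* Models a non-sleeping allocation: fails immediately instead of waiting. */
static bool alloc_page_atomic(struct reserve *r, unsigned long *pfn)
{
	if (r->remaining == 0)
		return false;
	r->remaining--;
	*pfn = r->next_pfn++;
	return true;
}

/* Context-switch path: top up the magazine; a short refill is tolerated. */
static void magazine_refill(struct magazine *m, struct reserve *r)
{
	assert(m->count <= MAG_SIZE);
	while (m->count < MAG_SIZE) {
		if (!alloc_page_atomic(r, &m->pfn[m->count]))
			break;
		m->count++;
	}
}

/* Stack-fault path: may only consume pre-filled pages, never allocate. */
static bool magazine_take(struct magazine *m, unsigned long *pfn)
{
	if (m->count == 0)
		return false;     /* corresponds to the fatal page fault case */
	*pfn = m->pfn[--m->count];
	return true;
}

/*
 * Count how many stack faults are served before the magazine runs dry,
 * attempting a refill after each fault as a context switch would.
 */
static unsigned int faults_served(struct magazine *m, struct reserve *r,
				  unsigned int faults)
{
	unsigned int served = 0;
	unsigned long pfn;

	magazine_refill(m, r);
	while (served < faults && magazine_take(m, &pfn)) {
		served++;
		magazine_refill(m, r);
	}
	return served;
}
```

In this model, a reserve of 4 pages with MAG_SIZE of 3 serves four
faults; a fifth fault would find the magazine empty, which is the fatal
case discussed above. With a reserve that never runs out, every fault is
served and the magazine is back at full capacity afterwards.
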

[1] https://lore.kernel.org/all/20240311164638.2015063-1-pasha.tatashin@soleen.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-integrator.git/log/?h=b4/aarch64-dynamic-kernel-stacks-v6.18-rc1

David Stevens (7):
  fork: Don't assume fully populated stack during reuse
  fork: Move vm_stack to the beginning of the stack
  fork: Move vmap stack freeing to work queue
  fork: Store task pointer in unpopulated stack ptes
  x86/entry/fred: encode frame pointer on entry
  x86: Add support for dynamic kernel stacks via FRED
  x86: Add support for dynamic kernel stacks via IST

Pasha Tatashin (6):
  fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
  fork: separate vmap stack allocation and free calls
  mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public functions
  fork: Dynamic Kernel Stacks
  task_stack.h: Add stack_not_used() support for dynamic stack
  fork: Dynamic Kernel Stack accounting

 arch/Kconfig                          |  38 ++
 arch/x86/Kconfig                      |   1 +
 arch/x86/entry/entry_64.S             |  49 ++-
 arch/x86/entry/entry_64_fred.S        |  57 +++
 arch/x86/include/asm/cpu_entry_area.h |  18 +
 arch/x86/include/asm/idtentry.h       |  38 +-
 arch/x86/include/asm/page_64_types.h  |  10 +-
 arch/x86/include/asm/pgtable_64.h     |  36 ++
 arch/x86/include/asm/processor.h      |   6 +
 arch/x86/include/asm/traps.h          |   5 +
 arch/x86/kernel/cpu/common.c          |  11 +
 arch/x86/kernel/dumpstack_64.c        |  10 +-
 arch/x86/kernel/fred.c                |  20 +-
 arch/x86/kernel/idt.c                 |  57 +--
 arch/x86/kernel/nmi.c                 |   9 +
 arch/x86/lib/usercopy.c               |   9 +
 arch/x86/mm/cpu_entry_area.c          |  17 +
 arch/x86/mm/dump_pagetables.c         |  14 +-
 arch/x86/mm/fault.c                   | 101 +++++-
 include/linux/mmzone.h                |   3 +
 include/linux/sched.h                 |  11 +-
 include/linux/sched/task_stack.h      |  48 ++-
 include/linux/vmalloc.h               |  14 +
 init/init_task.c                      |   4 +
 kernel/exit.c                         |  22 ++
 kernel/fork.c                         | 481 ++++++++++++++++++++++++--
 kernel/sched/core.c                   |   1 +
 mm/memcontrol.c                       |  10 +
 mm/vmalloc.c                          |  27 +-
 mm/vmstat.c                           |   3 +
 30 files changed, 1049 insertions(+), 81 deletions(-)

base-commit: 028ef9c96e96197026887c0f092424679298aae8
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog