Re: [PATCH v2 00/13] Dynamic Kernel Stacks

From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Zach O'Keefe <zokeefe@google.com>, Thomas Gleixner <tglx@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	David Stevens <stevensd@google.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Linus Walleij <linus.walleij@linaro.org>,
	Will Deacon <willdeacon@google.com>,
	Quentin Perret <qperret@google.com>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org, Andy Lutomirski <luto@kernel.org>,
	Xin Li <xin@zytor.com>, Peter Zijlstra <peterz@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Uladzislau Rezki <urezki@gmail.com>, Kees Cook <kees@kernel.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Date: Tue, 23 Jun 2026 09:50:00 +0200
Message-ID: <d4926c7a-32e4-498d-be6c-ab2969c8f672@kernel.org>
In-Reply-To: <CAAa6QmSeq8bbckyJk_5HFagsHfS5SXbG4y6Y-Py66eYLgvjcUg@mail.gmail.com>

On 6/23/26 01:00, Zach O'Keefe wrote:
> On Sat, Jun 20, 2026 at 4:34 PM Thomas Gleixner <tglx@kernel.org> wrote:
> 
> Thomas, thanks for taking the time, as always, for such a thoughtful response.
> 
>> On Sat, Jun 20 2026 at 12:33, Zach O'Keefe wrote:
>>>
>>> Ya, that's my concern as well, as I don't have a good intuition for
>>> how perf critical kernel #PF is for real workloads. If this is your
>>> primary concern, I'll take that as a _good_ thing ; i.e. there's
>>> nothing architecturally stopping us from doing this downgrade safely.
>>> We'll still need the analysis, but that can be a later stage -- we're
>>> more than happy to get this data for all.
>>
>> No. That's not a later stage optional requirement.
>>
>> You have a PoC which works for you otherwise you wouldn't have posted
>> it. So you can trivially microbenchmark the costs of the
>> up/downgrade. And that's critical information for us but also for
>> you. If the costs are significant then you really have to think about
>> the tradeoffs.
>>
>> Care to read Documentation/process/* carefully? It applies to you as it
>> applies to anyone else.
>>
>>>
>>> This is actually the most understood aspect. With O(100B) active tasks
>>> fleetwide at any point, it only takes an average savings of O(10KiB)
>>> per task to get to 1PiB. At least for our fleet, we know the % of
>>> tasks that use only 4KiB, 8KiB, or require the full 16KiB, and the
>>> math confirms that we expect O(PiB) aggregate savings. The % of stacks
>>> requiring the full 16KiB is minuscule, but it still occurs at a rate
>>> higher than what we can tolerate for SO panics. Given the vast
>>> majority of stacks never exceed the first 4KiB, this enables the
>>> significant opportunity.
>>
>> I know that the potential savings are well understood and my
>> understanding of math is sufficient to calculate how many tasks and
>> what average saving it takes to save 1PiB on a fleet.
>>
>> That's a no-brainer, but this is an aggregate saving, which sounds WOW
>> but does not tell much about anything else.
>>
>>  1) What's the actual percentage of savings in relation to the overall
>>     memory?
>>
>>  2) Does the saving allow you to get more stuff done on a machine, pack
>>     more threads on it?
>>
>>  3) Can you actually downsize the memory on the machines?
>>
>>  4) What is the performance tradeoff for that?
>>
>> IOW, you fail to tell what the actual benefit of such an intrusive
>> change is. Just boasting an aggregate Petabyte number does not tell
>> anything at all.
>>
>> Let me give you a trivial example with a scenario which I have access
>> to:
>>
>>     256  CPUs
>>     256  GiB Memory
>>     64k  Threads
>>
>> Let's assume the full saving of 12k per thread. That sums up to
>>
>>       64k * 12k = 768MB of memory
>>
>> which is 0.29% of the total 256 GiB of memory. Not so impressive as the
>> petabyte aggregate number, right?
>>
>> The workload consumes about 80% of the overall memory and is already
>> constrained at close to 100% CPU utilization.
>>
>> Now let's assume that the runtime overhead of this amounts to 1% then
>> this is a net loss.
>>
>> Let me turn that around and use a made up example assuming the 1Mio
>> threads per compute unit taken from some reply in this thread.
>>
>> Now the full saving of 12k per thread amounts to:
>>
>>     1M * 12k = 12G
>>
>> which is 4.7% of the overall available memory. Agreed that's a
>> substantial number.
>>
>> That 12G saving does not do anything in terms of hardware downsizing.
>>
>> The only way that has a benefit is when the system is constrained by
>> overall memory consumption, but has quite some compute capacity left.
>>
>> IOW, if 1M threads hit the memory limit that means that the savings in
>> kernel stack consumed memory allows you to add about 4% (~40k) more
>> threads. If that ups the CPU utilization accordingly then yes, I can see
>> the benefit. But TBH, if that's the case then you are trying to fix a
>> user space implementation problem in the kernel.
>>
>> That said you really have to describe the scenarios where there is a
>> benefit and I do not buy this "fleet level" argument at all because
>> there is no single fleet which has a uniform workload distribution.
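For reference, the per-machine arithmetic in the two scenarios quoted above can be reproduced with a short script. This is only a sketch: it assumes "k" and "M" are binary multiples (1024 and 1024^2), that the second scenario also has 256 GiB of memory (which is what the quoted 4.7% implies), and that every thread saves the full 12 KiB. The helper name `stack_savings` is made up for illustration.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def stack_savings(threads, saved_per_thread_kib, total_mem_gib):
    """Return (bytes saved, fraction of total machine memory)."""
    saved = threads * saved_per_thread_kib * KIB
    return saved, saved / (total_mem_gib * GIB)


# Scenario 1: 64k threads, 12 KiB saved per thread, 256 GiB machine.
saved_1, frac_1 = stack_savings(64 * 1024, 12, 256)
print(f"{saved_1 // MIB} MiB saved, {frac_1:.2%} of memory")

# Scenario 2: 1M threads, same per-thread saving; 256 GiB is assumed.
saved_2, frac_2 = stack_savings(1024 * 1024, 12, 256)
print(f"{saved_2 // GIB} GiB saved, {frac_2:.2%} of memory")
```

Under these assumptions the script reports 768 MiB (0.29%) and 12 GiB (4.69%), in line with the quoted figures. It says nothing about the runtime overhead, which is the open question in this thread.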
> 
> These are good thoughts, thank you. Perhaps I've been too biased by
> our particular environment—apologies for that.
> 
> We (mostly) punt this problem to cluster-level scheduling, which
> ironically exploits this non-uniformity of workload dynamics to
> appropriately bin-pack machines and materialize these small savings.
> 
> In the general case, I guess a lot hinges on that overhead cost -- in
> the best (memory-constrained) case.
> 
>> Aside from that: if your argument holds that there are only a few
>> scenarios which require a deep stack, then we are better off identifying
>> them and fixing them up rather than trying to hack around the occasional
>> insanity of deep stack usage by adding complexity for complexity's sake.
>>
>> As you say that you have numbers of your fleet which confirm that the
>> vast majority of the stack depth is below 4k, you can surely figure out
>> the information which call chains are actually exceeding the limit.
>>
>> I prefer to fix such shitty code and downgrade the stacksize in general
>> instead of papering over the underlying issues which probably have been
>> ignored for years if not decades.
>>
>> Have you ever thought about that instead of adding complexity with a
>> dubious value?

There was some (hallway?) talk at LSF/MM about possibly removing direct reclaim,
similar to how other operating systems handle it. Now, I don't know how feasible
it is (I guess the devil is in the details ;) ), or any details of how that would
work, but direct reclaim was repeatedly called out as one of the main reasons we
can get huge stacks.

So I guess direct reclaim (incl. compaction) is one of the main problematic
pieces. Are we aware of other scenarios where we (easily) trigger consumption of
larger stacks?

Wild idea: as a first step to test the waters, use smaller stacks on selected
kernel threads and disallow direct reclaim/compaction if the stack for the
thread is small?

-- 
Cheers,

David
