From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Zach O'Keefe <zokeefe@google.com>, Thomas Gleixner <tglx@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>,
"H. Peter Anvin" <hpa@zytor.com>,
David Stevens <stevensd@google.com>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Linus Walleij <linus.walleij@linaro.org>,
Will Deacon <willdeacon@google.com>,
Quentin Perret <qperret@google.com>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
x86@kernel.org, Andy Lutomirski <luto@kernel.org>,
Xin Li <xin@zytor.com>, Peter Zijlstra <peterz@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Uladzislau Rezki <urezki@gmail.com>, Kees Cook <kees@kernel.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Date: Tue, 23 Jun 2026 09:50:00 +0200
Message-ID: <d4926c7a-32e4-498d-be6c-ab2969c8f672@kernel.org>
In-Reply-To: <CAAa6QmSeq8bbckyJk_5HFagsHfS5SXbG4y6Y-Py66eYLgvjcUg@mail.gmail.com>
On 6/23/26 01:00, Zach O'Keefe wrote:
> On Sat, Jun 20, 2026 at 4:34 PM Thomas Gleixner <tglx@kernel.org> wrote:
>
> Thomas, thanks for taking the time, as always, for such a thoughtful response.
>
>> On Sat, Jun 20 2026 at 12:33, Zach O'Keefe wrote:
>>>
>>> Ya, that's my concern as well, as I don't have a good intuition for
>>> how perf critical kernel #PF is for real workloads. If this is your
>>> primary concern, I'll take that as a _good_ thing; i.e. there's
>>> nothing architecturally stopping us from doing this downgrade safely.
>>> We'll still need the analysis, but that can be a later stage -- we're
>>> more than happy to get this data for all.
>>
>> No. That's not a later stage optional requirement.
>>
>> You have a PoC which works for you otherwise you wouldn't have posted
>> it. So you can trivially microbenchmark the costs of the
>> up/downgrade. And that's critical information for us but also for
>> you. If the costs are significant then you really have to think about
>> the tradeoffs.
>>
>> Care to read Documentation/process/* carefully? It applies to you as it
>> applies to anyone else.
>>
>>>
>>> This is actually the most understood aspect. With O(100B) active tasks
>>> fleetwide at any point, it only takes an average savings of O(10KiB)
>>> per task to get to 1PiB. At least for our fleet, we know the % of
>>> tasks that use only 4KiB, 8KiB, or require the full 16KiB, and the
>>> math confirms that we expect O(PiB) aggregate savings. The % of stacks
>>> requiring the full 16KiB is minuscule, but it still occurs at a rate
>>> higher than what we can tolerate for SO panics. Given the vast
>>> majority of stacks never exceed the first 4KiB, this enables the
>>> significant opportunity.
>>
>> I know that the potential savings are well understood and my
>> understanding of math is sufficient to calculate how many tasks and
>> average saving it takes to save 1PiB on a fleet.
>>
>> That's a no-brainer, but this is an aggregate saving, which sounds WOW
>> but does not tell much about anything else.
>>
>> 1) What's the actual percentage of savings in relation to the overall
>> memory?
>>
>> 2) Does the saving allow you to get more stuff done on a machine, pack
>> more threads on it?
>>
>> 3) Can you actually downsize the memory on the machines?
>>
>> 4) What is the performance tradeoff for that?
>>
>> IOW, you fail to tell what the actual benefit of such an intrusive
>> change is. Just boasting an aggregate Petabyte number does not tell
>> anything at all.
>>
>> Let me give you a trivial example with a scenario which I have access
>> to:
>>
>> 256 CPUs
>> 256 GiB Memory
>> 64k Threads
>>
>> Let's assume the full saving of 12k per thread. That sums up to
>>
>> 64k * 12k = 768MB of memory
>>
>> which is 0.29% of the total 256 GiB of memory. Not so impressive as the
>> petabyte aggregate number, right?
>>
>> The workload consumes about 80% of the overall memory and is already
>> constrained at close to 100% CPU utilization.
>>
>> Now let's assume that the runtime overhead of this amounts to 1% then
>> this is a net loss.
>>
>> Let me turn that around and use a made up example assuming the 1Mio
>> threads per compute unit taken from some reply in this thread.
>>
>> Now the full saving of 12k per thread amounts to:
>>
>> 1M * 12k = 12G
>>
>> which is 4.7% of the overall available memory. Agreed that's a
>> substantial number.
>>
>> That 12G saving does not do anything in terms of hardware downsizing.
>>
>> The only way that has a benefit is when the system is constrained by
>> overall memory consumption, but has quite some compute capacity left.
>>
>> IOW, if 1M threads hit the memory limit that means that the savings in
>> kernel stack consumed memory allows you to add about 4% (~40k) more
>> threads. If that ups the CPU utilization accordingly then yes, I can see
>> the benefit. But TBH, if that's the case then you are trying to fix a
>> user space implementation problem in the kernel.
>>
>> That said you really have to describe the scenarios where there is a
>> benefit and I do not buy this "fleet level" argument at all because
>> there is no single fleet which has a uniform workload distribution.
>
> These are good thoughts, thank you. Perhaps I've been too biased by
> our particular environment—apologies for that.
>
> We (mostly) punt this problem to cluster-level scheduling, which
> ironically exploits this non-uniformity of workload dynamics to
> appropriately bin-pack machines and materialize these small savings.
>
> In the general case, I guess a lot hinges on that overhead cost -- in
> the best (memory-constrained) case.
>
>> Aside of that. If your argument holds that there are only a few
>> scenarios which require a deep stack, then we are better off to identify
>> them and fix them up rather than trying to hack around the occasional
>> insanity of deep stack usage by adding complexity for complexity's sake.
>>
>> As you say that you have numbers of your fleet which confirm that the
>> vast majority of the stack depth is below 4k, you can surely figure out
>> the information which call chains are actually exceeding the limit.
>>
>> I prefer to fix such shitty code and downgrade the stacksize in general
>> instead of papering over the underlying issues which probably have been
>> ignored for years if not decades.
>>
>> Have you ever thought about that instead of adding complexity with a
>> dubious value?
There was some (hallway?) talk at LSF/MM about possibly removing direct reclaim,
similar to how other operating systems handle it. Now, I don't know how feasible
it is (I guess the devil is in the details ;) ), or any details of how that
would work, but direct reclaim was repeatedly called out as one of the main
reasons we can get huge stacks.

So I guess direct reclaim (incl. compaction) is one of the main problematic
pieces. Are we aware of other scenarios where we (easily) trigger consumption of
larger stacks?

Wild idea: as a first step to test the waters, use smaller stacks on selected
kernel threads and disallow direct reclaim/compaction if the stack for the
thread is small?
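
For reference, the per-machine numbers quoted above are easy to reproduce.
A rough sketch, assuming "k" and "M" mean 2^10 and 2^20, that the second
example also has 256 GiB of memory (the 4.7% figure implies that), and that
"O(100B) tasks" and "O(10KiB)" are taken literally as 100e9 and 10 KiB. The
helper name is made up for illustration:

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
PIB = 1024 * 1024 * GIB


def stack_savings(threads, saved_per_thread, total_memory):
    """Return (bytes saved, fraction of total memory) for one machine."""
    saved = threads * saved_per_thread
    return saved, saved / total_memory


# First scenario: 64k threads, 12 KiB saved per thread, 256 GiB of memory.
saved_64k, frac_64k = stack_savings(64 * KIB, 12 * KIB, 256 * GIB)
print(f"{saved_64k / MIB:.0f} MiB, {frac_64k:.2%}")   # 768 MiB, 0.29%

# Second scenario: 1M threads, same per-thread saving and memory size.
saved_1m, frac_1m = stack_savings(1024 * KIB, 12 * KIB, 256 * GIB)
print(f"{saved_1m / GIB:.0f} GiB, {frac_1m:.2%}")     # 12 GiB, 4.69%

# Fleet-wide aggregate: 100e9 tasks saving 10 KiB each, expressed in PiB.
fleet_pib = 100e9 * 10 * KIB / PIB
print(f"{fleet_pib:.2f} PiB")                         # 0.91 PiB
```

So the same 12 KiB per thread is anywhere between noise and a few percent
of a machine, depending purely on the thread count per machine.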
--
Cheers,
David