From: Thomas Gleixner <tglx@kernel.org>
To: Zach O'Keefe <zokeefe@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>,
"H. Peter Anvin" <hpa@zytor.com>,
David Stevens <stevensd@google.com>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Linus Walleij <linus.walleij@linaro.org>,
Will Deacon <willdeacon@google.com>,
Quentin Perret <qperret@google.com>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
x86@kernel.org, Andy Lutomirski <luto@kernel.org>,
Xin Li <xin@zytor.com>, Peter Zijlstra <peterz@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Uladzislau Rezki <urezki@gmail.com>, Kees Cook <kees@kernel.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Date: Sun, 21 Jun 2026 01:34:47 +0200
Message-ID: <87mrwon5uw.ffs@fw13>
In-Reply-To: <CAAa6QmTO=hhdJQa-ofSZ6wW0geLaEfWZumF6KmksxZqM3i33OA@mail.gmail.com>
On Sat, Jun 20 2026 at 12:33, Zach O'Keefe wrote:
> On Fri, Jun 19, 2026 at 2:59 PM Thomas Gleixner <tglx@kernel.org> wrote:
>> The #PF path is considered performance critical. But how much the
>> downgrade matters needs actual numbers to analyze under various workload
>> scenarios.
>
> Ya, that's my concern as well, as I don't have a good intuition for
> how perf critical kernel #PF is for real workloads. If this is your
> primary concern, I'll take that as a _good_ thing; i.e. there's
> nothing architecturally stopping us from doing this downgrade safely.
> We'll still need the analysis, but that can be a later stage -- we're
> more than happy to get this data for all.

No. That's not an optional requirement for some later stage.

You have a PoC which works for you, otherwise you wouldn't have posted
it. So you can trivially microbenchmark the costs of the
up/downgrade. That's critical information for us, but also for
you. If the costs are significant then you really have to think about
the tradeoffs.

Care to read Documentation/process/* carefully? It applies to you as it
applies to anyone else.

>> I've not seen numbers to that effect anywhere. The only numbers provided
>> are marketing material about the memory savings on a freshly booted idle
>> machine. There are _zero_ numbers about the actual real world savings,
>> but claims about the PETABYTE savings possible.
>>
>> Seriously?
>
> This is actually the most understood aspect. With O(100B) active tasks
> fleetwide at any point, it only takes an average savings of O(10KiB)
> per task to get to 1PiB. At least for our fleet, we know the % of
> tasks that use only 4KiB, 8KiB, or require the full 16KiB, and the
> math confirms that we expect O(PiB) aggregate savings. The % of stacks
> requiring the full 16KiB is minuscule, but it still occurs at a rate
> higher than what we can tolerate for SO panics. Given the vast
> majority of stacks never exceed the first 4KiB, this enables the
> significant opportunity.

I know that the potential savings are well understood, and my
understanding of math is sufficient to calculate how many tasks and
what average saving it takes to save 1PiB on a fleet.
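
For reference, the aggregate arithmetic quoted above can be reproduced
in a few lines of Python. This is only a sketch: it reads "O(100B)" as
100 billion tasks and "O(10KiB)" as 10 KiB per task, and the variable
names are illustrative.

```python
KIB = 1024
PIB = 1024 ** 5

tasks = 100 * 10**9          # assumed reading of "O(100B) active tasks"
saving_per_task = 10 * KIB   # assumed reading of "O(10KiB) per task"

aggregate = tasks * saving_per_task   # bytes saved fleet-wide
aggregate_pib = aggregate / PIB

print(f"aggregate saving: {aggregate_pib:.2f} PiB")
```

The result is on the order of 1 PiB, but it is a fleet-wide sum and
carries no information about any single machine.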

That's a no-brainer, but this is an aggregate saving, which sounds WOW
but does not tell much about anything else.

  1) What's the actual percentage of savings in relation to the overall
     memory?

  2) Does the saving allow you to get more stuff done on a machine, pack
     more threads on it?

  3) Can you actually downsize the memory on the machines?

  4) What is the performance tradeoff for that?

IOW, you fail to tell what the actual benefit of such an intrusive
change is. Just boasting an aggregate Petabyte number does not tell
anything at all.

Let me give you a trivial example with a scenario which I have access
to:

    256 CPUs
    256 GiB Memory
    64k Threads

Let's assume the full saving of 12k per thread. That sums up to

    64k * 12k = 768 MiB of memory

which is 0.29% of the total 256 GiB of memory. Not as impressive as the
petabyte aggregate number, right?

The workload consumes about 80% of the overall memory and is already
constrained at close to 100% CPU utilization.

Now let's assume that the runtime overhead of this amounts to 1%; then
this is a net loss.
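
The numbers of this scenario can be checked with a short Python sketch.
The 1% overhead is the assumption made above, not a measurement, and
the variable names are illustrative.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

threads = 64 * 1024            # 64k threads
saving_per_thread = 12 * KIB   # best case: 16 KiB stack shrinks to 4 KiB
total_memory = 256 * GIB

saved = threads * saving_per_thread
saved_pct = 100.0 * saved / total_memory

# Assumed, not measured: runtime overhead of the stack up/downgrade.
overhead_pct = 1.0

print(f"saved: {saved // MIB} MiB = {saved_pct:.2f}% of total memory")
print(f"assumed runtime overhead: {overhead_pct:.2f}%")
```

With about 20% of the memory still free and the CPUs saturated, the
freed memory buys nothing, while the assumed overhead costs throughput.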

Let me turn that around and use a made-up example, assuming the 1M
threads per compute unit taken from some reply in this thread.

Now the full saving of 12k per thread amounts to:

    1M * 12k = 12 GiB

which is 4.7% of the overall available memory. Agreed, that's a
substantial number.

That 12 GiB saving does not do anything in terms of hardware downsizing.

The only way that has a benefit is when the system is constrained by
overall memory consumption, but has quite some compute capacity left.

IOW, if 1M threads hit the memory limit, that means that the savings in
kernel stack memory allow you to add about 4% (~40k) more
threads. If that ups the CPU utilization accordingly then yes, I can see
the benefit. But TBH, if that's the case then you are trying to fix a
user space implementation problem in the kernel.
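
A rough Python sketch of this second scenario follows. It assumes the
same 256 GiB machine as in the first example, and, as a deliberate
simplification, that the threads consume all of that memory in equal
shares. The variable names are illustrative.

```python
KIB = 1024
GIB = 1024 ** 3

threads = 1024 * 1024          # 1M threads
saving_per_thread = 12 * KIB   # best case: 16 KiB stack shrinks to 4 KiB
total_memory = 256 * GIB       # assumed: same machine as the first example

saved = threads * saving_per_thread
saved_pct = 100.0 * saved / total_memory

# Simplification: every thread accounts for an equal share of memory.
per_thread_before = total_memory // threads
per_thread_after = per_thread_before - saving_per_thread
extra_threads = saved // per_thread_after
extra_pct = 100.0 * extra_threads / threads

print(f"saved: {saved // GIB} GiB = {saved_pct:.1f}% of total memory")
print(f"headroom: ~{extra_threads} more threads ({extra_pct:.1f}%)")
```

Under these assumptions the headroom lands in the same ballpark as the
estimate above, a few percent more threads, and only if memory is the
limiting resource.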

That said, you really have to describe the scenarios where there is a
benefit, and I do not buy this "fleet level" argument at all because
there is no single fleet which has a uniform workload distribution.

Aside from that: if your argument holds that there are only a few
scenarios which require a deep stack, then we are better off identifying
them and fixing them up rather than trying to hack around the occasional
insanity of deep stack usage by adding complexity for complexity's sake.

As you say that you have numbers from your fleet which confirm that the
vast majority of the stack depth is below 4k, you can surely figure out
which call chains are actually exceeding the limit.

I prefer to fix such shitty code and reduce the stack size in general
instead of papering over the underlying issues which probably have been
ignored for years if not decades.

Have you ever thought about that instead of adding complexity with a
dubious value?

Thanks,
tglx