Re: [PATCH v2 00/13] Dynamic Kernel Stacks

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Zach O'Keefe <zokeefe@google.com>, Thomas Gleixner <tglx@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	David Stevens <stevensd@google.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Linus Walleij <linus.walleij@linaro.org>,
	Will Deacon <willdeacon@google.com>,
	Quentin Perret <qperret@google.com>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org, Andy Lutomirski <luto@kernel.org>,
	Xin Li <xin@zytor.com>, Peter Zijlstra <peterz@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Uladzislau Rezki <urezki@gmail.com>, Kees Cook <kees@kernel.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Date: Tue, 23 Jun 2026 09:50:00 +0200	[thread overview]
Message-ID: <d4926c7a-32e4-498d-be6c-ab2969c8f672@kernel.org> (raw)
In-Reply-To: <CAAa6QmSeq8bbckyJk_5HFagsHfS5SXbG4y6Y-Py66eYLgvjcUg@mail.gmail.com>

On 6/23/26 01:00, Zach O'Keefe wrote:
> On Sat, Jun 20, 2026 at 4:34 PM Thomas Gleixner <tglx@kernel.org> wrote:
> 
> Thomas, thanks for taking the time, as always, for such a thoughtful response.
> 
>> On Sat, Jun 20 2026 at 12:33, Zach O'Keefe wrote:
>>>
>>> Ya, that's my concern as well, as I don't have a good intuition for
>>> how perf critical kernel #PF is for real workloads. If this is your
>>> primary concern, I'll take that as a _good_ thing ; i.e. there's
>>> nothing architecturally stopping us from doing this downgrade safely.
>>> We'll still need the analysis, but that can be a later stage -- we're
>>> more than happy to get this data for all.
>>
>> No. That's not a later stage optional requirement.
>>
>> You have a PoC which works for you otherwise you wouldn't have posted
>> it. So you can trivially microbenchmark the costs of the
>> up/downgrade. And that's critical information for us but also for
>> you. If the costs are significant then you really have to think about
>> the tradeoffs.
>>
>> Care to read Documentation/process/* carefully? It applies to you as it
>> applies to anyone else.
>>
>>>
>>> This is actually the most understood aspect. With O(100B) active tasks
>>> fleetwide at any point, it only takes an average savings of O(10KiB)
>>> per task to get to 1PiB. At least for our fleet, we know the % of
>>> tasks that use only 4KiB, 8KiB, or require the full 16KiB, and the
>>> math confirms that we expect O(PiB) aggregate savings. The % of stacks
>>> requiring the full 16KiB is minuscule, but it still occurs at a rate
>>> higher than what we can tolerate for SO panics. Given the vast
>>> majority of stacks never exceed the first 4KiB, this enables the
>>> significant opportunity.
>>
>> I know that the potential savings are well understood and my
>> understanding of math is sufficient to calculate how much tasks and
>> average saving it takes to save 1PiB on a fleet.
>>
>> That's a no-brainer, but this is an aggregate saving, which sounds WOW
>> but does not tell much about anything else.
>>
>>  1) What's the actual percentage of savings in relation to the overall
>>     memory?
>>
>>  2) Does the saving allow you to get more stuff done on a machine, pack
>>     more threads on it?
>>
>>  3) Can you actually downsize the memory on the machines?
>>
>>  4) What is the performance tradeoff for that?
>>
>> IOW, you fail to tell what the actual benefit of such an intrusive
>> change is. Just boasting an aggregate Petabyte number does not tell
>> anything at all.
>>
>> Let me give you a trivial example with a scenario which I have access
>> to:
>>
>>     256  CPUs
>>     256  GiB Memory
>>     64k  Threads
>>
>> Let's assume the full saving of 12k per thread. That sums up to
>>
>>       64k * 12k = 768MB of memory
>>
>> which is 0.29% of the total 256 GiB of memory. Not so impressive as the
>> petabyte aggregate number, right?
>>
>> The workload consumes about 80% of the overall memory and is already
>> constraint on close to 100% CPU utilization.
>>
>> Now let's assume that the runtime overhead of this amounts to 1% then
>> this is a net loss.
>>
>> Let me turn that around and use a made up example assuming the 1Mio
>> threads per compute unit taken from some reply in this thread.
>>
>> Now the full saving of 12k per thread amounts to:
>>
>>     1M * 12k = 12G
>>
>> which is 4.7% of the overall available memory. Agreed that's a
>> substantial number.
>>
>> That 12G saving does not do anything in terms of hardware downsizing.
>>
>> The only way that has a benefit is when the system is constraint by
>> overall memory consumption, but has quite some compute capacity left.
>>
>> IOW, if 1M threads hit the memory limit that means that the savings in
>> kernel stack consumed memory allows you to add about 4% (~40k) more
>> threads. If that ups the CPU utilization accordingly then yes, I can see
>> the benefit. But TBH, if that's the case then you are trying to fix a
>> user space implementation problem in the kernel.
>>
>> That said you really have to describe the scenarios where there is a
>> benefit and I do not buy this "fleet level" argument at all because
>> there is no single fleet which has a uniform workload distribution.
> 
> These are good thoughts, thank you. Perhaps I've been too biased by
> our particular environment—apologies for that.
> 
> We (mostly) punt this problem to cluster-level scheduling, which
> ironically exploits this non-uniformity of workload dynamics to
> appropriately bin-pack machines and materialize these small savings.
> 
> In the general case, I guess a lot hinges on that overhead cost -- in
> the best (memory-constrained) case.
> 
>> Aside of that. If your argument holds that there are only a few
>> scenarios which require a deep stack, then we are better off to identify
>> them and fix them up rather than trying to hack around the occacional
>> insanity of deep stack usage by adding complexity for complexity sake.
>>
>> As you say that you have numbers of your fleet which confirm that the
>> vast majority of the stack depth is below 4k, you can surely figure out
>> the information which call chains are actually exceeding the limit.
>>
>> I prefer to fix such shitty code and downgrade the stacksize in general
>> instead of papering over the underlying issues which probably have been
>> ignored for years if not decades.
>>
>> Have you ever thought about that instead of adding complexity with a
>> dubious value?

There was some (hallway?) talk at LSF/MM about possibly removing direct reclaim,
similar to how other operating systems handle it. Now, I don't know how feasible
it is (I guess devil is in the detail ;) ), or any details how that would work,
but direct reclaim was repeatedly called out as one of the main reasons we can
get huge stacks.

So I guess direct reclaim (incl. compaction) is one of the main problematic
pieces. Are we aware of other scenarios where we (easily) trigger consumption of
larger stacks?

Wild idea: as a first step to test the waters, use smaller stacks on selected
kernel threads and disallow direct reclaim/compaction if the stack for the
thread is small?

-- 
Cheers,

David

next prev parent reply	other threads:[~2026-06-23  7:50 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
2026-04-24 19:14 ` [PATCH v2 01/13] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE David Stevens
2026-04-24 19:14 ` [PATCH v2 02/13] fork: Don't assume fully populated stack during reuse David Stevens
2026-04-24 19:14 ` [PATCH v2 03/13] fork: Move vm_stack to the beginning of the stack David Stevens
2026-04-24 19:14 ` [PATCH v2 04/13] fork: separate vmap stack allocation and free calls David Stevens
2026-04-24 19:14 ` [PATCH v2 05/13] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public functions David Stevens
2026-04-24 19:14 ` [PATCH v2 06/13] fork: Move vmap stack freeing to work queue David Stevens
2026-04-24 19:14 ` [PATCH v2 07/13] fork: Dynamic Kernel Stacks David Stevens
2026-04-24 19:14 ` [PATCH v2 08/13] task_stack.h: Add stack_not_used() support for dynamic stack David Stevens
2026-04-24 19:14 ` [PATCH v2 09/13] fork: Dynamic Kernel Stack accounting David Stevens
2026-04-24 19:14 ` [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes David Stevens
2026-04-24 19:14 ` [PATCH v2 11/13] x86/entry/fred: encode frame pointer on entry David Stevens
2026-04-24 19:14 ` [PATCH v2 12/13] x86: Add support for dynamic kernel stacks via FRED David Stevens
2026-04-24 19:14 ` [PATCH v2 13/13] x86: Add support for dynamic kernel stacks via IST David Stevens
2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
2026-04-24 21:35   ` Pasha Tatashin
2026-04-24 22:21     ` Dave Hansen
2026-04-24 22:49       ` David Stevens
2026-04-24 22:26     ` David Laight
2026-04-24 23:06       ` Pasha Tatashin
2026-06-19  0:29       ` Dave Hansen
2026-06-19 19:56         ` Zach O'Keefe
2026-06-20  5:25         ` David Stevens
2026-06-20 23:22           ` Dave Hansen
2026-04-25  9:19   ` H. Peter Anvin
2026-04-27 16:17     ` Dave Hansen
2026-06-18 14:50       ` Zach O'Keefe
2026-06-18 18:53         ` Dave Hansen
2026-06-18 22:28           ` H. Peter Anvin
2026-06-19  0:40             ` David Stevens
2026-06-19  0:44               ` H. Peter Anvin
2026-06-19 12:45           ` Thomas Gleixner
2026-06-19 19:20             ` Zach O'Keefe
2026-06-19 21:59               ` Thomas Gleixner
2026-06-20  5:02                 ` David Stevens
2026-06-20 21:59                   ` Thomas Gleixner
2026-06-20 19:33                 ` Zach O'Keefe
2026-06-20 19:44                   ` H. Peter Anvin
2026-06-20 20:01                     ` Zach O'Keefe
2026-06-20 23:34                   ` Thomas Gleixner
2026-06-22 23:00                     ` Zach O'Keefe
2026-06-23  7:50                       ` David Hildenbrand (Arm) [this message]
2026-06-23  9:10                         ` David Laight
2026-06-23  9:19                           ` David Hildenbrand (Arm)
2026-04-27 16:31     ` Pasha Tatashin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d4926c7a-32e4-498d-be6c-ab2969c8f672@kernel.org \
    --to=david@kernel.org \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=bp@alien8.de \
    --cc=dave.hansen@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=hpa@zytor.com \
    --cc=kees@kernel.org \
    --cc=linus.walleij@linaro.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=luto@kernel.org \
    --cc=mhocko@suse.com \
    --cc=mingo@redhat.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterz@infradead.org \
    --cc=qperret@google.com \
    --cc=rppt@kernel.org \
    --cc=stevensd@google.com \
    --cc=surenb@google.com \
    --cc=tglx@kernel.org \
    --cc=urezki@gmail.com \
    --cc=vbabka@kernel.org \
    --cc=willdeacon@google.com \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    --cc=xin@zytor.com \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox