Re: [PATCH v2 00/13] Dynamic Kernel Stacks

The Linux Kernel Mailing List

From: Thomas Gleixner <tglx@kernel.org>
To: Zach O'Keefe <zokeefe@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	David Stevens <stevensd@google.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Linus Walleij <linus.walleij@linaro.org>,
	Will Deacon <willdeacon@google.com>,
	Quentin Perret <qperret@google.com>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org, Andy Lutomirski <luto@kernel.org>,
	Xin Li <xin@zytor.com>, Peter Zijlstra <peterz@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Uladzislau Rezki <urezki@gmail.com>, Kees Cook <kees@kernel.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Date: Fri, 19 Jun 2026 23:59:19 +0200
Message-ID: <87qzm2b39k.ffs@fw13>
In-Reply-To: <CAAa6QmSHBDeY0G=_N1P4dAAH917J7jerfZrWDfDd8w=8jH8nVw@mail.gmail.com>

Zach!

On Fri, Jun 19 2026 at 12:20, Zach O'Keefe wrote:
> While it seems common opinion that the IST-based solution is fragile,
> what of FRED? It seems like this is exactly the kind of support needed
> to avoid some of the aforementioned sw "mess" in various x86 exception
> handling paths. I agree that it's less-than-ideal that we are forced
> to downgrade exception levels in the common #PF case, but is that an
> unsurmountable problem? Pardon my ignorance.

The #PF path is considered performance critical. But how much the
downgrade matters needs actual numbers, analyzed under various workload
scenarios.

I've not seen numbers to that effect anywhere. The only numbers provided
are marketing material about the memory savings on a freshly booted idle
machine. There are _zero_ numbers about the actual real world savings,
but claims about the PETABYTE savings possible.

Seriously?

> Lastly, I just want to clarify what folks have meant by "extraordinary
> claims" or "evidence".  Aside from the above discussion on FRED
> exception handling, the "only" other part of this is the allocation.

Clearly anything which is explained with "shouldn't happen" and
"unlikely". At cloud scale nothing is unlikely anymore. That's simply the
reality of statistical math.
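
For illustration, the statistical point can be put into numbers with a
minimal sketch. It assumes independent trials with a fixed per-trial
probability p, which is a simplification chosen here and not something
stated in the thread. The function names and the example values are
hypothetical. The probability of at least one occurrence in n trials is
1 - (1 - p)^n:

```c
#include <assert.h>
#include <stdint.h>

/* base^exp by repeated squaring, so the sketch does not need libm */
double pow_u64(double base, uint64_t exp)
{
	double result = 1.0;

	while (exp) {
		if (exp & 1)
			result *= base;
		base *= base;
		exp >>= 1;
	}
	return result;
}

/*
 * Probability that an event with per-trial probability p happens at
 * least once in 'trials' independent trials: 1 - (1 - p)^trials.
 * Independence is an assumption of this sketch.
 */
double prob_at_least_once(double p, uint64_t trials)
{
	assert(p >= 0.0 && p <= 1.0);
	return 1.0 - pow_u64(1.0 - p, trials);
}
```

With an assumed one-in-a-billion event, prob_at_least_once(1e-9,
1000000000ULL) is roughly 0.63, and with ten times as many trials it
exceeds 0.9999. The result is approximate for very small p because 1 - p
is rounded to double precision before it is raised to the power.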

As I pointed out before, the same applies to the unexplained
upgrade/downgrade game with external interrupts. Such issues cannot be
papered over without understanding the root cause, because decades of
experience show that they inevitably come back some time down the road.
Cloud scale even guarantees that.

> Are people concerned about memory unavailability, deadlocking-type
> issues, or something else? We have considerable design freedom here to
> avoid certain classes of unreliability, but—barring any clever
> tricks—I don't know if the allocation can be guaranteed to succeed in
> all conceivable circumstances. I want to ensure that reality does not
> present a hard blocker.

First of all the failure scenario has to be clearly defined.

Right now, if I'm reading the patches correctly this simply can end up
killing the wrong tasks/processes just because an OOM situation results
in a depletion of the per CPU cache and the very wrong task which runs
into the deep call stack situation ends up in the creek without a paddle.
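
The failure mode described above can be illustrated with a toy userspace
model. This is a sketch under stated assumptions and is not derived from
the patch code: the cache capacity, the all-or-nothing refill and every
name in it (cpu_cache, cache_refill, stack_fault, first_victim) are
hypothetical. It models one CPU whose page cache is refilled before each
task runs, with the refill failing while the system is out of memory:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_PAGES 2	/* hypothetical per-CPU cache capacity */

struct cpu_cache {
	int pages;	/* pre-allocated stack pages currently available */
};

/* Refill to capacity, unless the (modeled) system is out of memory */
void cache_refill(struct cpu_cache *c, bool oom)
{
	if (!oom)
		c->pages = CACHE_PAGES;
}

/* Take one page for a stack fault; false means the fault cannot be served */
bool stack_fault(struct cpu_cache *c)
{
	if (c->pages == 0)
		return false;
	c->pages--;
	return true;
}

/*
 * Run the tasks in order on one CPU. faults[i] is the number of stack
 * pages task i faults in. Returns the index of the first task whose
 * fault cannot be served, or -1 if all faults are served.
 */
int first_victim(const int *faults, size_t ntasks, bool oom)
{
	struct cpu_cache cache = { .pages = CACHE_PAGES };

	for (size_t i = 0; i < ntasks; i++) {
		cache_refill(&cache, oom);
		for (int f = 0; f < faults[i]; f++) {
			if (!stack_fault(&cache))
				return (int)i;
		}
	}
	return -1;
}
```

With three tasks that each fault in a single page, first_victim()
returns -1 when refills succeed and 2 when they fail. In this model the
third task behaves exactly like the first two and is hit only because it
ran after the cache was drained.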

That you even fail to abort a CPU bringup when the allocation of the
per CPU stack page cache fails makes it pretty clear that exactly zero
thought has been spent on this problem.

Why the heck does this cache refill call have to be unconditionally in
__schedule() where preemption is disabled and therefore GFP_ATOMIC
is mandatory? I know "Works for me" (most of the time).

And just because I was looking at the patch in question I found this
other insanity:

> +	/*
> +	 * Most likely we faulted in the page right next to the last mapped
> +	 * page in the stack, however, it is possible (but very unlikely) that
> +	 * the faulted page is actually skips some pages in the stack. Make sure
> +	 * we do not create  more than one holes in the stack, and map every
> +	 * page between the current fault  address and the last page that is
> +	 * mapped in the stack.
> +	 */

Can anyone with a sane mind and the most minimal understanding of the
kernel's inner workings explain to me how the kernel can skip "some
pages" on the stack?

If the kernel skips a whole page or more then there is a serious bug
somewhere. I might be missing something, but again the "very unlikely"
wording which handwaves about it is just disgustingly useless.

I disagree with Dave on the RFC status of this series. It's not even
close to RFC, it's at PoC status.

Thanks,

        tglx


Thread overview: 44+ messages
2026-04-24 19:14 [PATCH v2 00/13] Dynamic Kernel Stacks David Stevens
2026-04-24 19:14 ` [PATCH v2 01/13] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE David Stevens
2026-04-24 19:14 ` [PATCH v2 02/13] fork: Don't assume fully populated stack during reuse David Stevens
2026-04-24 19:14 ` [PATCH v2 03/13] fork: Move vm_stack to the beginning of the stack David Stevens
2026-04-24 19:14 ` [PATCH v2 04/13] fork: separate vmap stack allocation and free calls David Stevens
2026-04-24 19:14 ` [PATCH v2 05/13] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public functions David Stevens
2026-04-24 19:14 ` [PATCH v2 06/13] fork: Move vmap stack freeing to work queue David Stevens
2026-04-24 19:14 ` [PATCH v2 07/13] fork: Dynamic Kernel Stacks David Stevens
2026-04-24 19:14 ` [PATCH v2 08/13] task_stack.h: Add stack_not_used() support for dynamic stack David Stevens
2026-04-24 19:14 ` [PATCH v2 09/13] fork: Dynamic Kernel Stack accounting David Stevens
2026-04-24 19:14 ` [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes David Stevens
2026-04-24 19:14 ` [PATCH v2 11/13] x86/entry/fred: encode frame pointer on entry David Stevens
2026-05-20 22:24   ` David Stevens
2026-05-22 22:25     ` H. Peter Anvin
2026-05-24 18:22       ` Xin Li
2026-04-24 19:14 ` [PATCH v2 12/13] x86: Add support for dynamic kernel stacks via FRED David Stevens
2026-04-24 19:14 ` [PATCH v2 13/13] x86: Add support for dynamic kernel stacks via IST David Stevens
2026-04-24 19:41 ` [PATCH v2 00/13] Dynamic Kernel Stacks Dave Hansen
2026-04-24 21:35   ` Pasha Tatashin
2026-04-24 22:21     ` Dave Hansen
2026-04-24 22:49       ` David Stevens
2026-04-24 22:26     ` David Laight
2026-04-24 23:06       ` Pasha Tatashin
2026-06-19  0:29       ` Dave Hansen
2026-06-19 19:56         ` Zach O'Keefe
2026-06-20  5:25         ` David Stevens
2026-06-20 23:22           ` Dave Hansen
2026-04-25  9:19   ` H. Peter Anvin
2026-04-27 16:17     ` Dave Hansen
2026-06-18 14:50       ` Zach O'Keefe
2026-06-18 18:53         ` Dave Hansen
2026-06-18 22:28           ` H. Peter Anvin
2026-06-19  0:40             ` David Stevens
2026-06-19  0:44               ` H. Peter Anvin
2026-06-19 12:45           ` Thomas Gleixner
2026-06-19 19:20             ` Zach O'Keefe
2026-06-19 21:59               ` Thomas Gleixner [this message]
2026-06-20  5:02                 ` David Stevens
2026-06-20 21:59                   ` Thomas Gleixner
2026-06-20 19:33                 ` Zach O'Keefe
2026-06-20 19:44                   ` H. Peter Anvin
2026-06-20 20:01                     ` Zach O'Keefe
2026-06-20 23:34                   ` Thomas Gleixner
2026-04-27 16:31     ` Pasha Tatashin
