From: Thomas Gleixner <tglx@kernel.org>
To: Dave Hansen <dave.hansen@intel.com>, Zach O'Keefe <zokeefe@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
David Stevens <stevensd@google.com>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Linus Walleij <linus.walleij@linaro.org>,
Will Deacon <willdeacon@google.com>,
Quentin Perret <qperret@google.com>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
x86@kernel.org, Andy Lutomirski <luto@kernel.org>,
Xin Li <xin@zytor.com>, Peter Zijlstra <peterz@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Uladzislau Rezki <urezki@gmail.com>, Kees Cook <kees@kernel.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Date: Fri, 19 Jun 2026 14:45:31 +0200
Message-ID: <87pl1md7h0.ffs@fw13>
In-Reply-To: <c070c4d6-a570-4eea-aca0-72eed319a198@intel.com>

On Thu, Jun 18 2026 at 11:53, Dave Hansen wrote:
> On 6/18/26 07:50, Zach O'Keefe wrote:
>> Overall, are there any particular painpoints you'd like to see flushed
>> out, first?
>
> Handling exceptions in the kernel is hard. Period. That's the pain point.
> Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
> we've moved away from ever taking random page faults in the kernel. Or,
> heck, randomly taking faults at *all*. We've concentrated them in very
> specific places, not in general code.
>
> Now you're arguing that the kernel can pretty much take a fault *AND*
> allocate memory reliably at any point*.
>
> I just don't see the collateral in this series to justify that claim.

There is none because it's simply impossible to guarantee and when
reading through the series even a CPU hotplug operation happily
continues with success when the stack page cache of the upcoming CPU
can't be filled....

> The NMI entry code is a disaster because NMIs can happen anywhere. The
> #VC code is a disaster because #VCs can happen anywhere. Once #PF can
> happen anywhere*, why won't #PF become a disaster?

It's already a disaster. See kvm_handle_async_pf() and the cute issues
vs. taking a #PF in NMI or some other IST handler.

> It would be a completely different story if there was a track record of
> finding and fixing bugs in the x86 entry code from the authors of this
> series. But I don't think I've ever seen a single email from your folks
> before this, much less a review tag or a patch. I'd be much happier if
> you got Andy L's blessing on this, for example.
>
>> How would you like to proceed? Would explicitly marking this as an
>> experimental config, in the interim, be more attractive?
> No.
>
> The enemy here is complexity. *Maintenance* complexity. Being able to
> compile out some of the complexity helps with debugging. But it doesn't
> help maintaining the code.

Correct.

Aside of that the part which worries me most is the IDT hackery. That's
fragile as hell and full of unvalidated assumptions. Reading "should not
happen" several times in a changelog doesn't make me more confident.

"It is possible for #MCE to occur on the #PF IST stack, but the #MCE
handler shouldn't generate new #PFs. The reentrancy check on the #PF
stack will trigger if any recoverable #MCEs do generate #PFs - if there
are actually reports of it happening, we can address it then."

Seriously?

We don't wait until the report comes in because the report won't even
happen in the worst case:

    #PF on IST
    ...
    cmp 0, reentrance
    jne abort
        #MC
        ...
            #PF rewinds #PF IST
            cmp 0, reentrance
            jne abort       <- Not taken because #MC happened before
                               it could be set.

IST is fundamentally not suitable for this and I'm sure there are more
holes in this.

I haven't looked at the FRED side of affairs yet in detail, but the
handwavy explanation about external interrupts having to be moved to
stack level 1 and unconditionally bounced back does not really make it
appealing. I agree that chapter 8.3.4 in the SDM volume 3 is not really
helpful, but papering over the problem without understanding the root
cause is not cutting it. If it's a genuine FRED hardware issue, then
this needs to be understood and documented.

The x86 folks have spent a lot of time to make the horrific x86
interrupt and exception handling solid and therefore have zero interest
to deal with the fallout of something based on "shouldn't happen"
assumptions. Either it can prove correctness under all circumstances or
not.

I understand the save tons of memory across a fleet argument, but a
large fleet is also a guarantee to trigger all the "should not happen
and improbable" issues which are gracefully handwaved away. That's a
truly bad tradeoff as it ends up in non-decodable bug reports. What's
worse they have to be handled by the maintainers and not necessarily by
those who implemented it.

Thanks,
tglx
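
The window in the trace above, as an added example that is not part of the original message or of the patch series: a constructed C model in which IST delivery always rewinds to the same stack top and the reentrancy flag is checked before it is set. The names `cpu`, `pf_entry`, `mc_entry`, `run` and the addresses are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Constructed model, not kernel code: all names and values are hypothetical. */
enum mc_point { MC_NEVER, MC_IN_WINDOW, MC_AFTER_FLAG_SET };
enum outcome { OK, DETECTED, SILENT_CORRUPTION };

struct cpu {
    uint64_t ist_frame; /* saved return IP at the fixed top of the #PF IST stack */
    bool reentrance;    /* software reentrancy flag */
    bool aborted;       /* set when the reentrancy check fires */
};

static void pf_entry(struct cpu *c, uint64_t fault_ip, enum mc_point mc);

/* #MC handler that itself takes a #PF: the "shouldn't happen" case. */
static void mc_entry(struct cpu *c) { pf_entry(c, 0xbad, MC_NEVER); }

static void pf_entry(struct cpu *c, uint64_t fault_ip, enum mc_point mc)
{
    c->ist_frame = fault_ip;      /* IST delivery rewinds to the same stack top */
    if (c->reentrance) {          /* cmp 0, reentrance; jne abort */
        c->aborted = true;
        return;
    }
    if (mc == MC_IN_WINDOW)       /* #MC after the check, before the flag is set */
        mc_entry(c);
    c->reentrance = true;
    if (mc == MC_AFTER_FLAG_SET)  /* #MC once the flag is visible */
        mc_entry(c);
    c->reentrance = false;        /* handler done, about to return via ist_frame */
}

enum outcome run(enum mc_point mc)
{
    struct cpu c = {0};
    const uint64_t ip = 0x1000;   /* constructed address of the faulting code */
    pf_entry(&c, ip, mc);
    if (c.aborted)
        return DETECTED;
    return c.ist_frame == ip ? OK : SILENT_CORRUPTION;
}

int main(void)
{
    printf("%d %d %d\n", run(MC_NEVER), run(MC_IN_WINDOW), run(MC_AFTER_FLAG_SET));
    return 0;
}
```

In this model the check only fires when the #MC arrives after the flag is set; in the window case the outer handler returns through an overwritten frame and nothing is reported.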