From mboxrd@z Thu Jan  1 00:00:00 1970
From: Thomas Gleixner <tglx@kernel.org>
To: Dave Hansen <dave.hansen@intel.com>, Zach O'Keefe <zokeefe@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>, David Stevens <stevensd@google.com>,
 Pasha Tatashin <pasha.tatashin@soleen.com>, Linus Walleij
 <linus.walleij@linaro.org>, Will Deacon <willdeacon@google.com>, Quentin
 Perret <qperret@google.com>, Ingo Molnar <mingo@redhat.com>, Borislav
 Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>,
 x86@kernel.org, Andy Lutomirski <luto@kernel.org>, Xin Li <xin@zytor.com>,
 Peter Zijlstra <peterz@infradead.org>, Andrew Morton
 <akpm@linux-foundation.org>, David Hildenbrand <david@kernel.org>, Lorenzo
 Stoakes <ljs@kernel.org>, "Liam R. Howlett" <Liam.Howlett@oracle.com>,
 Vlastimil Babka <vbabka@kernel.org>, Mike Rapoport <rppt@kernel.org>,
 Suren Baghdasaryan <surenb@google.com>, Michal Hocko <mhocko@suse.com>,
 Uladzislau Rezki <urezki@gmail.com>, Kees Cook <kees@kernel.org>,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 00/13] Dynamic Kernel Stacks
In-Reply-To: <c070c4d6-a570-4eea-aca0-72eed319a198@intel.com>
References: <20260424191456.2679717-1-stevensd@google.com>
 <da9321ad-4198-494e-b9fa-30d69bd29be3@intel.com>
 <6369e5ce-74e3-4c68-8053-d7d7d21b6955@zytor.com>
 <dbeeea58-16cb-4383-b8e8-91a8ca84e88a@intel.com>
 <CAAa6QmRw6QLnVJ8+uvMV8ASreLXzSab5Jii3Ju11qCZYio6Few@mail.gmail.com>
 <c070c4d6-a570-4eea-aca0-72eed319a198@intel.com>
Date: Fri, 19 Jun 2026 14:45:31 +0200
Message-ID: <87pl1md7h0.ffs@fw13>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain

On Thu, Jun 18 2026 at 11:53, Dave Hansen wrote:
> On 6/18/26 07:50, Zach O'Keefe wrote:
>> Overall, are there any particular painpoints you'd like to see flushed
>> out, first? 
>
> Handing exceptions in the kernel is hard. Period. That's the pain point.
> Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
> we've moved away from ever taking random page faults in the kernel. Or,
> heck, randomly taking faults at *all*. We've concentrated them in very
> specific places, not in general code.
>
> Now you're arguing that the kernel can pretty much take a fault *AND*
> allocate memory reliably at any point*.
>
> I just don't see the collateral in this series to justify that claim.

There is none because it's simply impossible to guarantee. Reading
through the series, even a CPU hotplug operation happily continues with
success when the stack page cache of the upcoming CPU can't be
filled....

> The NMI entry code is a disaster because NMIs can happen anywhere. The
> #VC code is a disaster because #VCs can happen anywhere. Once #PF can
> happen anywhere*, why won't #PF become a disaster?

It's already a disaster. See kvm_handle_async_pf() and the cute issues
vs. taking a #PF in NMI or some other IST handler.

> It would be a completely different story if there was a track record of
> finding and fixing bugs in the x86 entry code from the authors of this
> series. But I don't think I've ever seen a single email from your folks
> before this, much less a review tag or a patch. I'd be much happier if
> you got Andy L's blessing on this, for example.
>
>> How would you like to proceed? Would explicitly marking this as an
>> experimental config, in the interim, be more attractive?
> No.
>
> The enemy here is complexity. *Maintenance* complexity. Being able to
> compile out some of the complexity helps with debugging. But it doesn't
> help maintaining the code.

Correct.

Aside from that, the part which worries me most is the IDT hackery.
That's fragile as hell and full of unvalidated assumptions. Reading
"should not happen" several times in a changelog doesn't make me more
confident.

  "It is possible for #MCE to occur on the #PF IST stack, but the #MCE
   handler shouldn't generate new #PFs. The reentrancy check on the #PF
   stack will trigger if any recoverable #MCEs do generate #PFs - if there
   are actually reports of it happening, we can address it then."

Seriously?

We don't wait until the report comes in because the report won't even
happen in the worst case:

       #PF on IST
         ...
         cmp    0, reentrance
         jne	abort

       #MC
          ...
          #PF rewinds #PF IST
          cmp   0, reentrance
          jne	abort		<- Not taken because #MC happened before
                                   it could be set.

IST is fundamentally not suitable for this and I'm sure there are more
holes in this.
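
To make the window concrete, here is a toy model of the check-then-set
guard, written in Python purely for illustration. The PfIst class, the
pf_entry() function and the mc_at parameter are made-up names, not code
from the series, and the single frame slot is a simplifying assumption
about how an IST entry always rewinds to the top of its stack:

```python
class PfIst:
    """Hypothetical model: one guard flag plus the single IST frame slot."""

    def __init__(self):
        self.reentrance = 0   # guard flag tested on #PF entry
        self.frame = None     # IST entry always rewinds to the same slot


def pf_entry(ist, owner, mc_at=None):
    """Model one #PF entry and return a list of (owner, outcome) events.

    mc_at is None, "before_set" or "after_set". It selects where a #MC
    whose handler itself takes a #PF is injected into this entry.
    """
    events = []
    ist.frame = owner                 # frame lands at the top of the stack
    if ist.reentrance != 0:           # cmp 0, reentrance; jne abort
        events.append((owner, "abort"))
        return events
    if mc_at == "before_set":         # check passed, flag not yet set
        events += pf_entry(ist, "#PF in #MC")
    ist.reentrance = 1
    if mc_at == "after_set":          # flag is set, nested entry is caught
        events += pf_entry(ist, "#PF in #MC")
    # The outer handler only works if its own frame is still in the slot.
    outcome = "ok" if ist.frame == owner else "frame clobbered"
    events.append((owner, outcome))
    ist.reentrance = 0
    return events


for where in (None, "after_set", "before_set"):
    print(where, pf_entry(PfIst(), "#PF", mc_at=where))
```

With mc_at="after_set" the nested #PF hits the abort path, so the
problem is at least visible. With mc_at="before_set" nothing aborts and
the outer frame is silently overwritten, which is exactly the case
where no report ever comes in.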

I haven't looked at the FRED side of affairs yet in detail, but the
handwavy explanation about external interrupts having to be moved to
stack level 1 and unconditionally bounced back does not really make it
appealing. I agree that chapter 8.3.4 in the SDM volume 3 is not really
helpful, but papering over the problem without understanding the root
cause is not cutting it. If it's a genuine FRED hardware issue, then
this needs to be understood and documented.

The x86 folks have spent a lot of time making the horrific x86
interrupt and exception handling solid and therefore have zero interest
in dealing with the fallout of something based on "shouldn't happen"
assumptions. Either it can be proven correct under all circumstances or
it can't.

I understand the "save tons of memory across a fleet" argument, but a
large fleet is also a guarantee to trigger all the "should not happen"
and "improbable" issues which are gracefully handwaved away. That's a
truly bad tradeoff as it ends up in non-decodable bug reports. What's
worse, they have to be handled by the maintainers and not necessarily
by those who implemented it.
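
A back-of-the-envelope sketch of why fleet size works against
"improbable". The per-machine probability and the fleet sizes below are
invented numbers for illustration only, the function name is
hypothetical, and machines are assumed to be independent:

```python
import math


def p_at_least_once(p_per_machine, machines):
    """Probability that an event with per-machine probability
    p_per_machine shows up on at least one of `machines` machines.

    Computes 1 - (1 - p)**N via log1p/expm1 to stay accurate for tiny p.
    Assumes independent machines, which is an illustrative simplification.
    """
    return -math.expm1(machines * math.log1p(-p_per_machine))


# Invented example: a one-in-a-million event per machine per day.
for fleet in (1, 1_000, 1_000_000):
    print(fleet, p_at_least_once(1e-6, fleet))
```

Under these made-up numbers, an event which is effectively invisible on
a single machine shows up somewhere in a million-machine fleet more
often than not on any given day.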

Thanks,

        tglx