Date: Mon, 11 Mar 2024 16:34:11 -0700
From: "Andy Lutomirski" <luto@kernel.org>
To: "Pasha Tatashin"
Cc: "Linux Kernel Mailing List", linux-mm@kvack.org, "Andrew Morton",
 "the arch/x86 maintainers", "Borislav Petkov", "Christian Brauner",
 bristot@redhat.com, "Ben Segall", "Dave Hansen", dianders@chromium.org,
 dietmar.eggemann@arm.com, eric.devolder@oracle.com, hca@linux.ibm.com,
 "hch@infradead.org", "H. Peter Anvin", "Jacob Pan", "Jason Gunthorpe",
 jpoimboe@kernel.org, "Joerg Roedel", juri.lelli@redhat.com,
 "Kent Overstreet", kinseyho@google.com, "Kirill A. Shutemov",
 lstoakes@gmail.com, mgorman@suse.de, mic@digikod.net,
 michael.christie@oracle.com, "Ingo Molnar", mjguzik@gmail.com,
 "Michael S. Tsirkin", "Nicholas Piggin", "Peter Zijlstra (Intel)",
 "Petr Mladek", "Rick P Edgecombe", "Steven Rostedt",
 "Suren Baghdasaryan", "Thomas Gleixner", "Uladzislau Rezki",
 vincent.guittot@linaro.org, vschneid@redhat.com
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks
Message-Id: <1ac305b1-d28f-44f6-88e5-c85d9062f9e8@app.fastmail.com>
References: <20240311164638.2015063-1-pasha.tatashin@soleen.com>
 <20240311164638.2015063-12-pasha.tatashin@soleen.com>
 <3e180c07-53db-4acb-a75c-1a33447d81af@app.fastmail.com>

On Mon, Mar 11, 2024, at 4:10 PM, Pasha Tatashin wrote:
> On Mon, Mar 11, 2024 at 6:17 PM Andy Lutomirski wrote:
>>
>> On Mon, Mar 11, 2024, at 9:46 AM, Pasha Tatashin wrote:
>> > Add dynamic_stack_fault() calls to the kernel faults, and also declare
>> > HAVE_ARCH_DYNAMIC_STACK = y, so that dynamic kernel stacks can be
>> > enabled on x86 architecture.
>> >
>> > Signed-off-by: Pasha Tatashin
>> > ---
>> >  arch/x86/Kconfig        | 1 +
>> >  arch/x86/kernel/traps.c | 3 +++
>> >  arch/x86/mm/fault.c     | 3 +++
>> >  3 files changed, 7 insertions(+)
>> >
>> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> > index 5edec175b9bf..9bb0da3110fa 100644
>> > --- a/arch/x86/Kconfig
>> > +++ b/arch/x86/Kconfig
>> > @@ -197,6 +197,7 @@ config X86
>> >  	select HAVE_ARCH_USERFAULTFD_WP		if X86_64 && USERFAULTFD
>> >  	select HAVE_ARCH_USERFAULTFD_MINOR	if X86_64 && USERFAULTFD
>> >  	select HAVE_ARCH_VMAP_STACK		if X86_64
>> > +	select HAVE_ARCH_DYNAMIC_STACK		if X86_64
>> >  	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
>> >  	select HAVE_ARCH_WITHIN_STACK_FRAMES
>> >  	select HAVE_ASM_MODVERSIONS
>> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> > index c3b2f863acf0..cc05401e729f 100644
>> > --- a/arch/x86/kernel/traps.c
>> > +++ b/arch/x86/kernel/traps.c
>> > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
>> >  	}
>> >  #endif
>> >
>> > +	if (dynamic_stack_fault(current, address))
>> > +		return;
>> > +
>>
>> Sorry, but no, you can't necessarily do this. I say this as the person who wrote this code, and I justified my code on the basis that we are not recovering -- we're jumping out to a different context, and we won't crash if the origin context for the fault is corrupt. The SDM is really quite unambiguous about it: we're in an "abort" context, and returning is not allowed. And this may well be the real deal -- the microcode does not promise to have the return frame and the actual faulting context matched up here, and there is no architectural guarantee that returning will do the right thing.
>>
>> Now we do have some history of getting a special exception, e.g. for espfix64. But espfix64 is a very special case, and the situation you're looking at is very general. So unless Intel and AMD are both willing to publicly document that it's okay to handle stack overflow, where any instruction in the ISA may have caused the overflow, like this, then we're not going to do it.
>
> Hi Andy,
>
> Thank you for the insightful feedback.
>
> I'm somewhat confused about why we end up in exc_double_fault() in the
> first place. My initial assumption was that dynamic_stack_fault()
> would only be needed within do_kern_addr_fault(). However, while
> testing in QEMU, I found that when using memset() on a stack variable,
> code like this:
>
>   rep stos %rax,%es:(%rdi)
>
> causes a double fault instead of a regular fault. I added it to
> exc_double_fault() as a result, but I'm curious if you have any
> insights into why this behavior occurs.
>

Imagine you're a CPU running kernel code, on a fairly traditional architecture like x86. The code tries to access some swapped-out user memory. You say "sorry, that memory is not present" and generate a page fault. You save the current state *to the stack* and change the program counter to point to the page fault handler. The page fault handler does its thing, then pops the old state off the stack and resumes the faulting code.

A few microseconds later, the kernel fills up its stack and then does:

  PUSH something

but that would write to a not-present stack page, because you already filled the stack. Okay, a page fault -- no big deal, we know how to handle that. So you push the current state to the stack. Oh wait, you *can't* push the current state to the stack, because that would involve writing to an unmapped page of memory. So you trigger a double fault.
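As a concrete illustration of the memset case above, consider something like the following. This is a purely hypothetical sketch, not code from this series, and it assumes the compiler lowers the memset() to a "rep stos" string operation:

#include <string.h>

/*
 * Hypothetical example only: a large on-stack buffer.  The memset()
 * writes into freshly claimed stack space that a dynamically grown
 * stack may not have mapped yet, and the exception frame for the
 * resulting page fault would have to be pushed even lower on that
 * same unmapped stack -- hence the escalation to a double fault.
 */
static int touch_big_stack_buffer(void)
{
	char buf[4096];			/* may span beyond the mapped stack pages */

	memset(buf, 0, sizeof(buf));	/* typically compiled to "rep stos" */
	return buf[0];
}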
You push some state to the double-fault handler's special emergency stack. But wait, *what* state do you push? Is it the state that did the "PUSH something" and overflowed the stack? Or is it some virtual state that's a mixture of that and the failed page fault handler? What if the stack wasn't quite full and you actually succeeded in pushing the old stack pointer but not the old program counter? What saved state goes where?

This is a complicated mess, so the people who designed all this said 'hey, wait a minute, let's not call double faults a "fault" -- let's call them an "abort"' so we can stop confusing ourselves and ship CPUs to customers. And "abort" means "the saved state is not well defined -- don't rely on it having any particular meaning".

So, until a few years ago, we would just print something like "PANIC: double fault" and kill the whole system. A few years ago, I decided this was lame, and I wanted to have stack guard pages, so I added real fancy new logic: instead, we do our best to display the old state, but it's a guess and all we're doing with it is printk -- if it's wrong, it's annoying, but that's all. And then we kill the running thread -- instead of trying to return (and violating our sacred contract with the x86 architecture), we *reset* the current crashing thread's state to a known-good state. Then we return to *that* state. Now we're off the emergency stack and we're running something resembling normal kernel code, but we can't return, as there is nowhere to return to. But that's fine -- instead we kill the current thread, kind of like _exit(). That never returns, so it's okay that we can't return.

But your patch adds a return statement to this whole mess, which will return to the moderately-likely-to-be-corrupt state that caused a double fault inside the microcode for the page fault path. You have stepped outside the well-defined path in the x86 architecture, and you've triggered something akin to Undefined Behavior. The CPU won't catch fire, but it reserves the right to execute from an incorrect RSP and/or RIP, to be in the middle of an instruction, etc.

(For that matter, what if there was exactly enough room to enter the page fault handler, but the very first instruction of the page fault handler overflowed the stack? Then you allocate more memory, get lucky and successfully resume the page fault handler, and then promptly OOPS because you run the page fault handler and it thinks you got a kernel page fault? My OOPS code handles that, but, again, it's not trying to recover.)

>> There are some other options: you could pre-map
>
> Pre-mapping would be expensive. It would mean pre-mapping the dynamic
> pages for every scheduled thread, and we'd still need to check the
> access bit every time a thread leaves the CPU.

That's a write to four consecutive words in memory, with no locking required.

> Dynamic thread faults
> should be considered rare events and thus shouldn't significantly
> affect the performance of normal context switch operations. With 8K
> stacks, we might encounter only 0.00001% of stacks requiring an extra
> page, and even fewer needing 16K.

Well yes, but if you crash 0.0001% of the time due to the microcode not liking you, you lose. :)

>
>> Also, I think the whole memory allocation concept in this whole series is a bit odd. Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock. We may be in the page allocator.
>> Panicking due to kernel stack allocation would be very unpleasant.
>
> We never block while handling stack faults. There's a per-CPU page
> pool, guaranteeing availability for the faulting thread. The thread
> simply takes pages from this per-CPU data structure and refills the
> pool when leaving the CPU. The faulting routine is efficient,
> requiring a fixed number of loads without any locks, stalling, or even
> cmpxchg operations.

You can't block when scheduling, either. What if you can't refill the pool?
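For reference, here is a rough sketch of the per-CPU pool scheme described above. The names and sizes are hypothetical and this is not the code from the series; it only illustrates the shape of the design under discussion, with the refill step being exactly where the closing question bites, since an allocation on the scheduling path cannot sleep and may fail:

#include <linux/gfp.h>
#include <linux/mm_types.h>
#include <linux/percpu.h>

#define DSTACK_POOL_PAGES	3	/* hypothetical: enough to grow an 8K stack to 16K+ */

struct dstack_pool {
	struct page	*pages[DSTACK_POOL_PAGES];
	int		nr;		/* pages currently available */
};

static DEFINE_PER_CPU(struct dstack_pool, dstack_pool);

/*
 * Fault path (assumed to run with preemption disabled): a fixed number
 * of loads, no locks, no allocation, no sleeping.
 */
static struct page *dstack_take_page(void)
{
	struct dstack_pool *pool = this_cpu_ptr(&dstack_pool);

	if (!pool->nr)
		return NULL;		/* pool empty: the case being debated */
	return pool->pages[--pool->nr];
}

/*
 * Called when the task leaves the CPU.  This path cannot sleep either,
 * so the refill itself can fail.
 */
static void dstack_refill(void)
{
	struct dstack_pool *pool = this_cpu_ptr(&dstack_pool);

	while (pool->nr < DSTACK_POOL_PAGES) {
		struct page *page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);

		if (!page)
			break;
		pool->pages[pool->nr++] = page;
	}
}

If dstack_take_page() ever returns NULL on the fault path, there is no page to map and nowhere safe to block, which is the failure mode the question above is pointing at.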