Date: Mon, 11 Mar 2024 16:34:11 -0700
From: "Andy Lutomirski" <luto@kernel.org>
To: "Pasha Tatashin"
Cc: "Linux Kernel Mailing List", linux-mm@kvack.org, "Andrew Morton",
 "the arch/x86 maintainers", "Borislav Petkov", "Christian Brauner",
 bristot@redhat.com, "Ben Segall", "Dave Hansen", dianders@chromium.org,
 dietmar.eggemann@arm.com, eric.devolder@oracle.com, hca@linux.ibm.com,
 "hch@infradead.org", "H. Peter Anvin", "Jacob Pan", "Jason Gunthorpe",
 jpoimboe@kernel.org, "Joerg Roedel", juri.lelli@redhat.com,
 "Kent Overstreet", kinseyho@google.com, "Kirill A. Shutemov",
 lstoakes@gmail.com, mgorman@suse.de, mic@digikod.net,
 michael.christie@oracle.com, "Ingo Molnar", mjguzik@gmail.com,
 "Michael S. Tsirkin", "Nicholas Piggin", "Peter Zijlstra (Intel)",
 "Petr Mladek", "Rick P Edgecombe", "Steven Rostedt",
 "Suren Baghdasaryan", "Thomas Gleixner", "Uladzislau Rezki",
 vincent.guittot@linaro.org, vschneid@redhat.com
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks
Message-Id: <1ac305b1-d28f-44f6-88e5-c85d9062f9e8@app.fastmail.com>
References: <20240311164638.2015063-1-pasha.tatashin@soleen.com>
 <20240311164638.2015063-12-pasha.tatashin@soleen.com>
 <3e180c07-53db-4acb-a75c-1a33447d81af@app.fastmail.com>

On Mon, Mar 11, 2024, at 4:10 PM, Pasha Tatashin wrote:
> On Mon, Mar 11, 2024 at 6:17 PM Andy Lutomirski wrote:
>>
>> On Mon, Mar 11, 2024, at 9:46 AM, Pasha Tatashin wrote:
>> > Add dynamic_stack_fault() calls to the kernel faults, and also declare
>> > HAVE_ARCH_DYNAMIC_STACK = y, so that dynamic kernel stacks can be
>> > enabled on x86 architecture.
>> >
>> > Signed-off-by: Pasha Tatashin
>> > ---
>> >  arch/x86/Kconfig        | 1 +
>> >  arch/x86/kernel/traps.c | 3 +++
>> >  arch/x86/mm/fault.c     | 3 +++
>> >  3 files changed, 7 insertions(+)
>> >
>> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> > index 5edec175b9bf..9bb0da3110fa 100644
>> > --- a/arch/x86/Kconfig
>> > +++ b/arch/x86/Kconfig
>> > @@ -197,6 +197,7 @@ config X86
>> >  	select HAVE_ARCH_USERFAULTFD_WP		if X86_64 && USERFAULTFD
>> >  	select HAVE_ARCH_USERFAULTFD_MINOR	if X86_64 && USERFAULTFD
>> >  	select HAVE_ARCH_VMAP_STACK		if X86_64
>> > +	select HAVE_ARCH_DYNAMIC_STACK		if X86_64
>> >  	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
>> >  	select HAVE_ARCH_WITHIN_STACK_FRAMES
>> >  	select HAVE_ASM_MODVERSIONS
>> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> > index c3b2f863acf0..cc05401e729f 100644
>> > --- a/arch/x86/kernel/traps.c
>> > +++ b/arch/x86/kernel/traps.c
>> > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
>> >  	}
>> >  #endif
>> >
>> > +	if (dynamic_stack_fault(current, address))
>> > +		return;
>> > +
>>
>> Sorry, but no, you can't necessarily do this. I say this as the person who wrote this code, and I justified my code on the basis that we are not recovering -- we're jumping out to a different context, and we won't crash if the origin context for the fault is corrupt. The SDM is really quite unambiguous about it: we're in an "abort" context, and returning is not allowed. And this may well be the real deal -- the microcode does not promise to have the return frame and the actual faulting context matched up here, and there is no architectural guarantee that returning will do the right thing.
>>
>> Now we do have some history of getting a special exception, e.g. for espfix64. But espfix64 is a very special case, and the situation you're looking at is very general. So unless Intel and AMD are both willing to publicly document that it's okay to handle stack overflow, where any instruction in the ISA may have caused the overflow, like this, then we're not going to do it.
>
> Hi Andy,
>
> Thank you for the insightful feedback.
>
> I'm somewhat confused about why we end up in exc_double_fault() in the
> first place. My initial assumption was that dynamic_stack_fault()
> would only be needed within do_kern_addr_fault(). However, while
> testing in QEMU, I found that when using memset() on a stack variable,
> code like this:
>
>   rep stos %rax,%es:(%rdi)
>
> causes a double fault instead of a regular fault. I added it to
> exc_double_fault() as a result, but I'm curious if you have any
> insights into why this behavior occurs.
>

Imagine you're a CPU running kernel code, on a fairly traditional architecture like x86. The code tries to access some swapped-out user memory. You say "sorry, that memory is not present" and generate a page fault. You save the current state *to the stack* and change the program counter to point to the page fault handler. The page fault handler does its thing, then pops the old state off the stack and resumes the faulting code.

A few microseconds later, the kernel fills up its stack and then does:

  PUSH something

but that would write to a not-present stack page, because you already filled the stack. Okay, a page fault -- no big deal, we know how to handle that. So you push the current state to the stack. Oh wait, you *can't* push the current state to the stack, because that would involve writing to an unmapped page of memory. So you trigger a double fault.
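As a concrete illustration of the memset case above, consider something like the following. This is a purely hypothetical sketch, not code from this series, and it assumes the compiler lowers the memset() to a "rep stos" string operation:

#include <string.h>

/*
 * Hypothetical example only: a large on-stack buffer.  The memset()
 * writes into freshly claimed stack space that a dynamically grown
 * stack may not have mapped yet, and the exception frame for the
 * resulting page fault would have to be pushed even lower on that
 * same unmapped stack -- hence the escalation to a double fault.
 */
static int touch_big_stack_buffer(void)
{
	char buf[4096];			/* may span beyond the mapped stack pages */

	memset(buf, 0, sizeof(buf));	/* typically compiled to "rep stos" */
	return buf[0];
}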
You push some state to the double-fault handler's special emergency stack. But wait, *what* state do you push? Is it the state that did the "PUSH something" and overflowed the stack? Or is it some virtual state that's a mixture of that and the failed page fault handler? What if the stack wasn't quite full and you actually succeeded in pushing the old stack pointer but not the old program counter? What saved state goes where?

This is a complicated mess, so the people who designed all this said 'hey, wait a minute, let's not call double faults a "fault" -- let's call them an "abort"' so we can stop confusing ourselves and ship CPUs to customers. And "abort" means "the saved state is not well defined -- don't rely on it having any particular meaning".

So, until a few years ago, we would just print something like "PANIC: double fault" and kill the whole system. A few years ago, I decided this was lame, and I wanted to have stack guard pages, so I added real fancy new logic: instead, we do our best to display the old state, but it's a guess and all we're doing with it is printk -- if it's wrong, it's annoying, but that's all. And then we kill the running thread -- instead of trying to return (and violating our sacred contract with the x86 architecture), we *reset* the current crashing thread's state to a known-good state. Then we return to *that* state. Now we're off the emergency stack and we're running something resembling normal kernel code, but we can't return, as there is nowhere to return to. But that's fine -- instead we kill the current thread, kind of like _exit(). That never returns, so it's okay that we can't return.

But your patch adds a return statement to this whole mess, which will return to the moderately-likely-to-be-corrupt state that caused a double fault inside the microcode for the page fault path. You have stepped outside the well-defined path in the x86 architecture, and you've triggered something akin to Undefined Behavior. The CPU won't catch fire, but it reserves the right to execute from an incorrect RSP and/or RIP, to be in the middle of an instruction, etc.

(For that matter, what if there was exactly enough room to enter the page fault handler, but the very first instruction of the page fault handler overflowed the stack? Then you allocate more memory, get lucky and successfully resume the page fault handler, and then promptly OOPS because you run the page fault handler and it thinks you got a kernel page fault? My OOPS code handles that, but, again, it's not trying to recover.)

>> There are some other options: you could pre-map
>
> Pre-mapping would be expensive. It would mean pre-mapping the dynamic
> pages for every scheduled thread, and we'd still need to check the
> access bit every time a thread leaves the CPU.

That's a write to four consecutive words in memory, with no locking required.

> Dynamic thread faults
> should be considered rare events and thus shouldn't significantly
> affect the performance of normal context switch operations. With 8K
> stacks, we might encounter only 0.00001% of stacks requiring an extra
> page, and even fewer needing 16K.

Well yes, but if you crash 0.0001% of the time due to the microcode not liking you, you lose. :)

>
>> Also, I think the whole memory allocation concept in this whole series is a bit odd. Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock. We may be in the page allocator.
>> Panicking due to kernel stack allocation would be very unpleasant.
>
> We never block while handling stack faults. There's a per-CPU page
> pool, guaranteeing availability for the faulting thread. The thread
> simply takes pages from this per-CPU data structure and refills the
> pool when leaving the CPU. The faulting routine is efficient,
> requiring a fixed number of loads without any locks, stalling, or even
> cmpxchg operations.

You can't block when scheduling, either. What if you can't refill the pool?
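For reference, here is a rough sketch of the per-CPU pool scheme described above. The names and sizes are hypothetical and this is not the code from the series; it only illustrates the shape of the design under discussion, with the refill step being exactly where the closing question bites, since an allocation on the scheduling path cannot sleep and may fail:

#include <linux/gfp.h>
#include <linux/mm_types.h>
#include <linux/percpu.h>

#define DSTACK_POOL_PAGES	3	/* hypothetical: enough to grow an 8K stack to 16K+ */

struct dstack_pool {
	struct page	*pages[DSTACK_POOL_PAGES];
	int		nr;		/* pages currently available */
};

static DEFINE_PER_CPU(struct dstack_pool, dstack_pool);

/*
 * Fault path (assumed to run with preemption disabled): a fixed number
 * of loads, no locks, no allocation, no sleeping.
 */
static struct page *dstack_take_page(void)
{
	struct dstack_pool *pool = this_cpu_ptr(&dstack_pool);

	if (!pool->nr)
		return NULL;		/* pool empty: the case being debated */
	return pool->pages[--pool->nr];
}

/*
 * Called when the task leaves the CPU.  This path cannot sleep either,
 * so the refill itself can fail.
 */
static void dstack_refill(void)
{
	struct dstack_pool *pool = this_cpu_ptr(&dstack_pool);

	while (pool->nr < DSTACK_POOL_PAGES) {
		struct page *page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);

		if (!page)
			break;
		pool->pages[pool->nr++] = page;
	}
}

If dstack_take_page() ever returns NULL on the fault path, there is no page to map and nowhere safe to block, which is the failure mode the question above is pointing at.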