From: "Andy Lutomirski" <luto@kernel.org>
Date: Mon, 11 Mar 2024 15:17:28 -0700
Message-Id: <3e180c07-53db-4acb-a75c-1a33447d81af@app.fastmail.com>
In-Reply-To: <20240311164638.2015063-12-pasha.tatashin@soleen.com>
References: <20240311164638.2015063-1-pasha.tatashin@soleen.com>
 <20240311164638.2015063-12-pasha.tatashin@soleen.com>
Tsirkin" , "Nicholas Piggin" , "Peter Zijlstra (Intel)" , "Petr Mladek" , "Rick P Edgecombe" , "Steven Rostedt" , "Suren Baghdasaryan" , "Thomas Gleixner" , "Uladzislau Rezki" , vincent.guittot@linaro.org, vschneid@redhat.com Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks Content-Type: text/plain Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 86B861C000C X-Stat-Signature: wjjjne3oybbbxpj4bnd3nw6bnipicfn9 X-Rspam-User: X-HE-Tag: 1710195482-629774 X-HE-Meta: U2FsdGVkX1+pX2uvsbBu/c2XL9YXmBD1k2XArhr8c5KqjS3OEe8dK499xEcJNlvLbOv22BBJKiavDIpY0zmZq+KGaVl+FFo3v6Uq2rYDRnBAzoR/QrbRQMFLdh8a6yc3L+/A+8UulfVRc0MsaPbDS/Cx0qHEaYHq/vbYLz1p80NkhtM/fe9SASdLcKtl7EVDQXbT5NldY+Lbj49/9MAZ1K934MG7hETd7AbJ/V+QVsVI6/Umi3j8R572mcLphhH/74X1uZSJ6fivAcMt9s49X8EjLi1dryBzLN+SmZ212WZdzhs66RMtmAKJiy8eIwFdWVvsptXZvuwRxJSMe5V7i8g0NiLr7JdtM3JV1L3eBxg+8BJjFyFENAeZy6u1iLUjEpsqWgbGQPA8m+vxofl1XdaaBTXxgBGmZufJ0gLrhRy4sjJqaDzYBnTegEarPlyZMBQwOTE8YsGOPYRljYR1dwCUO/aM1CjRnVqk6XmD+FM8Zg3LOn3c6tslg1/FWAqhvBHpP7CEVaTw4wYsad+nVLwhuc+537r2AaEmf6NB+HDQsQ/qRaQ39nk5wz1RLw7qoYgNUekjg38ebT8dsPFAr5ksmQu9xCovSJqOWbxqp3FlCHKj4HzOG2lu75by3/GJMIles7JNrxKp7p2ltYWKiYtpJLsBfe0sBu97GZ7hnIfbUePDRDx2rqflnjRcsGv1MiZ6GTWnDFETh7wLQqkODmlvtQCax3AvMBfJp2IH6Hy7zR2WvBdQg+tecZNX6SsFHPFeOiXVzrI7c0Iph2sES0o43djz/lVbDvDb5wnm7S9nFcB8Y6L0ARfnzVIGbMDrfUzNf1EoRVR4XSoOB8fQSFPcw0zQusrVhU3Uz+Jl6HpOk1hfw5zEe1Z2iRDkrMrmzGAShvBzaBYQLNT7sOSglSKdIvV4h3enT8Ndvj5yOUsvEjdxZ+iqAiwsW5vcAm5K0KwfYlHXmqJtZ8YSTCh 3uA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Mar 11, 2024, at 9:46 AM, Pasha Tatashin wrote: > Add dynamic_stack_fault() calls to the kernel faults, and also declare > HAVE_ARCH_DYNAMIC_STACK =3D y, so that dynamic kernel stacks can be > enabled on x86 architecture. > > Signed-off-by: Pasha Tatashin > --- > arch/x86/Kconfig | 1 + > arch/x86/kernel/traps.c | 3 +++ > arch/x86/mm/fault.c | 3 +++ > 3 files changed, 7 insertions(+) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 5edec175b9bf..9bb0da3110fa 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -197,6 +197,7 @@ config X86 > select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD > select HAVE_ARCH_USERFAULTFD_MINOR if X86_64 && USERFAULTFD > select HAVE_ARCH_VMAP_STACK if X86_64 > + select HAVE_ARCH_DYNAMIC_STACK if X86_64 > select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET > select HAVE_ARCH_WITHIN_STACK_FRAMES > select HAVE_ASM_MODVERSIONS > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c > index c3b2f863acf0..cc05401e729f 100644 > --- a/arch/x86/kernel/traps.c > +++ b/arch/x86/kernel/traps.c > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault) > } > #endif >=20 > + if (dynamic_stack_fault(current, address)) > + return; > + Sorry, but no, you can't necessarily do this. I say this as the person = who write this code, and I justified my code on the basis that we are no= t recovering -- we're jumping out to a different context, and we won't c= rash if the origin context for the fault is corrupt. The SDM is really = quite unambiguous about it: we're in an "abort" context, and returning i= s not allowed. And I this may well be is the real deal -- the microcode= does not promise to have the return frame and the actual faulting conte= xt matched up here, and there's is no architectural guarantee that retur= ning will do the right thing. 
Now we do have some history of getting a special exception, e.g. for espfix64. But espfix64 is a very special case, and the situation you're looking at is very general. So unless Intel and AMD are both willing to publicly document that it's okay to handle stack overflow, where any instruction in the ISA may have caused the overflow, like this, then we're not going to do it. There are some other options: you could pre-map

Also, I think the whole memory allocation concept in this whole series is a bit odd. Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock. We may be in the page allocator. Panicking due to kernel stack allocation would be very unpleasant. But perhaps we could have a rule that a task can only be scheduled in if there is sufficient memory available for its stack. And perhaps we could avoid ever page-faulting by filling in the PTEs for the potential stack pages but leaving them un-accessed. I *think* that all x86 implementations won't fill the TLB for a non-accessed page without also setting the accessed bit, so the performance hit of filling the PTEs, running the task, and then doing the appropriate synchronization to clear the PTEs and read the accessed bit on schedule-out to release the pages may not be too bad. But you would need to do this cautiously in the scheduler, possibly in the *next* task but before the prev task is actually released enough to be run on a different CPU. It's going to be messy.
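
For what it's worth, a minimal sketch of that schedule-out pass, assuming hypothetical dynamic_stack_pte() and dynamic_stack_free_page() helpers and ignoring the cross-CPU synchronization entirely, could look roughly like this:

#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Sketch only: reclaim stack pages that prev never touched, relying on
 * the (assumed) rule that x86 will not cache a translation in the TLB
 * without first setting the accessed bit.  As noted above, this would
 * have to run after we have switched off prev's stack but before prev
 * can be picked up by another CPU.
 */
static void dynamic_stack_reclaim(struct task_struct *prev)
{
	unsigned long base = (unsigned long)prev->stack;
	int i;

	for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
		unsigned long addr = base + i * PAGE_SIZE;
		pte_t *pte = dynamic_stack_pte(prev, addr);	/* hypothetical lookup */

		/*
		 * Accessed bit still clear: no TLB entry can exist, so the
		 * PTE can be torn down and the page released without a
		 * TLB flush.  Touched pages stay mapped.
		 */
		if (!pte_young(*pte)) {
			pte_clear(&init_mm, addr, pte);
			dynamic_stack_free_page(prev, addr);	/* hypothetical */
		}
	}
}

The schedule-in side would be the mirror image: allocate whatever pages are missing (refusing to schedule the task if that cannot be done) and install them as young==0 PTEs before the task can touch its stack.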