Date: Wed, 19 Nov 2025 09:31:37 -0800
From: "Andy Lutomirski"
To: "Valentin Schneider", "Linux Kernel Mailing List", linux-mm@kvack.org, rcu@vger.kernel.org, "the arch/x86 maintainers", linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev, linux-riscv@lists.infradead.org, linux-arch@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Cc: "Thomas Gleixner", "Ingo Molnar", "Borislav Petkov", "Dave Hansen", "H. Peter Anvin", "Peter Zijlstra (Intel)", "Arnaldo Carvalho de Melo", "Josh Poimboeuf", "Paolo Bonzini", "Arnd Bergmann", "Frederic Weisbecker", "Paul E. McKenney", "Jason Baron", "Steven Rostedt", "Ard Biesheuvel", "Sami Tolvanen", "David S. Miller", "Neeraj Upadhyay", "Joel Fernandes", "Josh Triplett", "Boqun Feng", "Uladzislau Rezki", "Mathieu Desnoyers", "Mel Gorman", "Andrew Morton", "Masahiro Yamada", "Han Shen", "Rik van Riel", "Jann Horn", "Dan Carpenter", "Oleg Nesterov", "Juri Lelli", "Clark Williams", "Yair Podemsky", "Marcelo Tosatti", "Daniel Wagner", "Petr Tesarik", "Shrikanth Hegde"
Message-Id: <91702ceb-afba-450e-819b-52d482d7bd11@app.fastmail.com>
References: <20251114150133.1056710-1-vschneid@redhat.com> <20251114151428.1064524-9-vschneid@redhat.com> <65ae9404-5d7d-42a3-969e-7e2ceb56c433@app.fastmail.com>
Subject: Re: [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3

On Wed, Nov 19, 2025, at 7:44 AM, Valentin Schneider wrote:
> On 19/11/25 06:31, Andy Lutomirski wrote:
>> On Fri, Nov 14, 2025, at 7:14 AM, Valentin Schneider wrote:
>>> Deferring kernel range TLB flushes requires the guarantee that upon
>>> entering the kernel, no stale entry may be accessed. The simplest way to
>>> provide such a guarantee is to issue an unconditional flush upon switching
>>> to the kernel CR3, as this is the pivoting point where such stale entries
>>> may be accessed.
>>>
>>
>> Doing this together with the PTI CR3 switch has no actual benefit: MOV CR3
>> doesn't flush global pages. And doing this in asm is pretty gross. We don't
>> even get a free sync_core() out of it because INVPCID is not documented as
>> being serializing.
>>
>> Why can't we do it in C? What's the actual risk? In order to trip over a
>> stale TLB entry, we would need to dereference a pointer to newly allocated
>> kernel virtual memory that was not valid prior to our entry into user mode.
>> I can imagine BPF doing this, but plain noinstr C in the entry path?
>> Especially noinstr C *that has RCU disabled*? We already can't follow an
>> RCU pointer, and ISTM the only style of kernel code that might do this
>> would use RCU to protect the pointer, and we are already doomed if we
>> follow an RCU pointer to any sort of memory.
>>
>
> So v4 and earlier had the TLB flush faff done in C in the context_tracking entry
> just like sync_core().
>
> My biggest issue with it was that I couldn't figure out a way to instrument
> memory accesses such that I would get an idea of where vmalloc'd accesses
> happen - even with a hackish thing just to survey the landscape. So while I
> agree with your reasoning wrt entry noinstr code, I don't have any way to
> prove it.
> That's unlike the text_poke sync_core() deferral for which I have all of
> that nice objtool instrumentation.
>
> Dave also pointed out that the whole stale entry flush deferral is a risky
> move, and that the sanest thing would be to execute the deferred flush just
> after switching to the kernel CR3.
>
> See the thread surrounding:
> https://lore.kernel.org/lkml/20250114175143.81438-30-vschneid@redhat.com/
>
> mainly Dave's reply and subthread:
> https://lore.kernel.org/lkml/352317e3-c7dc-43b4-b4cb-9644489318d0@intel.com/
>
>> We do need to watch out for NMI/MCE hitting before we flush.

I read a decent fraction of that thread. Let's consider what we're worried about:

1. Architectural access to a kernel virtual address that has been unmapped, in asm or early C. If it hasn't been remapped, then we oops anyway. If it has, then that means we're accessing a pointer where either the pointer has changed or the pointee has been remapped while we're in user mode, and that's a very strange thing to do for anything that the asm points to or that early C points to, unless RCU is involved. But RCU is already disallowed in the entry paths that might be in extended quiescent states, so I think this is mostly a nonissue.

2. Non-speculative access via GDT access, etc. We can't control this at all, but we're not about to move the GDT, IDT, LDT etc. of a running task while that task is in user mode. We do move the LDT, but that's quite thoroughly synchronized via IPI. (Should probably be double-checked. I wrote that code, but that doesn't mean I remember it exactly.)

3. Speculative TLB fills. We can't control this at all. We have had actual machine checks, on AMD IIRC, due to messing this up. This is why we can't defer a flush after freeing a page table.

4. Speculative or other nonarchitectural loads. One would hope that these are not dangerous. For example, an early version of TDX would machine check if we did a speculative load from TDX memory, but that was fixed. I don't see why this would be materially different between actual userspace execution (without LASS, anyway), kernel asm, and kernel C.

5. Writes to page table dirty bits. I don't think we use these.

In any case, the current implementation in your series is really, really, utterly horrifically slow. It's probably fine for a task that genuinely sits in usermode forever, but I don't think it's likely to be something that we'd be willing to enable for normal kernels and normal tasks. And it would be really nice for the don't-interrupt-user-code stuff to move toward being always available rather than further from it.

I admit that I'm kind of with dhansen: Zen 3+ can use INVLPGB and doesn't need any of this. Some Intel CPUs support RAR and will eventually be able to use RAR, possibly even for sync_core().