From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 15 Apr 2026 14:11:35 +0200
From: Frederic Weisbecker
To: Valentin Schneider
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	"Peter Zijlstra (Intel)", Nicolas Saenz Julienne, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, "H. Peter Anvin",
	Andy Lutomirski, Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, "Paul E. McKenney", Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, "David S. Miller",
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
	Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik,
	Shrikanth Hegde
Subject: Re: [RFC PATCH v8 09/10] context_tracking,x86: Defer kernel text patching IPIs when tracking CR3 switches
Message-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20260324094801.3092968-1-vschneid@redhat.com> <20260324094801.3092968-10-vschneid@redhat.com>
In-Reply-To: <20260324094801.3092968-10-vschneid@redhat.com>

On Tue, Mar 24, 2026 at 10:48:00AM +0100, Valentin Schneider wrote:
> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
> them vs the newly patched instruction. CPUs that are executing in
> userspace do not need this synchronization to happen immediately, and
> this is actually harmful interference for NOHZ_FULL CPUs.
> 
> As the synchronization IPIs are sent using a blocking call, returning from
> text_poke_bp_batch() implies all CPUs will observe the patched
> instruction(s), and this should be preserved even if the IPI is deferred.
> In other words, to safely defer this synchronization, any kernel
> instruction leading to the execution of the deferred instruction
> sync must *not* be mutable (patchable) at runtime.
> 
> This means we must pay attention to mutable instructions in the early entry
> code:
> - alternatives
> - static keys
> - static calls
> - all sorts of probes (kprobes/ftrace/bpf/???)
> 
> The early entry code is noinstr, which gets rid of the probes.
> 
> Alternatives are safe, because it's boot-time patching (before SMP is
> even brought up) which is before any IPI deferral can happen.
> 
> This leaves us with static keys and static calls. Any static key used in
> early entry code should be only forever-enabled at boot time, IOW
> __ro_after_init (pretty much like alternatives). Exceptions to that will
> now be caught by objtool.
> 
> The deferred instruction sync is the CR3 RMW done as part of
> kPTI when switching to the kernel page table:
> 
> SDM vol2 chapter 4.3 - Move to/from control registers:
> ```
> MOV CR* instructions, except for MOV CR8, are serializing instructions.
> ```
> 
> Leverage the new kernel_cr3_loaded signal and the kPTI CR3 RMW to defer
> sync_core() IPIs targeting NOHZ_FULL CPUs.
> 
> Signed-off-by: Peter Zijlstra (Intel)
> Signed-off-by: Nicolas Saenz Julienne
> Signed-off-by: Valentin Schneider
> ---
>  arch/x86/include/asm/text-patching.h |  5 ++++
>  arch/x86/kernel/alternative.c        | 34 +++++++++++++++++++++++-----
>  arch/x86/kernel/kprobes/core.c       |  4 ++--
>  arch/x86/kernel/kprobes/opt.c        |  4 ++--
>  arch/x86/kernel/module.c             |  2 +-
>  5 files changed, 38 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index f2d142a0a862e..628e80f8318cd 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -33,6 +33,11 @@ extern void text_poke_apply_relocation(u8 *buf, const u8 * const instr, size_t i
>   */
>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>  extern void smp_text_poke_sync_each_cpu(void);
> +#ifdef CONFIG_TRACK_CR3
> +extern void smp_text_poke_sync_each_cpu_deferrable(void);
> +#else
> +#define smp_text_poke_sync_each_cpu_deferrable smp_text_poke_sync_each_cpu
> +#endif
>  extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
>  extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
>  #define text_poke_copy text_poke_copy
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 28518371d8bf3..f3af77d7c533c 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -6,6 +6,7 @@
>  #include
>  #include
>  #include
> +#include
> 
>  #include
>  #include
> @@ -13,6 +14,7 @@
>  #include
>  #include
>  #include
> +#include
> 
>  int __read_mostly alternatives_patched;
> 
> @@ -2706,11 +2708,29 @@ static void do_sync_core(void *info)
>  	sync_core();
>  }
> 
> +static void __smp_text_poke_sync_each_cpu(smp_cond_func_t cond_func)
> +{
> +	on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
> +}
> +
>  void smp_text_poke_sync_each_cpu(void)
>  {
> -	on_each_cpu(do_sync_core, NULL, 1);
> +	__smp_text_poke_sync_each_cpu(NULL);
> +}
> +
> +#ifdef CONFIG_TRACK_CR3
> +static bool do_sync_core_defer_cond(int cpu, void *info)
> +{
> +	return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
> +	       per_cpu(kernel_cr3_loaded, cpu);

|| should be && ?

Also I would again expect full ordering here with an smp_mb() before the
check. So that:

    CPU 0                        CPU 1
    -----                        -----
    //enter_kernel               //do_sync_core_defer_cond
    kernel_cr3_loaded = 1        WRITE page table
    smp_mb()                     smp_mb()
    WRITE cr3                    READ kernel_cr3_loaded

But I'm not sure if that ordering is enough to imply that if CPU 1
observes kernel_cr3_loaded == 0, then subsequent CPU 0 entering the
kernel is guaranteed to flush the TLB with the latest page table write.

Thoughts?

Thanks.

-- 
Frederic Weisbecker
SUSE Labs