* [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
@ 2025-11-14 15:01 Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 01/31] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
                   ` (31 more replies)
  0 siblings, 32 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

Context
=======

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

  64359.052209596    NetworkManager       0    1405     smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
    smp_call_function_many_cond+0x1
    smp_call_function+0x39
    on_each_cpu+0x2a
    flush_tlb_kernel_range+0x7b
    __purge_vmap_area_lazy+0x70
    _vm_unmap_aliases.part.42+0xdf
    change_page_attr_set_clr+0x16a
    set_memory_ro+0x26
    bpf_int_jit_compile+0x2f9
    bpf_prog_select_runtime+0xc6
    bpf_prepare_filter+0x523
    sk_attach_filter+0x13
    sock_setsockopt+0x92c
    __sys_setsockopt+0x16a
    __x64_sys_setsockopt+0x20
    do_syscall_64+0x87
    entry_SYSCALL_64_after_hwframe+0x65

The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.

The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.

Deferral approach
=================

Storing each and every callback, as in a secondary call_single_queue, turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible, so no signal of any form is sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
each IPI a "type" and keeping a mapping of IPI type to callback, which is
walked upon kernel entry.
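
To illustrate the coalescing, here's a minimal, self-contained userspace toy
(names and types are made up for the example and do not match the actual
ct_work_* API): each deferrable IPI gets a bit in a per-CPU mask, remote CPUs
OR their bit in instead of sending an IPI, and the target CPU flushes the whole
mask once upon kernel entry, running each callback at most once no matter how
many times it was requested:

  #include <stdatomic.h>
  #include <stdio.h>

  /* Hypothetical IPI "types"; the series uses its own CT_WORK_* bits. */
  enum { WORK_SYNC_CORE = 1 << 0, WORK_TLB_FLUSH = 1 << 1 };

  /* One of these per CPU in reality. */
  static _Atomic unsigned int deferred_work;

  /* What a housekeeping CPU would do instead of sending an IPI. */
  static void defer_ipi(unsigned int type)
  {
          atomic_fetch_or(&deferred_work, type);
  }

  /* Run once on the target CPU's next user->kernel transition. */
  static void flush_deferred_work(void)
  {
          unsigned int work = atomic_exchange(&deferred_work, 0);

          if (work & WORK_SYNC_CORE)
                  puts("sync_core()");
          if (work & WORK_TLB_FLUSH)
                  puts("flush_tlb_kernel_range()");
  }

  int main(void)
  {
          /* Several patching events while the CPU sits in userspace... */
          defer_ipi(WORK_SYNC_CORE);
          defer_ipi(WORK_SYNC_CORE);
          defer_ipi(WORK_TLB_FLUSH);

          /* ...coalesce into one callback per type at kernel entry. */
          flush_deferred_work();
          return 0;
  }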

Kernel entry vs execution of the deferred operation
===================================================

This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].

There is a non-zero amount of code that is executed upon kernel entry before the
deferred operation can itself be executed (before we start getting into
context_tracking.c proper), i.e.:

  idtentry_func_foo()                <--- we're in the kernel
    irqentry_enter()
      irqentry_enter_from_user_mode()
	enter_from_user_mode()
	  [...]
	    ct_kernel_enter_state()
	      ct_work_flush()        <--- deferred operation is executed here

This means one must take extra care about what can happen in the early entry
code, and ensure that <bad things> cannot happen. For instance, we really don't
want to hit instructions that have been modified by a remote text_poke() while
we're on our way to execute a deferred sync_core(). The patches doing the actual
deferral have more detail on this.

The annoying one: TLB flush deferral
====================================

While leveraging the context_tracking subsystem works for deferring things like
kernel text synchronization, it falls apart when it comes to kernel range TLB
flushes. Consider the following execution flow:

  <userspace>
  
  !interrupt!

  SWITCH_TO_KERNEL_CR3        <--- vmalloc range becomes accessible

  idtentry_func_foo()
    irqentry_enter()
      irqentry_enter_from_user_mode()
	enter_from_user_mode()
	  [...]
	    ct_kernel_enter_state()
	      ct_work_flush() <--- deferred flush would be done here


Since there is no sane way to assert no stale entry is accessed during
kernel entry, any code executed between SWITCH_TO_KERNEL_CR3 and
ct_work_flush() is at risk of accessing a stale entry.

Dave had suggested hacking up something within SWITCH_TO_KERNEL_CR3 itself,
which is what has been implemented in the new RFC patches.

How bad is it?
==============

Code
++++

I'm happy that the COALESCE_TLBI asm code fits in ~half a screen,
although it open-codes native_write_cr4() without the pinning logic.
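
For the record, this is roughly what the open-coded sequence boils down to at
the C level: toggling CR4.PGE flushes all TLB entries, global ones included
(cf. what native_flush_tlb_global() does on CPUs without INVPCID). This is just
an illustrative sketch, not the actual COALESCE_TLBI asm, and it deliberately
skips the CR4 pinning checks that native_write_cr4() performs:

  /* Illustrative only: privileged code, kernel context assumed. */
  static inline void toy_flush_tlb_all_global(void)
  {
          unsigned long cr4;

          asm volatile("mov %%cr4, %0" : "=r" (cr4));
          /* Clearing PGE (bit 7) flushes everything; then restore it. */
          asm volatile("mov %0, %%cr4" : : "r" (cr4 & ~(1UL << 7)) : "memory");
          asm volatile("mov %0, %%cr4" : : "r" (cr4) : "memory");
  }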

I hate the kernel_cr3_loaded signal; it's a kludgy context_tracking.state
duplicate, but I need *some* sort of signal to drive the TLB flush deferral, and
the context_tracking.state one is set too late in kernel entry. I couldn't
find any fitting existing signal for this.

I'm also unhappy about introducing two different IPI deferral mechanisms. I
tried shoving the text_poke_sync() into SWITCH_TO_KERNEL_CR3, but it got
ugly(er) really fast.

Performance
+++++++++++

Tested by measuring the duration of 10M `syscall(SYS_getpid)` calls on
NOHZ_FULL CPUs, with rteval (hackbench + kernel compilation) running on the
housekeeping CPUs:

o Xeon E5-2699:   base avg 770ns,  patched avg 1340ns (74% increase)
o Xeon E7-8890:   base avg 1040ns, patched avg 1320ns (27% increase)
o Xeon Gold 6248: base avg 270ns,  patched avg 273ns  (~1% increase)

I don't get that last one; I did spend a ridiculous amount of time making sure
the flush was being executed, and AFAICT yes, it was. What I take away from this
is that the entry overhead increase can be pretty massive (for NOHZ_FULL CPUs),
and that's something I want to hear thoughts on.
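
For reference, the measurement boils down to timing a tight getpid loop pinned
to an isolated CPU (e.g. via taskset); a minimal sketch of such a benchmark
(not the exact harness used for the numbers above) looks like:

  #include <stdio.h>
  #include <sys/syscall.h>
  #include <time.h>
  #include <unistd.h>

  #define ITERS 10000000UL

  int main(void)
  {
          struct timespec start, end;
          unsigned long i;

          clock_gettime(CLOCK_MONOTONIC, &start);
          for (i = 0; i < ITERS; i++)
                  syscall(SYS_getpid); /* force a real user->kernel transition */
          clock_gettime(CLOCK_MONOTONIC, &end);

          printf("avg syscall duration: %.1f ns\n",
                 ((end.tv_sec - start.tv_sec) * 1e9 +
                  (end.tv_nsec - start.tv_nsec)) / ITERS);
          return 0;
  }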

Noise
+++++

Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL10 userspace.

Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:

$ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
	           -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
		   rteval --onlyload --loads-cpulist=$HK_CPUS \
		   --hackbench-runlowmem=True --duration=$DURATION

This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 6 hours.

v6.17
o ~5400 IPIs received, so about 200 interfering IPIs per isolated CPU
o About one interfering IPI just shy of every 2 minutes

v6.17 + patches
o Zilch!

Patches
=======

o Patches 1-2 are standalone objtool cleanups.

o Patches 3-4 add an RCU testing feature.

o Patches 5-6 add infrastructure for annotating static keys and static calls
  that may be used in noinstr code (courtesy of Josh).
o Patches 7-21 use said annotations on relevant keys / calls.
o Patch 22 enforces proper usage of said annotations (courtesy of Josh).

o Patch 23 deals with detecting NOINSTR text in modules.

o Patches 24-25 deal with kernel text sync IPIs.

o Patch 26 adds ASM support for static keys.

o Patches 27-31 deal with kernel range TLB flush IPIs.

Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v7

Acknowledgements
================

Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o Dave Hansen for patiently educating me about mm
o All of the folks who attended various (too many?) talks about this and
  provided precious feedback.  

Links
=====

[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: http://lore.kernel.org/r/eef09bdc-7546-462b-9ac0-661a44d2ceae@intel.com
[6]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/

Revisions
=========

v6 -> v7
++++++++

o Rebased onto latest v6.18-rc5 (6fa9041b7177f)
o Collected Acks (Sean, Frederic)

o Fixed <asm/context_tracking_work.h> include (Shrikanth)
o Fixed ct_set_cpu_work() CT_RCU_WATCHING logic (Frederic)

o Wrote more verbose comments about NOINSTR static keys and calls (Petr)

o [NEW PATCH] Instrumented one more static key: cpu_buf_vm_clear
o [NEW PATCH] Added ASM-accessible static key helpers to gate NO_HZ_FULL logic
  in early entry code (Frederic)

v5 -> v6
++++++++

o Rebased onto v6.17
o Small conflict fixes due to the cpu_buf_idle_clear / smp_text_poke() renamings

o Added the TLB flush craziness

v4 -> v5
++++++++

o Rebased onto v6.15-rc3
o Collected Reviewed-by

o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such as
  KVM early entry (Sean Christopherson)

o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or
  CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI
  entry from idle (thanks to Frederic!)

o Ditched the vmap TLB flush deferral (for now)  
  

RFCv3 -> v4
+++++++++++

o Rebased onto v6.13-rc6

o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)

o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups

o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ

RFCv2 -> RFCv3
++++++++++++++

o Rebased onto v6.12-rc6

o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral


RFCv1 -> RFCv2
++++++++++++++

o Rebased onto v6.5-rc1

o Updated the trace filter patches (Steven)

o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
  
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
  rcutorture case for a low-size counter (Paul) 

o Fixed flush_tlb_kernel_range_deferrable() definition

Josh Poimboeuf (3):
  jump_label: Add annotations for validating noinstr usage
  static_call: Add read-only-after-init static calls
  objtool: Add noinstr validation for static branches/calls

Valentin Schneider (28):
  objtool: Make validate_call() recognize indirect calls to pv_ops[]
  objtool: Flesh out warning related to pv_ops[] calls
  rcu: Add a small-width RCU watching counter debug option
  rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
  x86/idle: Mark x86_idle static call as __ro_after_init
  x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
  riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
  loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
  perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
  sched/clock: Mark sched_clock_running key as __ro_after_init
  KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
  x86/bugs: Mark cpu_buf_vm_clear key as allowed in .noinstr
  x86/speculation/mds: Mark cpu_buf_idle_clear key as allowed in
    .noinstr
  sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
  KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as
    allowed in .noinstr
  stackleak: Mark stack_erasing_bypass key as allowed in .noinstr
  module: Add MOD_NOINSTR_TEXT mem_type
  context-tracking: Introduce work deferral infrastructure
  context_tracking,x86: Defer kernel text patching IPIs
  x86/jump_label: Add ASM support for static_branch_likely()
  x86/mm: Make INVPCID type macros available to assembly
  x86/mm/pti: Introduce a kernel/user CR3 software signal
  x86/mm/pti: Implement a TLB flush immediately after a switch to kernel
    CR3
  x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under
    CONFIG_COALESCE_TLBI=y
  x86/entry: Add an option to coalesce TLB flushes

 arch/Kconfig                                  |   9 ++
 arch/arm/kernel/paravirt.c                    |   2 +-
 arch/arm64/kernel/paravirt.c                  |   2 +-
 arch/loongarch/kernel/paravirt.c              |   2 +-
 arch/riscv/kernel/paravirt.c                  |   2 +-
 arch/x86/Kconfig                              |  18 +++
 arch/x86/entry/calling.h                      |  40 +++++++
 arch/x86/entry/syscall_64.c                   |   4 +
 arch/x86/events/amd/brs.c                     |   2 +-
 arch/x86/include/asm/context_tracking_work.h  |  18 +++
 arch/x86/include/asm/invpcid.h                |  14 ++-
 arch/x86/include/asm/jump_label.h             |  33 +++++-
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/include/asm/tlbflush.h               |   6 +
 arch/x86/kernel/alternative.c                 |  39 ++++++-
 arch/x86/kernel/asm-offsets.c                 |   1 +
 arch/x86/kernel/cpu/bugs.c                    |  14 ++-
 arch/x86/kernel/kprobes/core.c                |   4 +-
 arch/x86/kernel/kprobes/opt.c                 |   4 +-
 arch/x86/kernel/module.c                      |   2 +-
 arch/x86/kernel/paravirt.c                    |   4 +-
 arch/x86/kernel/process.c                     |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  11 +-
 arch/x86/kvm/vmx/vmx_onhyperv.c               |   2 +-
 arch/x86/mm/tlb.c                             |  34 ++++--
 include/asm-generic/sections.h                |  15 +++
 include/linux/context_tracking.h              |  21 ++++
 include/linux/context_tracking_state.h        |  54 +++++++--
 include/linux/context_tracking_work.h         |  24 ++++
 include/linux/jump_label.h                    |  30 ++++-
 include/linux/module.h                        |   6 +-
 include/linux/objtool.h                       |  14 +++
 include/linux/static_call.h                   |  19 ++++
 kernel/context_tracking.c                     |  72 +++++++++++-
 kernel/kprobes.c                              |   8 +-
 kernel/kstack_erase.c                         |   6 +-
 kernel/module/main.c                          |  76 ++++++++++---
 kernel/rcu/Kconfig.debug                      |  15 +++
 kernel/sched/clock.c                          |   7 +-
 kernel/time/Kconfig                           |   5 +
 mm/vmalloc.c                                  |  34 +++++-
 tools/objtool/Documentation/objtool.txt       |  34 ++++++
 tools/objtool/check.c                         | 106 +++++++++++++++---
 tools/objtool/include/objtool/check.h         |   1 +
 tools/objtool/include/objtool/elf.h           |   1 +
 tools/objtool/include/objtool/special.h       |   1 +
 tools/objtool/special.c                       |  15 ++-
 .../selftests/rcutorture/configs/rcu/TREE04   |   1 +
 48 files changed, 736 insertions(+), 99 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

--
2.51.0



* [PATCH v7 01/31] objtool: Make validate_call() recognize indirect calls to pv_ops[]
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 02/31] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
                   ` (30 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

call_dest_name() does not get passed the file pointer of validate_call(),
which means its invocation of insn_reloc() will always return NULL. Make it
take a file pointer.

While at it, make sure call_dest_name() uses arch_dest_reloc_offset(),
otherwise it gets the pv_ops[] offset wrong.

Fabricating an intentional warning shows the change; previously:

  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section

now:

  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[1]() leaves .noinstr.text section

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 tools/objtool/check.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 9004fbc067693..12b6967e5fd0d 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3325,7 +3325,7 @@ static inline bool func_uaccess_safe(struct symbol *func)
 	return false;
 }
 
-static inline const char *call_dest_name(struct instruction *insn)
+static inline const char *call_dest_name(struct objtool_file *file, struct instruction *insn)
 {
 	static char pvname[19];
 	struct reloc *reloc;
@@ -3334,9 +3334,9 @@ static inline const char *call_dest_name(struct instruction *insn)
 	if (insn_call_dest(insn))
 		return insn_call_dest(insn)->name;
 
-	reloc = insn_reloc(NULL, insn);
+	reloc = insn_reloc(file, insn);
 	if (reloc && !strcmp(reloc->sym->name, "pv_ops")) {
-		idx = (reloc_addend(reloc) / sizeof(void *));
+		idx = (arch_dest_reloc_offset(reloc_addend(reloc)) / sizeof(void *));
 		snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx);
 		return pvname;
 	}
@@ -3415,17 +3415,19 @@ static int validate_call(struct objtool_file *file,
 {
 	if (state->noinstr && state->instr <= 0 &&
 	    !noinstr_call_dest(file, insn, insn_call_dest(insn))) {
-		WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn));
 		return 1;
 	}
 
 	if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) {
-		WARN_INSN(insn, "call to %s() with UACCESS enabled", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() with UACCESS enabled",
+			  call_dest_name(file, insn));
 		return 1;
 	}
 
 	if (state->df) {
-		WARN_INSN(insn, "call to %s() with DF set", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() with DF set",
+			  call_dest_name(file, insn));
 		return 1;
 	}
 
-- 
2.51.0



* [PATCH v7 02/31] objtool: Flesh out warning related to pv_ops[] calls
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 01/31] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 03/31] rcu: Add a small-width RCU watching counter debug option Valentin Schneider
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

I had to look into objtool itself to understand what this warning was
about; make it more explicit.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 tools/objtool/check.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 12b6967e5fd0d..1efa9f1bf16ba 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3363,7 +3363,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
 
 	list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
 		if (!target->sec->noinstr) {
-			WARN("pv_ops[%d]: %s", idx, target->name);
+			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
 			file->pv_ops[idx].clean = false;
 		}
 	}
-- 
2.51.0



* [PATCH v7 03/31] rcu: Add a small-width RCU watching counter debug option
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 01/31] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 02/31] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 04/31] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE Valentin Schneider
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Paul E. McKenney, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

A later commit will reduce the size of the RCU watching counter to free up
some bits for another purpose. Paul suggested adding a config option to
test the extreme case where the counter is reduced to its minimum usable
width for rcutorture to poke at, so do that.

Make it only configurable under RCU_EXPERT. While at it, add a comment to
explain the layout of context_tracking->state.

Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/context_tracking_state.h | 44 ++++++++++++++++++++++----
 kernel/rcu/Kconfig.debug               | 15 +++++++++
 2 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 7b8433d5a8efe..0b81248aa03e2 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -18,12 +18,6 @@ enum ctx_state {
 	CT_STATE_MAX		= 4,
 };
 
-/* Odd value for watching, else even. */
-#define CT_RCU_WATCHING CT_STATE_MAX
-
-#define CT_STATE_MASK (CT_STATE_MAX - 1)
-#define CT_RCU_WATCHING_MASK (~CT_STATE_MASK)
-
 struct context_tracking {
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 	/*
@@ -44,9 +38,45 @@ struct context_tracking {
 #endif
 };
 
+/*
+ * We cram two different things within the same atomic variable:
+ *
+ *                     CT_RCU_WATCHING_START  CT_STATE_START
+ *                                |                |
+ *                                v                v
+ *     MSB [ RCU watching counter ][ context_state ] LSB
+ *         ^                       ^
+ *         |                       |
+ * CT_RCU_WATCHING_END        CT_STATE_END
+ *
+ * Bits are used from the LSB upwards, so unused bits (if any) will always be in
+ * upper bits of the variable.
+ */
 #ifdef CONFIG_CONTEXT_TRACKING
+#define CT_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
+
+#define CT_STATE_WIDTH bits_per(CT_STATE_MAX - 1)
+#define CT_STATE_START 0
+#define CT_STATE_END   (CT_STATE_START + CT_STATE_WIDTH - 1)
+
+#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH)
+#define CT_RCU_WATCHING_WIDTH     (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH)
+#define CT_RCU_WATCHING_START     (CT_STATE_END + 1)
+#define CT_RCU_WATCHING_END       (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1)
+#define CT_RCU_WATCHING           BIT(CT_RCU_WATCHING_START)
+
+#define CT_STATE_MASK        GENMASK(CT_STATE_END,        CT_STATE_START)
+#define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
+
+#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
+
+static_assert(CT_STATE_WIDTH        +
+	      CT_RCU_WATCHING_WIDTH +
+	      CT_UNUSED_WIDTH       ==
+	      CT_SIZE);
+
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
-#endif
+#endif	/* CONFIG_CONTEXT_TRACKING */
 
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 static __always_inline int __ct_state(void)
diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 12e4c64ebae15..625d75392647b 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -213,4 +213,19 @@ config RCU_STRICT_GRACE_PERIOD
 	  when looking for certain types of RCU usage bugs, for example,
 	  too-short RCU read-side critical sections.
 
+
+config RCU_DYNTICKS_TORTURE
+	bool "Minimize RCU dynticks counter size"
+	depends on RCU_EXPERT && !COMPILE_TEST
+	default n
+	help
+	  This option sets the width of the dynticks counter to its
+	  minimum usable value.  This minimum width greatly increases
+	  the probability of flushing out bugs involving counter wrap,
+	  but it also increases the probability of extending grace period
+	  durations.  This Kconfig option should therefore be avoided in
+	  production due to the consequent increased probability of OOMs.
+
+	  This has no value for production and is only for testing.
+
 endmenu # "RCU Debugging"
-- 
2.51.0



* [PATCH v7 04/31] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (2 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 03/31] rcu: Add a small-width RCU watching counter debug option Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 05/31] jump_label: Add annotations for validating noinstr usage Valentin Schneider
                   ` (27 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Paul E. McKenney, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

We now have an RCU_EXPERT config for testing a small-sized RCU dynticks
counter: CONFIG_RCU_DYNTICKS_TORTURE.

Modify scenario TREE04 to use this config in order to exercise a
ridiculously small counter (2 bits).

Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
 tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3ad..67caf4276bb01 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
 CONFIG_RCU_EQS_DEBUG=y
 CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_TORTURE=y
-- 
2.51.0



* [PATCH v7 05/31] jump_label: Add annotations for validating noinstr usage
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (3 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 04/31] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 06/31] static_call: Add read-only-after-init static calls Valentin Schneider
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

From: Josh Poimboeuf <jpoimboe@kernel.org>

Deferring a code patching IPI is unsafe if the patched code is in a
noinstr region.  In that case the text poke code must trigger an
immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ
CPU running in userspace.

Some noinstr static branches may really need to be patched at runtime,
despite the resulting disruption.  Add DEFINE_STATIC_KEY_*_NOINSTR()
variants for those.  They don't do anything special yet; that will come
later.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/jump_label.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index fdb79dd1ebd8c..c4f6240ff4d95 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -388,6 +388,23 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_FALSE_RO(name)	\
 	struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT
 
+/*
+ * The _NOINSTR variants are used to tell objtool the static key is allowed to
+ * be used in noinstr code.
+ *
+ * They should almost never be used, as they prevent code patching IPIs from
+ * being deferred, which can be problematic for isolated NOHZ_FULL CPUs running
+ * in pure userspace.
+ *
+ * If using one of these _NOINSTR variants, please add a comment above the
+ * definition with the rationale.
+ */
+#define DEFINE_STATIC_KEY_TRUE_NOINSTR(name)					\
+	DEFINE_STATIC_KEY_TRUE(name)
+
+#define DEFINE_STATIC_KEY_FALSE_NOINSTR(name)					\
+	DEFINE_STATIC_KEY_FALSE(name)
+
 #define DECLARE_STATIC_KEY_FALSE(name)	\
 	extern struct static_key_false name
 
-- 
2.51.0



* [PATCH v7 06/31] static_call: Add read-only-after-init static calls
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (4 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 05/31] jump_label: Add annotations for validating noinstr usage Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 07/31] x86/paravirt: Mark pv_sched_clock static call as __ro_after_init Valentin Schneider
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

From: Josh Poimboeuf <jpoimboe@kernel.org>

Deferring a code patching IPI is unsafe if the patched code is in a
noinstr region.  In that case the text poke code must trigger an
immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ
CPU running in userspace.

If a noinstr static call only needs to be patched during boot, its key
can be made ro-after-init to ensure it will never be patched at runtime.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/static_call.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 78a77a4ae0ea8..ea6ca57e2a829 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -192,6 +192,14 @@ extern long __static_call_return0(void);
 	};								\
 	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
 
+#define DEFINE_STATIC_CALL_RO(name, _func)				\
+	DECLARE_STATIC_CALL(name, _func);				\
+	struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
+		.func = _func,						\
+		.type = 1,						\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
+
 #define DEFINE_STATIC_CALL_NULL(name, _func)				\
 	DECLARE_STATIC_CALL(name, _func);				\
 	struct static_call_key STATIC_CALL_KEY(name) = {		\
@@ -200,6 +208,14 @@ extern long __static_call_return0(void);
 	};								\
 	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
 
+#define DEFINE_STATIC_CALL_NULL_RO(name, _func)				\
+	DECLARE_STATIC_CALL(name, _func);				\
+	struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
+		.func = NULL,						\
+		.type = 1,						\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
+
 #define DEFINE_STATIC_CALL_RET0(name, _func)				\
 	DECLARE_STATIC_CALL(name, _func);				\
 	struct static_call_key STATIC_CALL_KEY(name) = {		\
-- 
2.51.0



* [PATCH v7 07/31] x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (5 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 06/31] static_call: Add read-only-after-init static calls Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 08/31] x86/idle: Mark x86_idle " Valentin Schneider
                   ` (24 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

Later commits will cause objtool to warn about static calls being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

pv_sched_clock is updated in:
o __init vmware_paravirt_ops_setup()
o __init xen_init_time_common()
o kvm_sched_clock_init() <- __init kvmclock_init()
o hv_setup_sched_clock() <- __init hv_init_tsc_clocksource()

IOW purely init context, and can thus be marked as __ro_after_init.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ab3e172dcc693..34b6fa3fcc045 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -69,7 +69,7 @@ static u64 native_steal_clock(int cpu)
 }
 
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
-DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
+DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
 
 void paravirt_set_sched_clock(u64 (*func)(void))
 {
-- 
2.51.0



* [PATCH v7 08/31] x86/idle: Mark x86_idle static call as __ro_after_init
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (6 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 07/31] x86/paravirt: Mark pv_sched_clock static call as __ro_after_init Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 09/31] x86/paravirt: Mark pv_steal_clock " Valentin Schneider
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

Later commits will cause objtool to warn about static calls being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

x86_idle is updated in:
o xen_set_default_idle() <- __init xen_arch_setup()
o __init select_idle_routine()

IOW purely init context, and can thus be marked as __ro_after_init.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/process.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 4c718f8adc592..4f0c1868d43d9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -771,7 +771,7 @@ void __cpuidle default_idle(void)
 EXPORT_SYMBOL(default_idle);
 #endif
 
-DEFINE_STATIC_CALL_NULL(x86_idle, default_idle);
+DEFINE_STATIC_CALL_NULL_RO(x86_idle, default_idle);
 
 static bool x86_idle_set(void)
 {
-- 
2.51.0



* [PATCH v7 09/31] x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (7 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 08/31] x86/idle: Mark x86_idle " Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 10/31] riscv/paravirt: " Valentin Schneider
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

The static call is only ever updated in

  __init pv_time_init()
  __init xen_init_time_common()
  __init vmware_paravirt_ops_setup()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 34b6fa3fcc045..f320a9617b1d6 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -68,7 +68,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
 
 void paravirt_set_sched_clock(u64 (*func)(void))
-- 
2.51.0



* [PATCH v7 10/31] riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (8 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 09/31] x86/paravirt: Mark pv_steal_clock " Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 11/31] loongarch/paravirt: " Valentin Schneider
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Andrew Jones, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

The static call is only ever updated in:

  __init pv_time_init()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
---
 arch/riscv/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/kernel/paravirt.c b/arch/riscv/kernel/paravirt.c
index fa6b0339a65de..dfe8808016fd8 100644
--- a/arch/riscv/kernel/paravirt.c
+++ b/arch/riscv/kernel/paravirt.c
@@ -30,7 +30,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 static bool steal_acc = true;
 static int __init parse_no_stealacc(char *arg)
-- 
2.51.0



* [PATCH v7 11/31] loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (9 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 10/31] riscv/paravirt: " Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 12/31] arm64/paravirt: " Valentin Schneider
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

The static call is only ever updated in

  __init pv_time_init()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/loongarch/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c
index b1b51f920b231..9ec3f5c31fdab 100644
--- a/arch/loongarch/kernel/paravirt.c
+++ b/arch/loongarch/kernel/paravirt.c
@@ -19,7 +19,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 static bool steal_acc = true;
 
-- 
2.51.0



* [PATCH v7 12/31] arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (10 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 11/31] loongarch/paravirt: " Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 13/31] arm/paravirt: " Valentin Schneider
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

The static call is only ever updated in

  __init pv_time_init()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/arm64/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c
index aa718d6a9274a..ad28fa23c9228 100644
--- a/arch/arm64/kernel/paravirt.c
+++ b/arch/arm64/kernel/paravirt.c
@@ -32,7 +32,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 struct pv_time_stolen_time_region {
 	struct pvclock_vcpu_stolen_time __rcu *kaddr;
-- 
2.51.0



* [PATCH v7 13/31] arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (11 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 12/31] arm64/paravirt: " Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 14/31] perf/x86/amd: Mark perf_lopwr_cb " Valentin Schneider
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

The static call is only ever updated in

  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/arm/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kernel/paravirt.c b/arch/arm/kernel/paravirt.c
index 7dd9806369fb0..632d8d5e06db3 100644
--- a/arch/arm/kernel/paravirt.c
+++ b/arch/arm/kernel/paravirt.c
@@ -20,4 +20,4 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
-- 
2.51.0



* [PATCH v7 14/31] perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (12 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 13/31] arm/paravirt: " Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 15/31] sched/clock: Mark sched_clock_running key " Valentin Schneider
                   ` (17 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

Later commits will cause objtool to warn about static calls being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

perf_lopwr_cb is used in .noinstr code, but is only ever updated in __init
amd_brs_lopwr_init(), and can thus be marked as __ro_after_init.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/events/amd/brs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
index 06f35a6b58a5b..71d7ba774a063 100644
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -423,7 +423,7 @@ void noinstr perf_amd_brs_lopwr_cb(bool lopwr_in)
 	}
 }
 
-DEFINE_STATIC_CALL_NULL(perf_lopwr_cb, perf_amd_brs_lopwr_cb);
+DEFINE_STATIC_CALL_NULL_RO(perf_lopwr_cb, perf_amd_brs_lopwr_cb);
 EXPORT_STATIC_CALL_TRAMP_GPL(perf_lopwr_cb);
 
 void __init amd_brs_lopwr_init(void)
-- 
2.51.0



* [PATCH v7 15/31] sched/clock: Mark sched_clock_running key as __ro_after_init
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (13 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 14/31] perf/x86/amd: Mark perf_lopwr_cb " Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 16/31] KVM: VMX: Mark __kvm_is_using_evmcs static " Valentin Schneider
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

sched_clock_running is only ever enabled in the __init functions
sched_clock_init() and sched_clock_init_late(), and is never disabled. Mark
it __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/clock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index f5e6dd6a6b3af..c1a028e99d2cd 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -69,7 +69,7 @@ notrace unsigned long long __weak sched_clock(void)
 }
 EXPORT_SYMBOL_GPL(sched_clock);
 
-static DEFINE_STATIC_KEY_FALSE(sched_clock_running);
+static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 /*
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 16/31] KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (14 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 15/31] sched/clock: Mark sched_clock_running key " Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 17/31] x86/bugs: Mark cpu_buf_vm_clear key as allowed in .noinstr Valentin Schneider
                   ` (15 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Lutomirski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik, Shrikanth Hegde

The static key is only ever enabled in

  __init hv_init_evmcs()

so mark it appropriately as __ro_after_init.

Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/vmx_onhyperv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx_onhyperv.c b/arch/x86/kvm/vmx/vmx_onhyperv.c
index b9a8b91166d02..ff3d80c9565bb 100644
--- a/arch/x86/kvm/vmx/vmx_onhyperv.c
+++ b/arch/x86/kvm/vmx/vmx_onhyperv.c
@@ -3,7 +3,7 @@
 #include "capabilities.h"
 #include "vmx_onhyperv.h"
 
-DEFINE_STATIC_KEY_FALSE(__kvm_is_using_evmcs);
+DEFINE_STATIC_KEY_FALSE_RO(__kvm_is_using_evmcs);
 
 /*
  * KVM on Hyper-V always uses the latest known eVMCSv1 revision, the assumption
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 17/31] x86/bugs: Mark cpu_buf_vm_clear key as allowed in .noinstr
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (15 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 16/31] KVM: VMX: Mark __kvm_is_using_evmcs static " Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:01 ` [PATCH v7 18/31] x86/speculation/mds: Mark cpu_buf_idle_clear " Valentin Schneider
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

The static key is only ever updated in

  __init mmio_apply_mitigation()

Mark it to let objtool know not to warn about it.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/cpu/bugs.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index d7fa03bf51b45..a2d7dc4d2a4a3 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -196,8 +196,11 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
  * Controls CPU Fill buffer clear before VMenter. This is a subset of
  * X86_FEATURE_CLEAR_CPU_BUF, and should only be enabled when KVM-only
  * mitigation is required.
+ *
+ * NOINSTR: This static key is only updated during init, so it can't be
+ * a cause of post-init interference.
  */
-DEFINE_STATIC_KEY_FALSE(cpu_buf_vm_clear);
+DEFINE_STATIC_KEY_FALSE_NOINSTR(cpu_buf_vm_clear);
 EXPORT_SYMBOL_GPL(cpu_buf_vm_clear);
 
 #undef pr_fmt
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 18/31] x86/speculation/mds: Mark cpu_buf_idle_clear key as allowed in .noinstr
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (16 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 17/31] x86/bugs: Mark cpu_buf_vm_clear key as allowed in .noinstr Valentin Schneider
@ 2025-11-14 15:01 ` Valentin Schneider
  2025-11-14 15:10 ` [PATCH v7 19/31] sched/clock, x86: Mark __sched_clock_stable " Valentin Schneider
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

cpu_buf_idle_clear is used in .noinstr code, and can be modified at
runtime (SMT hotplug). Suppressing the text_poke_sync() IPI has little
benefit for this key, as hotplug implies eventually going through
takedown_cpu() -> stop_machine_cpuslocked() which is going to cause
interference on all online CPUs anyway.

Mark it to let objtool know not to warn about it.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/cpu/bugs.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index a2d7dc4d2a4a3..39f6a2f9593f8 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -181,8 +181,13 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
 DEFINE_STATIC_KEY_FALSE(switch_vcpu_ibpb);
 EXPORT_SYMBOL_GPL(switch_vcpu_ibpb);
 
-/* Control CPU buffer clear before idling (halt, mwait) */
-DEFINE_STATIC_KEY_FALSE(cpu_buf_idle_clear);
+/*
+ * Control CPU buffer clear before idling (halt, mwait)
+ *
+ * NOINSTR: This static key is updated during SMT hotplug which itself already
+ * causes some interference.
+ */
+DEFINE_STATIC_KEY_FALSE_NOINSTR(cpu_buf_idle_clear);
 EXPORT_SYMBOL_GPL(cpu_buf_idle_clear);
 
 /*
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 19/31] sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (17 preceding siblings ...)
  2025-11-14 15:01 ` [PATCH v7 18/31] x86/speculation/mds: Mark cpu_buf_idle_clear " Valentin Schneider
@ 2025-11-14 15:10 ` Valentin Schneider
  2025-11-14 15:10 ` [PATCH v7 20/31] KVM: VMX: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys " Valentin Schneider
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:10 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

__sched_clock_stable is used in .noinstr code, and can be modified at
runtime (e.g. time_cpufreq_notifier()). Suppressing the text_poke_sync()
IPI has little benefit for this key, as NOHZ_FULL is incompatible with an
unstable TSC anyway.

Mark it to let objtool know not to warn about it.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/clock.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index c1a028e99d2cd..5ee27349a4811 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -78,8 +78,11 @@ static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
  *
  * Similarly we start with __sched_clock_stable_early, thereby assuming we
  * will become stable, such that there's only a single 1 -> 0 transition.
+ *
+ * NOINSTR: an unstable TSC is incompatible with NOHZ_FULL, thus the text
+ * patching IPI would be the least of our concerns.
  */
-static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(__sched_clock_stable);
 static int __sched_clock_stable_early = 1;
 
 /*
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 20/31] KVM: VMX: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys as allowed in .noinstr
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (18 preceding siblings ...)
  2025-11-14 15:10 ` [PATCH v7 19/31] sched/clock, x86: Mark __sched_clock_stable " Valentin Schneider
@ 2025-11-14 15:10 ` Valentin Schneider
  2025-11-14 15:14 ` [PATCH v7 21/31] stackleak: Mark stack_erasing_bypass key " Valentin Schneider
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:10 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Lutomirski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

The VMX keys are used in .noinstr code, and can be modified at runtime
(/proc/kernel/vmx* write). However it is not expected that they will be
flipped during latency-sensitive operations, and thus shouldn't be a source
of interference for NOHZ_FULL CPUs wrt the text patching IPI.

Note, smp_text_poke_batch_finish() never defers IPIs if noinstr code is
being patched, i.e. this is purely to tell objtool we're okay with updates
to that key causing IPIs and to silence the associated objtool warning.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 91b6f2f3edc2a..99936a2af6641 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -203,8 +203,15 @@ module_param(pt_mode, int, S_IRUGO);
 
 struct x86_pmu_lbr __ro_after_init vmx_lbr_caps;
 
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
+/*
+ * NOINSTR: Both of these static keys end up being used in .noinstr sections,
+ * however they are only modified:
+ * - at init
+ * - from a /proc/kernel/vmx* write
+ * thus during latency-sensitive operations they should remain stable.
+ */
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_should_flush);
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_flush_cond);
 static DEFINE_MUTEX(vmx_l1d_flush_mutex);
 
 /* Storage for pre module init parameter parsing */
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 21/31] stackleak: Mark stack_erasing_bypass key as allowed in .noinstr
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (19 preceding siblings ...)
  2025-11-14 15:10 ` [PATCH v7 20/31] KVM: VMX: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys " Valentin Schneider
@ 2025-11-14 15:14 ` Valentin Schneider
  2025-11-14 15:14 ` [PATCH v7 22/31] objtool: Add noinstr validation for static branches/calls Valentin Schneider
                   ` (10 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

stack_erasing_bypass is used in .noinstr code, and can be modified at runtime
(/proc/sys/kernel/stack_erasing write). However it is not expected that it
will be flipped during latency-sensitive operations, and thus shouldn't be
a source of interference wrt the text patching IPI.

Mark it to let objtool know not to warn about it.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/kstack_erase.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/kstack_erase.c b/kernel/kstack_erase.c
index e49bb88b4f0a3..99ba44cf939bf 100644
--- a/kernel/kstack_erase.c
+++ b/kernel/kstack_erase.c
@@ -19,7 +19,11 @@
 #include <linux/sysctl.h>
 #include <linux/init.h>
 
-static DEFINE_STATIC_KEY_FALSE(stack_erasing_bypass);
+/*
+ * NOINSTR: This static key can only be modified via its sysctl interface. It is
+ * expected it will remain stable during latency-sensitive operations.
+ */
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(stack_erasing_bypass);
 
 #ifdef CONFIG_SYSCTL
 static int stack_erasing_sysctl(const struct ctl_table *table, int write,
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 22/31] objtool: Add noinstr validation for static branches/calls
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (20 preceding siblings ...)
  2025-11-14 15:14 ` [PATCH v7 21/31] stackleak: Mark stack_erasing_bypass key " Valentin Schneider
@ 2025-11-14 15:14 ` Valentin Schneider
  2025-11-14 15:14 ` [PATCH v7 23/31] module: Add MOD_NOINSTR_TEXT mem_type Valentin Schneider
                   ` (9 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

From: Josh Poimboeuf <jpoimboe@kernel.org>

Warn about static branches/calls in noinstr regions, unless the
corresponding key is RO-after-init or has been manually whitelisted with
DEFINE_STATIC_KEY_*_NOINSTR().
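
For reference, whitelisting a key on the definition side boils down to the
pattern below; key name and comment text are illustrative, not taken from
this patch:

  /*
   * NOINSTR: <explain why runtime flips of this key are an acceptable
   * source of text patching IPIs for isolated NOHZ_FULL CPUs>
   */
  static DEFINE_STATIC_KEY_FALSE_NOINSTR(my_key);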

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/jump_label.h              | 17 +++--
 include/linux/objtool.h                 | 14 ++++
 include/linux/static_call.h             |  3 +
 tools/objtool/Documentation/objtool.txt | 34 +++++++++
 tools/objtool/check.c                   | 92 ++++++++++++++++++++++---
 tools/objtool/include/objtool/check.h   |  1 +
 tools/objtool/include/objtool/elf.h     |  1 +
 tools/objtool/include/objtool/special.h |  1 +
 tools/objtool/special.c                 | 15 +++-
 9 files changed, 162 insertions(+), 16 deletions(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index c4f6240ff4d95..0ea203ebbc493 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -76,6 +76,7 @@
 #include <linux/types.h>
 #include <linux/compiler.h>
 #include <linux/cleanup.h>
+#include <linux/objtool.h>
 
 extern bool static_key_initialized;
 
@@ -376,8 +377,9 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_TRUE(name)	\
 	struct static_key_true name = STATIC_KEY_TRUE_INIT
 
-#define DEFINE_STATIC_KEY_TRUE_RO(name)	\
-	struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT
+#define DEFINE_STATIC_KEY_TRUE_RO(name)						\
+	struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT;	\
+	ANNOTATE_NOINSTR_ALLOWED(name)
 
 #define DECLARE_STATIC_KEY_TRUE(name)	\
 	extern struct static_key_true name
@@ -385,8 +387,9 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_FALSE(name)	\
 	struct static_key_false name = STATIC_KEY_FALSE_INIT
 
-#define DEFINE_STATIC_KEY_FALSE_RO(name)	\
-	struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT
+#define DEFINE_STATIC_KEY_FALSE_RO(name)					\
+	struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT;	\
+	ANNOTATE_NOINSTR_ALLOWED(name)
 
 /*
  * The _NOINSTR variants are used to tell objtool the static key is allowed to
@@ -400,10 +403,12 @@ struct static_key_false {
  * definition with the rationale.
  */
 #define DEFINE_STATIC_KEY_TRUE_NOINSTR(name)					\
-	DEFINE_STATIC_KEY_TRUE(name)
+	DEFINE_STATIC_KEY_TRUE(name);						\
+	ANNOTATE_NOINSTR_ALLOWED(name)
 
 #define DEFINE_STATIC_KEY_FALSE_NOINSTR(name)					\
-	DEFINE_STATIC_KEY_FALSE(name)
+	DEFINE_STATIC_KEY_FALSE(name);						\
+	ANNOTATE_NOINSTR_ALLOWED(name)
 
 #define DECLARE_STATIC_KEY_FALSE(name)	\
 	extern struct static_key_false name
diff --git a/include/linux/objtool.h b/include/linux/objtool.h
index 46ebaa46e6c58..78b73bf65d9a2 100644
--- a/include/linux/objtool.h
+++ b/include/linux/objtool.h
@@ -34,6 +34,19 @@
 	static void __used __section(".discard.func_stack_frame_non_standard") \
 		*__func_stack_frame_non_standard_##func = func
 
+#define __ANNOTATE_NOINSTR_ALLOWED(key) \
+	static void __used __section(".discard.noinstr_allowed") \
+		*__annotate_noinstr_allowed_##key = &key
+
+/*
+ * This is used to tell objtool that a given static key is safe to be used
+ * within .noinstr code, and it doesn't need to generate a warning about it.
+ *
+ * For more information, see tools/objtool/Documentation/objtool.txt,
+ * "non-RO static key usage in noinstr code"
+ */
+#define ANNOTATE_NOINSTR_ALLOWED(key) __ANNOTATE_NOINSTR_ALLOWED(key)
+
 /*
  * STACK_FRAME_NON_STANDARD_FP() is a frame-pointer-specific function ignore
  * for the case where a function is intentionally missing frame pointer setup,
@@ -130,6 +143,7 @@
 #define STACK_FRAME_NON_STANDARD_FP(func)
 #define __ASM_ANNOTATE(label, type) ""
 #define ASM_ANNOTATE(type)
+#define ANNOTATE_NOINSTR_ALLOWED(key)
 #else
 .macro UNWIND_HINT type:req sp_reg=0 sp_offset=0 signal=0
 .endm
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index ea6ca57e2a829..0d4b16d348501 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -133,6 +133,7 @@
 
 #include <linux/types.h>
 #include <linux/cpu.h>
+#include <linux/objtool.h>
 #include <linux/static_call_types.h>
 
 #ifdef CONFIG_HAVE_STATIC_CALL
@@ -198,6 +199,7 @@ extern long __static_call_return0(void);
 		.func = _func,						\
 		.type = 1,						\
 	};								\
+	ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name));		\
 	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
 
 #define DEFINE_STATIC_CALL_NULL(name, _func)				\
@@ -214,6 +216,7 @@ extern long __static_call_return0(void);
 		.func = NULL,						\
 		.type = 1,						\
 	};								\
+	ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name));		\
 	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
 
 #define DEFINE_STATIC_CALL_RET0(name, _func)				\
diff --git a/tools/objtool/Documentation/objtool.txt b/tools/objtool/Documentation/objtool.txt
index 9e97fc25b2d8a..991e085e10d95 100644
--- a/tools/objtool/Documentation/objtool.txt
+++ b/tools/objtool/Documentation/objtool.txt
@@ -456,6 +456,40 @@ the objtool maintainers.
     these special names and does not use module_init() / module_exit()
     macros to create them.
 
+13. file.o: warning: func()+0x2a: key: non-RO static key usage in noinstr code
+    file.o: warning: func()+0x2a: key: non-RO static call usage in noinstr code
+
+  This means that noinstr function func() uses a static key or
+  static call named 'key' which can be modified at runtime.  This is
+  discouraged because it prevents code patching IPIs from being
+  deferred.
+
+  You have the following options:
+
+  1) Check whether the static key/call in question is only modified
+     during init.  If so, define it as read-only-after-init with
+     DEFINE_STATIC_KEY_*_RO() or DEFINE_STATIC_CALL_RO().
+
+  2) Avoid the runtime patching.  For static keys this can be done by
+     using static_key_enabled() or by getting rid of the static key
+     altogether if performance is not a concern.
+
+     For static calls, something like the following could be done:
+
+       target = static_call_query(foo);
+       if (target == func1)
+	       func1();
+	else if (target == func2)
+		func2();
+	...
+
+  3) Silence the warning by defining the static key/call with
+     DEFINE_STATIC_*_NOINSTR().  This decision should not
+     be taken lightly as it may result in code patching IPIs getting
+     sent to isolated NOHZ_FULL CPUs running in pure userspace.  A
+     comment should be added above the definition explaining the
+     rationale for the decision.
+
 
 If the error doesn't seem to make sense, it could be a bug in objtool.
 Feel free to ask objtool maintainers for help.
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 1efa9f1bf16ba..474ba2fc87ac6 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -982,6 +982,45 @@ static int create_direct_call_sections(struct objtool_file *file)
 	return 0;
 }
 
+static int read_noinstr_allowed(struct objtool_file *file)
+{
+	struct section *rsec;
+	struct symbol *sym;
+	struct reloc *reloc;
+
+	rsec = find_section_by_name(file->elf, ".rela.discard.noinstr_allowed");
+	if (!rsec)
+		return 0;
+
+	for_each_reloc(rsec, reloc) {
+		switch (reloc->sym->type) {
+		case STT_OBJECT:
+		case STT_FUNC:
+			sym = reloc->sym;
+			break;
+
+		case STT_SECTION:
+			sym = find_symbol_by_offset(reloc->sym->sec,
+						    reloc_addend(reloc));
+			if (!sym) {
+				WARN_FUNC(reloc->sym->sec, reloc_addend(reloc),
+					  "can't find static key/call symbol");
+				return -1;
+			}
+			break;
+
+		default:
+			WARN("unexpected relocation symbol type in %s: %d",
+			     rsec->name, reloc->sym->type);
+			return -1;
+		}
+
+		sym->noinstr_allowed = 1;
+	}
+
+	return 0;
+}
+
 /*
  * Warnings shouldn't be reported for ignored functions.
  */
@@ -1868,6 +1907,8 @@ static int handle_jump_alt(struct objtool_file *file,
 		return -1;
 	}
 
+	orig_insn->key = special_alt->key;
+
 	if (opts.hack_jump_label && special_alt->key_addend & 2) {
 		struct reloc *reloc = insn_reloc(file, orig_insn);
 
@@ -2602,6 +2643,10 @@ static int decode_sections(struct objtool_file *file)
 	if (ret)
 		return ret;
 
+	ret = read_noinstr_allowed(file);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 
@@ -3371,9 +3416,9 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
 	return file->pv_ops[idx].clean;
 }
 
-static inline bool noinstr_call_dest(struct objtool_file *file,
-				     struct instruction *insn,
-				     struct symbol *func)
+static inline bool noinstr_call_allowed(struct objtool_file *file,
+					struct instruction *insn,
+					struct symbol *func)
 {
 	/*
 	 * We can't deal with indirect function calls at present;
@@ -3393,10 +3438,10 @@ static inline bool noinstr_call_dest(struct objtool_file *file,
 		return true;
 
 	/*
-	 * If the symbol is a static_call trampoline, we can't tell.
+	 * Only DEFINE_STATIC_CALL_*_RO allowed.
 	 */
 	if (func->static_call_tramp)
-		return true;
+		return func->noinstr_allowed;
 
 	/*
 	 * The __ubsan_handle_*() calls are like WARN(), they only happen when
@@ -3409,14 +3454,29 @@ static inline bool noinstr_call_dest(struct objtool_file *file,
 	return false;
 }
 
+static char *static_call_name(struct symbol *func)
+{
+	return func->name + strlen("__SCT__");
+}
+
 static int validate_call(struct objtool_file *file,
 			 struct instruction *insn,
 			 struct insn_state *state)
 {
-	if (state->noinstr && state->instr <= 0 &&
-	    !noinstr_call_dest(file, insn, insn_call_dest(insn))) {
-		WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn));
-		return 1;
+	if (state->noinstr && state->instr <= 0) {
+		struct symbol *dest = insn_call_dest(insn);
+
+		if (dest && dest->static_call_tramp) {
+			if (!dest->noinstr_allowed) {
+				WARN_INSN(insn, "%s: non-RO static call usage in noinstr",
+					  static_call_name(dest));
+			}
+
+		} else if (dest && !noinstr_call_allowed(file, insn, dest)) {
+			WARN_INSN(insn, "call to %s() leaves .noinstr.text section",
+				  call_dest_name(file, insn));
+			return 1;
+		}
 	}
 
 	if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) {
@@ -3481,6 +3541,17 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct
 	return 0;
 }
 
+static int validate_static_key(struct instruction *insn, struct insn_state *state)
+{
+	if (state->noinstr && state->instr <= 0 && !insn->key->noinstr_allowed) {
+		WARN_INSN(insn, "%s: non-RO static key usage in noinstr",
+			  insn->key->name);
+		return 1;
+	}
+
+	return 0;
+}
+
 static struct instruction *next_insn_to_validate(struct objtool_file *file,
 						 struct instruction *insn)
 {
@@ -3673,6 +3744,9 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
 		if (handle_insn_ops(insn, next_insn, &state))
 			return 1;
 
+		if (insn->key)
+			validate_static_key(insn, &state);
+
 		switch (insn->type) {
 
 		case INSN_RETURN:
diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h
index 00fb745e72339..d79b08f55bcbc 100644
--- a/tools/objtool/include/objtool/check.h
+++ b/tools/objtool/include/objtool/check.h
@@ -81,6 +81,7 @@ struct instruction {
 	struct symbol *sym;
 	struct stack_op *stack_ops;
 	struct cfi_state *cfi;
+	struct symbol *key;
 };
 
 static inline struct symbol *insn_func(struct instruction *insn)
diff --git a/tools/objtool/include/objtool/elf.h b/tools/objtool/include/objtool/elf.h
index df8434d3b7440..545deba57266a 100644
--- a/tools/objtool/include/objtool/elf.h
+++ b/tools/objtool/include/objtool/elf.h
@@ -71,6 +71,7 @@ struct symbol {
 	u8 frame_pointer     : 1;
 	u8 ignore	     : 1;
 	u8 nocfi             : 1;
+	u8 noinstr_allowed   : 1;
 	struct list_head pv_target;
 	struct reloc *relocs;
 	struct section *group_sec;
diff --git a/tools/objtool/include/objtool/special.h b/tools/objtool/include/objtool/special.h
index 72d09c0adf1a1..e84d704f3f20e 100644
--- a/tools/objtool/include/objtool/special.h
+++ b/tools/objtool/include/objtool/special.h
@@ -18,6 +18,7 @@ struct special_alt {
 	bool group;
 	bool jump_or_nop;
 	u8 key_addend;
+	struct symbol *key;
 
 	struct section *orig_sec;
 	unsigned long orig_off;
diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index c80fed8a840ee..d77f3fa4bbbc9 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -110,13 +110,26 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
 
 	if (entry->key) {
 		struct reloc *key_reloc;
+		struct symbol *key;
+		s64 key_addend;
 
 		key_reloc = find_reloc_by_dest(elf, sec, offset + entry->key);
 		if (!key_reloc) {
 			ERROR_FUNC(sec, offset + entry->key, "can't find key reloc");
 			return -1;
 		}
-		alt->key_addend = reloc_addend(key_reloc);
+
+		key = key_reloc->sym;
+		key_addend = reloc_addend(key_reloc);
+
+		if (key->type == STT_SECTION)
+			key = find_symbol_by_offset(key->sec, key_addend & ~3);
+
+		/* embedded keys not supported */
+		if (key) {
+			alt->key = key;
+			alt->key_addend = key_addend;
+		}
 	}
 
 	return 0;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 23/31] module: Add MOD_NOINSTR_TEXT mem_type
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (21 preceding siblings ...)
  2025-11-14 15:14 ` [PATCH v7 22/31] objtool: Add noinstr validation for static branches/calls Valentin Schneider
@ 2025-11-14 15:14 ` Valentin Schneider
  2025-11-14 15:14 ` [PATCH v7 24/31] context-tracking: Introduce work deferral infrastructure Valentin Schneider
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

As pointed out by Sean [1], is_kernel_noinstr_text() will return false for
an address contained within a module's .noinstr.text section. A later patch
will require checking whether a text address is noinstr, and such an
address can unfortunately belong to a module - KVM is one such module.

A module's .noinstr.text section is already tracked as of commit
  66e9b0717102 ("kprobes: Prevent probes in .noinstr.text section")
for kprobe blacklisting purposes, but via an ad-hoc mechanism.

Add a MOD_NOINSTR_TEXT mem_type, and reorganize __layout_sections() so that
it maps all the sections in a single invocation.

[1]: http://lore.kernel.org/r/Z4qQL89GZ_gk0vpu@google.com
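
For illustration, a later user can then combine the kernel and module checks
along these lines (the helper name is hypothetical, not part of this patch):

  static bool text_is_noinstr(unsigned long addr)
  {
          return is_kernel_noinstr_text(addr) ||
                 is_module_noinstr_text_address(addr);
  }
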
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/module.h |  6 ++--
 kernel/kprobes.c       |  8 ++---
 kernel/module/main.c   | 76 ++++++++++++++++++++++++++++++++----------
 3 files changed, 66 insertions(+), 24 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index e135cc79aceea..c0911973337c6 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -322,6 +322,7 @@ struct mod_tree_node {
 
 enum mod_mem_type {
 	MOD_TEXT = 0,
+	MOD_NOINSTR_TEXT,
 	MOD_DATA,
 	MOD_RODATA,
 	MOD_RO_AFTER_INIT,
@@ -492,8 +493,6 @@ struct module {
 	void __percpu *percpu;
 	unsigned int percpu_size;
 #endif
-	void *noinstr_text_start;
-	unsigned int noinstr_text_size;
 
 #ifdef CONFIG_TRACEPOINTS
 	unsigned int num_tracepoints;
@@ -622,12 +621,13 @@ static inline bool module_is_coming(struct module *mod)
         return mod->state == MODULE_STATE_COMING;
 }
 
-struct module *__module_text_address(unsigned long addr);
 struct module *__module_address(unsigned long addr);
+struct module *__module_text_address(unsigned long addr);
 bool is_module_address(unsigned long addr);
 bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr);
 bool is_module_percpu_address(unsigned long addr);
 bool is_module_text_address(unsigned long addr);
+bool is_module_noinstr_text_address(unsigned long addr);
 
 static inline bool within_module_mem_type(unsigned long addr,
 					  const struct module *mod,
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index ab8f9fc1f0d17..d60560dddec56 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2551,9 +2551,9 @@ static void add_module_kprobe_blacklist(struct module *mod)
 		kprobe_add_area_blacklist(start, end);
 	}
 
-	start = (unsigned long)mod->noinstr_text_start;
+	start = (unsigned long)mod->mem[MOD_NOINSTR_TEXT].base;
 	if (start) {
-		end = start + mod->noinstr_text_size;
+		end = start + mod->mem[MOD_NOINSTR_TEXT].size;
 		kprobe_add_area_blacklist(start, end);
 	}
 }
@@ -2574,9 +2574,9 @@ static void remove_module_kprobe_blacklist(struct module *mod)
 		kprobe_remove_area_blacklist(start, end);
 	}
 
-	start = (unsigned long)mod->noinstr_text_start;
+	start = (unsigned long)mod->mem[MOD_NOINSTR_TEXT].base;
 	if (start) {
-		end = start + mod->noinstr_text_size;
+		end = start + mod->mem[MOD_NOINSTR_TEXT].size;
 		kprobe_remove_area_blacklist(start, end);
 	}
 }
diff --git a/kernel/module/main.c b/kernel/module/main.c
index c66b261849362..1f5bfdbb956a7 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1653,7 +1653,17 @@ bool module_init_layout_section(const char *sname)
 	return module_init_section(sname);
 }
 
-static void __layout_sections(struct module *mod, struct load_info *info, bool is_init)
+static bool module_noinstr_layout_section(const char *sname)
+{
+	return strstarts(sname, ".noinstr");
+}
+
+static bool module_default_layout_section(const char *sname)
+{
+	return !module_init_layout_section(sname) && !module_noinstr_layout_section(sname);
+}
+
+static void __layout_sections(struct module *mod, struct load_info *info)
 {
 	unsigned int m, i;
 
@@ -1662,20 +1672,44 @@ static void __layout_sections(struct module *mod, struct load_info *info, bool i
 	 *   Mask of excluded section header flags }
 	 */
 	static const unsigned long masks[][2] = {
+		/* Core */
+		{ SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL },
+		{ SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL },
+		{ SHF_ALLOC, SHF_WRITE | ARCH_SHF_SMALL },
+		{ SHF_RO_AFTER_INIT | SHF_ALLOC, ARCH_SHF_SMALL },
+		{ SHF_WRITE | SHF_ALLOC, ARCH_SHF_SMALL },
+		{ ARCH_SHF_SMALL | SHF_ALLOC, 0 },
+		/* Init */
 		{ SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL },
 		{ SHF_ALLOC, SHF_WRITE | ARCH_SHF_SMALL },
 		{ SHF_RO_AFTER_INIT | SHF_ALLOC, ARCH_SHF_SMALL },
 		{ SHF_WRITE | SHF_ALLOC, ARCH_SHF_SMALL },
-		{ ARCH_SHF_SMALL | SHF_ALLOC, 0 }
+		{ ARCH_SHF_SMALL | SHF_ALLOC, 0 },
 	};
-	static const int core_m_to_mem_type[] = {
+	static bool (*const section_filter[])(const char *) = {
+		/* Core */
+		module_default_layout_section,
+		module_noinstr_layout_section,
+		module_default_layout_section,
+		module_default_layout_section,
+		module_default_layout_section,
+		module_default_layout_section,
+		/* Init */
+		module_init_layout_section,
+		module_init_layout_section,
+		module_init_layout_section,
+		module_init_layout_section,
+		module_init_layout_section,
+	};
+	static const int mem_type_map[] = {
+		/* Core */
 		MOD_TEXT,
+		MOD_NOINSTR_TEXT,
 		MOD_RODATA,
 		MOD_RO_AFTER_INIT,
 		MOD_DATA,
 		MOD_DATA,
-	};
-	static const int init_m_to_mem_type[] = {
+		/* Init */
 		MOD_INIT_TEXT,
 		MOD_INIT_RODATA,
 		MOD_INVALID,
@@ -1684,16 +1718,16 @@ static void __layout_sections(struct module *mod, struct load_info *info, bool i
 	};
 
 	for (m = 0; m < ARRAY_SIZE(masks); ++m) {
-		enum mod_mem_type type = is_init ? init_m_to_mem_type[m] : core_m_to_mem_type[m];
+		enum mod_mem_type type = mem_type_map[m];
 
 		for (i = 0; i < info->hdr->e_shnum; ++i) {
 			Elf_Shdr *s = &info->sechdrs[i];
 			const char *sname = info->secstrings + s->sh_name;
 
-			if ((s->sh_flags & masks[m][0]) != masks[m][0]
-			    || (s->sh_flags & masks[m][1])
-			    || s->sh_entsize != ~0UL
-			    || is_init != module_init_layout_section(sname))
+			if ((s->sh_flags & masks[m][0]) != masks[m][0] ||
+			    (s->sh_flags & masks[m][1])                ||
+			    s->sh_entsize != ~0UL                      ||
+			    !section_filter[m](sname))
 				continue;
 
 			if (WARN_ON_ONCE(type == MOD_INVALID))
@@ -1733,10 +1767,7 @@ static void layout_sections(struct module *mod, struct load_info *info)
 		info->sechdrs[i].sh_entsize = ~0UL;
 
 	pr_debug("Core section allocation order for %s:\n", mod->name);
-	__layout_sections(mod, info, false);
-
-	pr_debug("Init section allocation order for %s:\n", mod->name);
-	__layout_sections(mod, info, true);
+	__layout_sections(mod, info);
 }
 
 static void module_license_taint_check(struct module *mod, const char *license)
@@ -2625,9 +2656,6 @@ static int find_module_sections(struct module *mod, struct load_info *info)
 	}
 #endif
 
-	mod->noinstr_text_start = section_objs(info, ".noinstr.text", 1,
-						&mod->noinstr_text_size);
-
 #ifdef CONFIG_TRACEPOINTS
 	mod->tracepoints_ptrs = section_objs(info, "__tracepoints_ptrs",
 					     sizeof(*mod->tracepoints_ptrs),
@@ -3872,12 +3900,26 @@ struct module *__module_text_address(unsigned long addr)
 	if (mod) {
 		/* Make sure it's within the text section. */
 		if (!within_module_mem_type(addr, mod, MOD_TEXT) &&
+		    !within_module_mem_type(addr, mod, MOD_NOINSTR_TEXT) &&
 		    !within_module_mem_type(addr, mod, MOD_INIT_TEXT))
 			mod = NULL;
 	}
 	return mod;
 }
 
+bool is_module_noinstr_text_address(unsigned long addr)
+{
+	scoped_guard(preempt) {
+		struct module *mod = __module_address(addr);
+
+		/* Make sure it's within the .noinstr.text section. */
+		if (mod)
+			return within_module_mem_type(addr, mod, MOD_NOINSTR_TEXT);
+	}
+
+	return false;
+}
+
 /* Don't grab lock, we're oopsing. */
 void print_modules(void)
 {
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 24/31] context-tracking: Introduce work deferral infrastructure
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (22 preceding siblings ...)
  2025-11-14 15:14 ` [PATCH v7 23/31] module: Add MOD_NOINSTR_TEXT mem_type Valentin Schneider
@ 2025-11-14 15:14 ` Valentin Schneider
  2025-11-14 15:14 ` [PATCH v7 25/31] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
                   ` (7 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Nicolas Saenz Julienne, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Lutomirski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik, Shrikanth Hegde

smp_call_function() & friends have the unfortunate habit of sending IPIs to
isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online
CPUs.

Some callsites can be bent into doing the right thing, as was done by commit:

  cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")

Unfortunately, not all SMP callbacks can be omitted in this
fashion. However, some of them only affect execution in kernelspace, which
means they don't have to be executed *immediately* if the target CPU is in
userspace: stashing the callback and executing it upon the next kernel entry
would suffice. x86 kernel instruction patching or kernel TLB invalidation
are prime examples of it.

Reduce the RCU dynticks counter width to free up some bits to be used as a
deferred callback bitmask. Add some build-time checks to validate that
setup.

Presence of CT_RCU_WATCHING in the ct_state prevents queuing deferred work.

Later commits introduce the bit:callback mappings.
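
For illustration, a sender would use the infrastructure roughly as follows;
CT_WORK_FOO and do_foo() are placeholders, the real bit:callback mappings
only show up in later patches:

  int cpu;

  /* Defer the work; fall back to an IPI if the CPU is in the kernel. */
  for_each_cpu(cpu, targets) {
          if (!ct_set_cpu_work(cpu, CT_WORK_FOO))
                  smp_call_function_single(cpu, do_foo, NULL, 1);
  }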

Link: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/Kconfig                                 |  9 +++
 arch/x86/Kconfig                             |  1 +
 arch/x86/include/asm/context_tracking_work.h | 16 +++++
 include/linux/context_tracking.h             | 21 ++++++
 include/linux/context_tracking_state.h       | 30 +++++---
 include/linux/context_tracking_work.h        | 24 +++++++
 kernel/context_tracking.c                    | 72 +++++++++++++++++++-
 kernel/time/Kconfig                          |  5 ++
 8 files changed, 166 insertions(+), 12 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 61130b88964b9..6cc3965b8c9eb 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1024,6 +1024,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK
 	  - No use of instrumentation, unless instrumentation_begin() got
 	    called.
 
+config HAVE_CONTEXT_TRACKING_WORK
+	bool
+	help
+	  Architecture supports deferring work while not in kernel context.
+	  This is especially useful on setups with isolated CPUs that might
+	  want to avoid being interrupted to perform housekeeping tasks (for
+	  ex. TLB invalidation or icache invalidation). The housekeeping
+	  operations are performed upon re-entering the kernel.
+
 config HAVE_TIF_NOHZ
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fa3b616af03a2..fa9229c0e0939 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -219,6 +219,7 @@ config X86
 	select HAVE_CMPXCHG_LOCAL
 	select HAVE_CONTEXT_TRACKING_USER		if X86_64
 	select HAVE_CONTEXT_TRACKING_USER_OFFSTACK	if HAVE_CONTEXT_TRACKING_USER
+	select HAVE_CONTEXT_TRACKING_WORK		if X86_64
 	select HAVE_C_RECORDMCOUNT
 	select HAVE_OBJTOOL_MCOUNT		if HAVE_OBJTOOL
 	select HAVE_OBJTOOL_NOP_MCOUNT		if HAVE_OBJTOOL_MCOUNT
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
new file mode 100644
index 0000000000000..5f3b2d0977235
--- /dev/null
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
+#define _ASM_X86_CONTEXT_TRACKING_WORK_H
+
+static __always_inline void arch_context_tracking_work(enum ct_work work)
+{
+	switch (work) {
+	case CT_WORK_n:
+		// Do work...
+		break;
+	case CT_WORK_MAX:
+		WARN_ON_ONCE(true);
+	}
+}
+
+#endif
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index af9fe87a09225..0b0faa040e9b5 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -5,6 +5,7 @@
 #include <linux/sched.h>
 #include <linux/vtime.h>
 #include <linux/context_tracking_state.h>
+#include <linux/context_tracking_work.h>
 #include <linux/instrumentation.h>
 
 #include <asm/ptrace.h>
@@ -137,6 +138,26 @@ static __always_inline unsigned long ct_state_inc(int incby)
 	return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state));
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static __always_inline unsigned long ct_state_inc_clear_work(int incby)
+{
+	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
+	unsigned long new, old, state;
+
+	state = arch_atomic_read(&ct->state);
+	do {
+		old = state;
+		new = old & ~CT_WORK_MASK;
+		new += incby;
+		state = arch_atomic_cmpxchg(&ct->state, old, new);
+	} while (old != state);
+
+	return new;
+}
+#else
+#define ct_state_inc_clear_work(x) ct_state_inc(x)
+#endif
+
 static __always_inline bool warn_rcu_enter(void)
 {
 	bool ret = false;
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 0b81248aa03e2..d2c302133672f 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -5,6 +5,7 @@
 #include <linux/percpu.h>
 #include <linux/static_key.h>
 #include <linux/context_tracking_irq.h>
+#include <linux/context_tracking_work.h>
 
 /* Offset to allow distinguishing irq vs. task-based idle entry/exit. */
 #define CT_NESTING_IRQ_NONIDLE	((LONG_MAX / 2) + 1)
@@ -39,16 +40,19 @@ struct context_tracking {
 };
 
 /*
- * We cram two different things within the same atomic variable:
+ * We cram up to three different things within the same atomic variable:
  *
- *                     CT_RCU_WATCHING_START  CT_STATE_START
- *                                |                |
- *                                v                v
- *     MSB [ RCU watching counter ][ context_state ] LSB
- *         ^                       ^
- *         |                       |
- * CT_RCU_WATCHING_END        CT_STATE_END
+ *                     CT_RCU_WATCHING_START                  CT_STATE_START
+ *                                |         CT_WORK_START          |
+ *                                |               |                |
+ *                                v               v                v
+ *     MSB [ RCU watching counter ][ context work ][ context_state ] LSB
+ *         ^                       ^               ^
+ *         |                       |               |
+ *         |                  CT_WORK_END          |
+ * CT_RCU_WATCHING_END                        CT_STATE_END
  *
+ * The [ context work ] region spans 0 bits if CONFIG_CONTEXT_TRACKING_WORK=n
  * Bits are used from the LSB upwards, so unused bits (if any) will always be in
  * upper bits of the variable.
  */
@@ -59,18 +63,24 @@ struct context_tracking {
 #define CT_STATE_START 0
 #define CT_STATE_END   (CT_STATE_START + CT_STATE_WIDTH - 1)
 
-#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH)
+#define CT_WORK_WIDTH (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? CT_WORK_MAX_OFFSET : 0)
+#define	CT_WORK_START (CT_STATE_END + 1)
+#define CT_WORK_END   (CT_WORK_START + CT_WORK_WIDTH - 1)
+
+#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_WORK_WIDTH - CT_STATE_WIDTH)
 #define CT_RCU_WATCHING_WIDTH     (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH)
-#define CT_RCU_WATCHING_START     (CT_STATE_END + 1)
+#define CT_RCU_WATCHING_START     (CT_WORK_END + 1)
 #define CT_RCU_WATCHING_END       (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1)
 #define CT_RCU_WATCHING           BIT(CT_RCU_WATCHING_START)
 
 #define CT_STATE_MASK        GENMASK(CT_STATE_END,        CT_STATE_START)
+#define CT_WORK_MASK         GENMASK(CT_WORK_END,         CT_WORK_START)
 #define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
 
 #define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
 
 static_assert(CT_STATE_WIDTH        +
+	      CT_WORK_WIDTH         +
 	      CT_RCU_WATCHING_WIDTH +
 	      CT_UNUSED_WIDTH       ==
 	      CT_SIZE);
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
new file mode 100644
index 0000000000000..3742f461183ac
--- /dev/null
+++ b/include/linux/context_tracking_work.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CONTEXT_TRACKING_WORK_H
+#define _LINUX_CONTEXT_TRACKING_WORK_H
+
+#include <linux/bitops.h>
+
+enum {
+	CT_WORK_n_OFFSET,
+	CT_WORK_MAX_OFFSET
+};
+
+enum ct_work {
+	CT_WORK_n        = BIT(CT_WORK_n_OFFSET),
+	CT_WORK_MAX      = BIT(CT_WORK_MAX_OFFSET)
+};
+
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+extern bool ct_set_cpu_work(unsigned int cpu, enum ct_work work);
+#else
+static inline bool
+ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; }
+#endif
+
+#endif
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index fb5be6e9b423f..5a3cc4ce4d8a6 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -25,6 +25,9 @@
 #include <linux/kprobes.h>
 #include <trace/events/rcu.h>
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+#include <asm/context_tracking_work.h>
+#endif
 
 DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
 #ifdef CONFIG_CONTEXT_TRACKING_IDLE
@@ -72,6 +75,70 @@ static __always_inline void rcu_task_trace_heavyweight_exit(void)
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static noinstr void ct_work_flush(unsigned long seq)
+{
+	int bit;
+
+	seq = (seq & CT_WORK_MASK) >> CT_WORK_START;
+
+	/*
+	 * arch_context_tracking_work() must be noinstr, non-blocking,
+	 * and NMI safe.
+	 */
+	for_each_set_bit(bit, &seq, CT_WORK_MAX)
+		arch_context_tracking_work(BIT(bit));
+}
+
+/**
+ * ct_set_cpu_work - set work to be run at next kernel context entry
+ *
+ * If @cpu is not currently executing in kernelspace, it will execute the
+ * callback mapped to @work (see arch_context_tracking_work()) at its next
+ * entry into ct_kernel_enter_state().
+ *
+ * If it is already executing in kernelspace, this will be a no-op.
+ */
+bool ct_set_cpu_work(unsigned int cpu, enum ct_work work)
+{
+	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
+	unsigned int old;
+	bool ret = false;
+
+	if (!ct->active)
+		return false;
+
+	preempt_disable();
+
+	old = atomic_read(&ct->state);
+
+	/*
+	 * The work bit must only be set if the target CPU is not executing
+	 * in kernelspace.
+	 * CT_RCU_WATCHING is used as a proxy for that - if the bit is set, we
+	 * know for sure the CPU is executing in the kernel whether that be in
+	 * NMI, IRQ or process context.
+	 * Clear CT_RCU_WATCHING here and let the cmpxchg do the check for us;
+	 * the state could change between the atomic_read() and the cmpxchg().
+	 */
+	old &= ~CT_RCU_WATCHING;
+	/*
+	 * Try setting the work until either
+	 * - the target CPU has entered kernelspace
+	 * - the work has been set
+	 */
+	do {
+		ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START));
+	} while (!ret && !(old & CT_RCU_WATCHING));
+
+	preempt_enable();
+	return ret;
+}
+#else
+static __always_inline void ct_work_flush(unsigned long work) { }
+static __always_inline void ct_work_clear(struct context_tracking *ct) { }
+#endif
+
 /*
  * Record entry into an extended quiescent state.  This is only to be
  * called when not already in an extended quiescent state, that is,
@@ -88,7 +155,7 @@ static noinstr void ct_kernel_exit_state(int offset)
 	rcu_task_trace_heavyweight_enter();  // Before CT state update!
 	// RCU is still watching.  Better not be in extended quiescent state!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !rcu_is_watching_curr_cpu());
-	(void)ct_state_inc(offset);
+	(void)ct_state_inc_clear_work(offset);
 	// RCU is no longer watching.
 }
 
@@ -99,7 +166,7 @@ static noinstr void ct_kernel_exit_state(int offset)
  */
 static noinstr void ct_kernel_enter_state(int offset)
 {
-	int seq;
+	unsigned long seq;
 
 	/*
 	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
@@ -107,6 +174,7 @@ static noinstr void ct_kernel_enter_state(int offset)
 	 * critical section.
 	 */
 	seq = ct_state_inc(offset);
+	ct_work_flush(seq);
 	// RCU is now watching.  Better not be in an extended quiescent state!
 	rcu_task_trace_heavyweight_exit();  // After CT state update!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & CT_RCU_WATCHING));
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 7c6a52f7836ce..1a0c027aad141 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE
 	  Say N otherwise, this option brings an overhead that you
 	  don't want in production.
 
+config CONTEXT_TRACKING_WORK
+	bool
+	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+	default y
+
 config NO_HZ
 	bool "Old Idle dynticks config"
 	help
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 25/31] context_tracking,x86: Defer kernel text patching IPIs
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (23 preceding siblings ...)
  2025-11-14 15:14 ` [PATCH v7 24/31] context-tracking: Introduce work deferral infrastructure Valentin Schneider
@ 2025-11-14 15:14 ` Valentin Schneider
  2025-11-14 15:14 ` [PATCH v7 26/31] x86/jump_label: Add ASM support for static_branch_likely() Valentin Schneider
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Peter Zijlstra (Intel), Nicolas Saenz Julienne,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Paul E. McKenney, Jason Baron, Steven Rostedt,
	Ard Biesheuvel, Sami Tolvanen, David S. Miller, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Mel Gorman, Andrew Morton, Masahiro Yamada,
	Han Shen, Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov,
	Juri Lelli, Clark Williams, Yair Podemsky, Marcelo Tosatti,
	Daniel Wagner, Petr Tesarik, Shrikanth Hegde

text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
them vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.

As the synchronization IPIs are sent using a blocking call, returning from
text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is deferred.
In other words, to safely defer this synchronization, any kernel
instruction leading to the execution of the deferred instruction
sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.

This means we must pay attention to mutable instructions in the early entry
code:
- alternatives
- static keys
- static calls
- all sorts of probes (kprobes/ftrace/bpf/???)

The early entry code leading to ct_work_flush() is noinstr, which gets rid
of the probes.

Alternatives are safe, because it's boot-time patching (before SMP is
even brought up) which is before any IPI deferral can happen.

This leaves us with static keys and static calls.

Any static key used in early entry code should only ever be enabled at
boot time and never be flipped afterwards, IOW __ro_after_init (pretty
much like alternatives). Exceptions are explicitly marked as allowed in
.noinstr and will always generate an IPI when flipped.

The same applies to static calls - they should only be updated at boot
time, or manually marked as an exception.

Objtool is now able to point at static keys/calls that don't respect this,
and all static keys/calls used in early entry code have now been verified
as behaving appropriately.

Leverage the new context_tracking infrastructure to defer sync_core() IPIs
to a target CPU's next kernel entry.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
---
 arch/x86/include/asm/context_tracking_work.h |  6 ++-
 arch/x86/include/asm/text-patching.h         |  1 +
 arch/x86/kernel/alternative.c                | 39 +++++++++++++++++---
 arch/x86/kernel/kprobes/core.c               |  4 +-
 arch/x86/kernel/kprobes/opt.c                |  4 +-
 arch/x86/kernel/module.c                     |  2 +-
 include/asm-generic/sections.h               | 15 ++++++++
 include/linux/context_tracking_work.h        |  4 +-
 8 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 5f3b2d0977235..485b32881fde5 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -2,11 +2,13 @@
 #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
+#include <asm/sync_core.h>
+
 static __always_inline void arch_context_tracking_work(enum ct_work work)
 {
 	switch (work) {
-	case CT_WORK_n:
-		// Do work...
+	case CT_WORK_SYNC:
+		sync_core();
 		break;
 	case CT_WORK_MAX:
 		WARN_ON_ONCE(true);
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index f2d142a0a862e..ca989b0b6c0ae 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -33,6 +33,7 @@ extern void text_poke_apply_relocation(u8 *buf, const u8 * const instr, size_t i
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void smp_text_poke_sync_each_cpu(void);
+extern void smp_text_poke_sync_each_cpu_deferrable(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
 #define text_poke_copy text_poke_copy
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8ee5ff547357a..ce8989e02ae77 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -6,6 +6,7 @@
 #include <linux/vmalloc.h>
 #include <linux/memory.h>
 #include <linux/execmem.h>
+#include <linux/context_tracking.h>
 
 #include <asm/text-patching.h>
 #include <asm/insn.h>
@@ -2708,9 +2709,24 @@ static void do_sync_core(void *info)
 	sync_core();
 }
 
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CT_WORK_SYNC);
+}
+
+static void __smp_text_poke_sync_each_cpu(smp_cond_func_t cond_func)
+{
+	on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
 void smp_text_poke_sync_each_cpu(void)
 {
-	on_each_cpu(do_sync_core, NULL, 1);
+	__smp_text_poke_sync_each_cpu(NULL);
+}
+
+void smp_text_poke_sync_each_cpu_deferrable(void)
+{
+	__smp_text_poke_sync_each_cpu(do_sync_core_defer_cond);
 }
 
 /*
@@ -2880,6 +2896,7 @@ noinstr int smp_text_poke_int3_handler(struct pt_regs *regs)
  */
 void smp_text_poke_batch_finish(void)
 {
+	smp_cond_func_t cond = do_sync_core_defer_cond;
 	unsigned char int3 = INT3_INSN_OPCODE;
 	unsigned int i;
 	int do_sync;
@@ -2916,11 +2933,21 @@ void smp_text_poke_batch_finish(void)
 	 * First step: add a INT3 trap to the address that will be patched.
 	 */
 	for (i = 0; i < text_poke_array.nr_entries; i++) {
-		text_poke_array.vec[i].old = *(u8 *)text_poke_addr(&text_poke_array.vec[i]);
-		text_poke(text_poke_addr(&text_poke_array.vec[i]), &int3, INT3_INSN_SIZE);
+		void *addr = text_poke_addr(&text_poke_array.vec[i]);
+
+		/*
+		 * There's no safe way to defer IPIs for patching text in
+		 * .noinstr, so record whether there is at least one such poke.
+		 */
+		if (is_kernel_noinstr_text((unsigned long)addr) ||
+		    is_module_noinstr_text_address((unsigned long)addr))
+			cond = NULL;
+
+		text_poke_array.vec[i].old = *((u8 *)addr);
+		text_poke(addr, &int3, INT3_INSN_SIZE);
 	}
 
-	smp_text_poke_sync_each_cpu();
+	__smp_text_poke_sync_each_cpu(cond);
 
 	/*
 	 * Second step: update all but the first byte of the patched range.
@@ -2982,7 +3009,7 @@ void smp_text_poke_batch_finish(void)
 		 * not necessary and we'd be safe even without it. But
 		 * better safe than sorry (plus there's not only Intel).
 		 */
-		smp_text_poke_sync_each_cpu();
+		__smp_text_poke_sync_each_cpu(cond);
 	}
 
 	/*
@@ -3003,7 +3030,7 @@ void smp_text_poke_batch_finish(void)
 	}
 
 	if (do_sync)
-		smp_text_poke_sync_each_cpu();
+		__smp_text_poke_sync_each_cpu(cond);
 
 	/*
 	 * Remove and wait for refs to be zero.
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 3863d7709386f..51957cd737f52 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -790,7 +790,7 @@ void arch_arm_kprobe(struct kprobe *p)
 	u8 int3 = INT3_INSN_OPCODE;
 
 	text_poke(p->addr, &int3, 1);
-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();
 	perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
 }
 
@@ -800,7 +800,7 @@ void arch_disarm_kprobe(struct kprobe *p)
 
 	perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
 	text_poke(p->addr, &p->opcode, 1);
-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();
 }
 
 void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 0aabd4c4e2c4f..eada8dca1c2e8 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -513,11 +513,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
 	       JMP32_INSN_SIZE - INT3_INSN_SIZE);
 
 	text_poke(addr, new, INT3_INSN_SIZE);
-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();
 	text_poke(addr + INT3_INSN_SIZE,
 		  new + INT3_INSN_SIZE,
 		  JMP32_INSN_SIZE - INT3_INSN_SIZE);
-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();
 
 	perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
 }
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 0ffbae902e2fe..c6c4f391eb465 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -206,7 +206,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
 				   write, apply);
 
 	if (!early) {
-		smp_text_poke_sync_each_cpu();
+		smp_text_poke_sync_each_cpu_deferrable();
 		mutex_unlock(&text_mutex);
 	}
 
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index 0755bc39b0d80..7d2403014010e 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -199,6 +199,21 @@ static inline bool is_kernel_inittext(unsigned long addr)
 	       addr < (unsigned long)_einittext;
 }
 
+
+/**
+ * is_kernel_noinstr_text - checks if the pointer address is located in the
+ *                    .noinstr section
+ *
+ * @addr: address to check
+ *
+ * Returns: true if the address is located in .noinstr, false otherwise.
+ */
+static inline bool is_kernel_noinstr_text(unsigned long addr)
+{
+	return addr >= (unsigned long)__noinstr_text_start &&
+	       addr < (unsigned long)__noinstr_text_end;
+}
+
 /**
  * __is_kernel_text - checks if the pointer address is located in the
  *                    .text section
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 3742f461183ac..abd3c196855f4 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -5,12 +5,12 @@
 #include <linux/bitops.h>
 
 enum {
-	CT_WORK_n_OFFSET,
+	CT_WORK_SYNC_OFFSET,
 	CT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
-	CT_WORK_n        = BIT(CT_WORK_n_OFFSET),
+	CT_WORK_SYNC     = BIT(CT_WORK_SYNC_OFFSET),
 	CT_WORK_MAX      = BIT(CT_WORK_MAX_OFFSET)
 };
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 26/31] x86/jump_label: Add ASM support for static_branch_likely()
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (24 preceding siblings ...)
  2025-11-14 15:14 ` [PATCH v7 25/31] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
@ 2025-11-14 15:14 ` Valentin Schneider
  2025-11-14 15:14 ` [PATCH v7 27/31] x86/mm: Make INVPCID type macros available to assembly Valentin Schneider
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Lutomirski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

A later commit will add some early entry code that only needs to be
executed if nohz_full is present on the cmdline, not just if
CONFIG_NO_HZ_FULL is compiled in. Add an ASM-callable static branch macro.

Note that I haven't found a way to express unlikely (i.e. out-of-line)
static branches in ASM macros without using extra jumps, which kind of
defeats the purpose. Consider:

  .macro FOOBAR
	  // Key enabled:  JMP .Ldostuff_\@
	  // Key disabled: NOP
	  STATIC_BRANCH_UNLIKELY key, .Ldostuff_\@ // Patched to JMP if enabled
	  jmp .Lend_\@
  .Ldostuff_\@:
	  <dostuff>
  .Lend_\@:
  .endm

Instead, this should be expressed as a likely (i.e. in-line) static key:

  .macro FOOBAR
	  // Key enabled:  NOP
	  // Key disabled: JMP .Lend_\@
	  STATIC_BRANCH_LIKELY key, .Lend_\@ // Patched to NOP if enabled
	  <dostuff>
  .Lend_\@:
  .endm

Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/jump_label.h | 33 ++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 61dd1dee7812e..3c9ba3948e225 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -7,7 +7,38 @@
 #include <asm/asm.h>
 #include <asm/nops.h>
 
-#ifndef __ASSEMBLER__
+#ifdef __ASSEMBLER__
+
+/*
+ * There isn't a neat way to craft unlikely static branches in ASM, so they
+ * all have to be expressed as likely (inline) static branches. This macro
+ * thus assumes a "likely" usage.
+ */
+.macro ARCH_STATIC_BRANCH_LIKELY_ASM key, label, jump, hack
+1:
+.if \jump || \hack
+	jmp \label
+.else
+	.byte BYTES_NOP5
+.endif
+	.pushsection __jump_table, "aw"
+	_ASM_ALIGN
+	.long 1b - .
+	.long \label - .
+	/* LIKELY so bit0=1, bit1=hack */
+	_ASM_PTR \key + 1 + (\hack << 1) - .
+	.popsection
+.endm
+
+.macro STATIC_BRANCH_TRUE_LIKELY key, label
+	ARCH_STATIC_BRANCH_LIKELY_ASM \key, \label, 0, IS_ENABLED(CONFIG_HAVE_JUMP_LABEL_HACK)
+.endm
+
+.macro STATIC_BRANCH_FALSE_LIKELY key, label
+	ARCH_STATIC_BRANCH_LIKELY_ASM \key, \label, 1, 0
+.endm
+
+#else /* !__ASSEMBLER__ */
 
 #include <linux/stringify.h>
 #include <linux/types.h>
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 27/31] x86/mm: Make INVPCID type macros available to assembly
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (25 preceding siblings ...)
  2025-11-14 15:14 ` [PATCH v7 26/31] x86/jump_label: Add ASM support for static_branch_likely() Valentin Schneider
@ 2025-11-14 15:14 ` Valentin Schneider
  2025-11-14 15:14 ` [RFC PATCH v7 28/31] x86/mm/pti: Introduce a kernel/user CR3 software signal Valentin Schneider
                   ` (4 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

A later commit will introduce a pure-assembly INVPCID invocation, so
allow assembly files to get the type definitions.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/invpcid.h | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/invpcid.h b/arch/x86/include/asm/invpcid.h
index 734482afbf81d..27ae75c2d7fed 100644
--- a/arch/x86/include/asm/invpcid.h
+++ b/arch/x86/include/asm/invpcid.h
@@ -2,6 +2,13 @@
 #ifndef _ASM_X86_INVPCID
 #define _ASM_X86_INVPCID
 
+#define INVPCID_TYPE_INDIV_ADDR		0
+#define INVPCID_TYPE_SINGLE_CTXT	1
+#define INVPCID_TYPE_ALL_INCL_GLOBAL	2
+#define INVPCID_TYPE_ALL_NON_GLOBAL	3
+
+#ifndef __ASSEMBLER__
+
 static inline void __invpcid(unsigned long pcid, unsigned long addr,
 			     unsigned long type)
 {
@@ -17,11 +24,6 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr,
 		     :: [desc] "m" (desc), [type] "r" (type) : "memory");
 }
 
-#define INVPCID_TYPE_INDIV_ADDR		0
-#define INVPCID_TYPE_SINGLE_CTXT	1
-#define INVPCID_TYPE_ALL_INCL_GLOBAL	2
-#define INVPCID_TYPE_ALL_NON_GLOBAL	3
-
 /* Flush all mappings for a given pcid and addr, not including globals. */
 static inline void invpcid_flush_one(unsigned long pcid,
 				     unsigned long addr)
@@ -47,4 +49,6 @@ static inline void invpcid_flush_all_nonglobals(void)
 	__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
 }
 
+#endif /* __ASSEMBLER__ */
+
 #endif /* _ASM_X86_INVPCID */
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH v7 28/31] x86/mm/pti: Introduce a kernel/user CR3 software signal
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (26 preceding siblings ...)
  2025-11-14 15:14 ` [PATCH v7 27/31] x86/mm: Make INVPCID type macros available to assembly Valentin Schneider
@ 2025-11-14 15:14 ` Valentin Schneider
  2025-11-14 15:14 ` [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3 Valentin Schneider
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

Later commits will rely on this information to defer kernel TLB flush
IPIs. Update it when switching to and from the kernel CR3.

This will only be really useful for NOHZ_FULL CPUs, but it should be
cheaper to unconditionally update a per-CPU variable living in its own
cacheline (even if it ends up never being read) than to check a shared
cpumask such as
  housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
at every entry.

Note that the COALESCE_TLBI config option is introduced in a later commit,
when the whole feature is implemented.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
Per the cover letter, I really hate this, but couldn't come up with
anything better.
---
 arch/x86/entry/calling.h        | 21 +++++++++++++++++++++
 arch/x86/entry/syscall_64.c     |  4 ++++
 arch/x86/include/asm/tlbflush.h |  3 +++
 3 files changed, 28 insertions(+)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 77e2d920a6407..0187c0ea2fddb 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -9,6 +9,7 @@
 #include <asm/ptrace-abi.h>
 #include <asm/msr.h>
 #include <asm/nospec-branch.h>
+#include <asm/jump_label.h>

 /*

@@ -170,11 +171,28 @@ For 32-bit we have the following conventions - kernel is built with
	andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
 .endm

+.macro COALESCE_TLBI
+#ifdef CONFIG_COALESCE_TLBI
+	STATIC_BRANCH_FALSE_LIKELY housekeeping_overridden, .Lend_\@
+	movl     $1, PER_CPU_VAR(kernel_cr3_loaded)
+.Lend_\@:
+#endif // CONFIG_COALESCE_TLBI
+.endm
+
+.macro NOTE_SWITCH_TO_USER_CR3
+#ifdef CONFIG_COALESCE_TLBI
+	STATIC_BRANCH_FALSE_LIKELY housekeeping_overridden, .Lend_\@
+	movl     $0, PER_CPU_VAR(kernel_cr3_loaded)
+.Lend_\@:
+#endif // CONFIG_COALESCE_TLBI
+.endm
+
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
	mov	%cr3, \scratch_reg
	ADJUST_KERNEL_CR3 \scratch_reg
	mov	\scratch_reg, %cr3
+	COALESCE_TLBI
 .Lend_\@:
 .endm

@@ -182,6 +200,7 @@ For 32-bit we have the following conventions - kernel is built with
	PER_CPU_VAR(cpu_tlbstate + TLB_STATE_user_pcid_flush_mask)

 .macro SWITCH_TO_USER_CR3 scratch_reg:req scratch_reg2:req
+	NOTE_SWITCH_TO_USER_CR3
	mov	%cr3, \scratch_reg

	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
@@ -241,6 +260,7 @@ For 32-bit we have the following conventions - kernel is built with

	ADJUST_KERNEL_CR3 \scratch_reg
	movq	\scratch_reg, %cr3
+	COALESCE_TLBI

 .Ldone_\@:
 .endm
@@ -257,6 +277,7 @@ For 32-bit we have the following conventions - kernel is built with
	bt	$PTI_USER_PGTABLE_BIT, \save_reg
	jnc	.Lend_\@

+	NOTE_SWITCH_TO_USER_CR3
	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID

	/*
diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index b6e68ea98b839..2589d232e0ba1 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -83,6 +83,10 @@ static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
	return false;
 }

+#ifdef CONFIG_COALESCE_TLBI
+DEFINE_PER_CPU(bool, kernel_cr3_loaded) = true;
+#endif
+
 /* Returns true to return using SYSRET, or false to use IRET */
 __visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
 {
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 00daedfefc1b0..e39ae95b85072 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -17,6 +17,9 @@
 #include <asm/pgtable.h>

 DECLARE_PER_CPU(u64, tlbstate_untag_mask);
+#ifdef CONFIG_COALESCE_TLBI
+DECLARE_PER_CPU(bool, kernel_cr3_loaded);
+#endif

 void __flush_tlb_all(void);

--
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (27 preceding siblings ...)
  2025-11-14 15:14 ` [RFC PATCH v7 28/31] x86/mm/pti: Introduce a kernel/user CR3 software signal Valentin Schneider
@ 2025-11-14 15:14 ` Valentin Schneider
  2025-11-19 14:31   ` Andy Lutomirski
  2025-11-14 15:14 ` [RFC PATCH v7 30/31] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y Valentin Schneider
                   ` (2 subsequent siblings)
  31 siblings, 1 reply; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

Deferring kernel range TLB flushes requires the guarantee that upon
entering the kernel, no stale entry may be accessed. The simplest way to
provide such a guarantee is to issue an unconditional flush upon switching
to the kernel CR3, as this is the pivoting point where such stale entries
may be accessed.

As this is only relevant to NOHZ_FULL, restrict the mechanism to NOHZ_FULL
CPUs.

Note that the COALESCE_TLBI config option is introduced in a later commit,
when the whole feature is implemented.
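
For readability, the assembly added below is roughly equivalent to the
following C - an illustrative sketch only, since the real flush has to
happen in the entry asm right after the CR3 switch, before any
possibly-stale kernel mapping can be touched:

  static __always_inline void coalesce_tlbi_flush(void)
  {
	  if (cpu_feature_enabled(X86_FEATURE_INVPCID)) {
		  /* Flush everything, including global translations */
		  invpcid_flush_all();
	  } else {
		  /*
		   * No INVPCID: toggle CR4.PGE to flush global translations,
		   * the same trick native_flush_tlb_global() uses.
		   */
		  unsigned long cr4 = this_cpu_read(cpu_tlbstate.cr4);

		  native_write_cr4(cr4 ^ X86_CR4_PGE);
		  native_write_cr4(cr4);
	  }
  }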

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/entry/calling.h      | 25 ++++++++++++++++++++++---
 arch/x86/kernel/asm-offsets.c |  1 +
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 0187c0ea2fddb..620203ef04e9f 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -10,6 +10,7 @@
 #include <asm/msr.h>
 #include <asm/nospec-branch.h>
 #include <asm/jump_label.h>
+#include <asm/invpcid.h>

 /*

@@ -171,9 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
	andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
 .endm

-.macro COALESCE_TLBI
+.macro COALESCE_TLBI scratch_reg:req
 #ifdef CONFIG_COALESCE_TLBI
	STATIC_BRANCH_FALSE_LIKELY housekeeping_overridden, .Lend_\@
+	/* No point in doing this for housekeeping CPUs */
+	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
+	bt	\scratch_reg, tick_nohz_full_mask(%rip)
+	jnc	.Lend_tlbi_\@
+
+	ALTERNATIVE "jmp .Lcr4_\@", "", X86_FEATURE_INVPCID
+	movq $(INVPCID_TYPE_ALL_INCL_GLOBAL), \scratch_reg
+	/* descriptor is all zeroes, point at the zero page */
+	invpcid empty_zero_page(%rip), \scratch_reg
+	jmp .Lend_tlbi_\@
+.Lcr4_\@:
+	/* Note: this gives CR4 pinning the finger */
+	movq PER_CPU_VAR(cpu_tlbstate + TLB_STATE_cr4), \scratch_reg
+	xorq $(X86_CR4_PGE), \scratch_reg
+	movq \scratch_reg, %cr4
+	xorq $(X86_CR4_PGE), \scratch_reg
+	movq \scratch_reg, %cr4
+.Lend_tlbi_\@:
	movl     $1, PER_CPU_VAR(kernel_cr3_loaded)
 .Lend_\@:
 #endif // CONFIG_COALESCE_TLBI
@@ -192,7 +211,7 @@ For 32-bit we have the following conventions - kernel is built with
	mov	%cr3, \scratch_reg
	ADJUST_KERNEL_CR3 \scratch_reg
	mov	\scratch_reg, %cr3
-	COALESCE_TLBI
+	COALESCE_TLBI \scratch_reg
 .Lend_\@:
 .endm

@@ -260,7 +279,7 @@ For 32-bit we have the following conventions - kernel is built with

	ADJUST_KERNEL_CR3 \scratch_reg
	movq	\scratch_reg, %cr3
-	COALESCE_TLBI
+	COALESCE_TLBI \scratch_reg

 .Ldone_\@:
 .endm
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 32ba599a51f88..deb92e9c8923d 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -106,6 +106,7 @@ static void __used common(void)

	/* TLB state for the entry code */
	OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
+	OFFSET(TLB_STATE_cr4, tlb_state, cr4);

	/* Layout info for cpu_entry_area */
	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
--
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH v7 30/31] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (28 preceding siblings ...)
  2025-11-14 15:14 ` [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3 Valentin Schneider
@ 2025-11-14 15:14 ` Valentin Schneider
  2025-11-19 18:31   ` Dave Hansen
  2025-11-14 15:14 ` [RFC PATCH v7 31/31] x86/entry: Add an option to coalesce TLB flushes Valentin Schneider
  2025-11-14 16:20 ` [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Andy Lutomirski
  31 siblings, 1 reply; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

Previous commits have added an unconditional TLB flush right after
switching to the kernel CR3 on NOHZ_FULL CPUs, and a software signal to
determine whether a CPU has its kernel CR3 loaded.

Using these two components, we can now safely defer kernel TLB flush IPIs
targeting NOHZ_FULL CPUs executing in userspace (i.e. with the user CR3
loaded).

Note that the COALESCE_TLBI config option is introduced in a later commit,
when the whole feature is implemented.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/tlbflush.h |  3 +++
 arch/x86/mm/tlb.c               | 34 ++++++++++++++++++++++++++-------
 mm/vmalloc.c                    | 34 ++++++++++++++++++++++++++++-----
 3 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e39ae95b85072..6d533afd70952 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -321,6 +321,9 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
				unsigned long end, unsigned int stride_shift,
				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+#ifdef CONFIG_COALESCE_TLBI
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
+#endif

 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5d221709353e0..1ce80f8775e7a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
 #include <linux/task_work.h>
 #include <linux/mmu_notifier.h>
 #include <linux/mmu_context.h>
+#include <linux/sched/isolation.h>

 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1529,23 +1530,24 @@ static void do_kernel_range_flush(void *info)
		flush_tlb_one_kernel(addr);
 }

-static void kernel_tlb_flush_all(struct flush_tlb_info *info)
+static void kernel_tlb_flush_all(smp_cond_func_t cond, struct flush_tlb_info *info)
 {
	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
		invlpgb_flush_all();
	else
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		on_each_cpu_cond(cond, do_flush_tlb_all, NULL, 1);
 }

-static void kernel_tlb_flush_range(struct flush_tlb_info *info)
+static void kernel_tlb_flush_range(smp_cond_func_t cond, struct flush_tlb_info *info)
 {
	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
		invlpgb_kernel_range_flush(info);
	else
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		on_each_cpu_cond(cond, do_kernel_range_flush, info, 1);
 }

-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond, unsigned long start, unsigned long end)
 {
	struct flush_tlb_info *info;

@@ -1555,13 +1557,31 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
				  TLB_GENERATION_INVALID);

	if (info->end == TLB_FLUSH_ALL)
-		kernel_tlb_flush_all(info);
+		kernel_tlb_flush_all(cond, info);
	else
-		kernel_tlb_flush_range(info);
+		kernel_tlb_flush_range(cond, info);

	put_flush_tlb_info();
 }

+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(NULL, start, end);
+}
+
+#ifdef CONFIG_COALESCE_TLBI
+static bool flush_tlb_kernel_cond(int cpu, void *info)
+{
+	return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
+	       per_cpu(kernel_cr3_loaded, cpu);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(flush_tlb_kernel_cond, start, end);
+}
+#endif
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 798b2ed21e460..76ec10d56623b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -494,6 +494,30 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
	__vunmap_range_noflush(start, end);
 }

+#ifdef CONFIG_COALESCE_TLBI
+/*
+ * !!! BIG FAT WARNING !!!
+ *
+ * The CPU is free to cache any part of the paging hierarchy it wants at any
+ * time. It's also free to set accessed and dirty bits at any time, even for
+ * instructions that may never execute architecturally.
+ *
+ * This means that deferring a TLB flush affecting freed page-table-pages (IOW,
+ * keeping them in a CPU's paging hierarchy cache) is a recipe for disaster.
+ *
+ * This isn't a problem for deferral of TLB flushes in vmalloc, because
+ * page-table-pages used for vmap() mappings are never freed - see how
+ * __vunmap_range_noflush() walks the whole mapping but only clears the leaf PTEs.
+ * If this ever changes, TLB flush deferral will cause misery.
+ */
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+#else
+#define flush_tlb_kernel_range_deferrable(start, end) flush_tlb_kernel_range(start, end)
+#endif
+
 /**
  * vunmap_range - unmap kernel virtual addresses
  * @addr: start of the VM area to unmap
@@ -507,7 +531,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
 {
	flush_cache_vunmap(addr, end);
	vunmap_range_noflush(addr, end);
-	flush_tlb_kernel_range(addr, end);
+	flush_tlb_kernel_range_deferrable(addr, end);
 }

 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -2339,7 +2363,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,

	nr_purge_nodes = cpumask_weight(&purge_nodes);
	if (nr_purge_nodes > 0) {
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);

		/* One extra worker is per a lazy_max_pages() full set minus one. */
		nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
@@ -2442,7 +2466,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
	flush_cache_vunmap(va->va_start, va->va_end);
	vunmap_range_noflush(va->va_start, va->va_end);
	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(va->va_start, va->va_end);
+		flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);

	free_vmap_area_noflush(va);
 }
@@ -2890,7 +2914,7 @@ static void vb_free(unsigned long addr, unsigned long size)
	vunmap_range_noflush(addr, addr + size);

	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(addr, addr + size);
+		flush_tlb_kernel_range_deferrable(addr, addr + size);

	spin_lock(&vb->lock);

@@ -2955,7 +2979,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
	free_purged_blocks(&purge_list);

	if (!__purge_vmap_area_lazy(start, end, false) && flush)
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
	mutex_unlock(&vmap_purge_lock);
 }

--
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH v7 31/31] x86/entry: Add an option to coalesce TLB flushes
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (29 preceding siblings ...)
  2025-11-14 15:14 ` [RFC PATCH v7 30/31] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y Valentin Schneider
@ 2025-11-14 15:14 ` Valentin Schneider
  2025-11-14 16:20 ` [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Andy Lutomirski
  31 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-14 15:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

Previous patches have introduced a mechanism to prevent kernel text updates
from inducing interference on isolated CPUs. A similar action is required
for kernel-range TLB flushes in order to silence the biggest remaining
cause of isolated CPU IPI interference.

These flushes are mostly caused by vmalloc manipulations - e.g. on x86 with
CONFIG_VMAP_STACK, spawning enough processes will easily trigger
flushes. Unfortunately, the newly added context_tracking IPI deferral
mechanism cannot be leveraged for TLB flushes, as the deferred work would
be executed too late. Consider the following execution flow:

  <userspace>

  !interrupt!

  SWITCH_TO_KERNEL_CR3 // vmalloc range becomes accessible

  idtentry_func_foo()
    irqentry_enter()
      irqentry_enter_from_user_mode()
	enter_from_user_mode()
	  [...]
	    ct_kernel_enter_state()
	      ct_work_flush() // deferred flush would be done here

Since there is no sane way to assert no stale entry is accessed during
kernel entry, any code executed between SWITCH_TO_KERNEL_CR3 and
ct_work_flush() is at risk of accessing a stale entry. Dave had suggested
hacking up something within SWITCH_TO_KERNEL_CR3 itself, which is what has
been implemented in the previous patches.

Make kernel-range TLB flush deferral available via CONFIG_COALESCE_TLBI.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/Kconfig | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fa9229c0e0939..04f9d6496bbbc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2189,6 +2189,23 @@ config ADDRESS_MASKING
	  The capability can be used for efficient address sanitizers (ASAN)
	  implementation and for optimizations in JITs.

+config COALESCE_TLBI
+       def_bool n
+       prompt "Coalesce kernel TLB flushes for NOHZ-full CPUs"
+       depends on X86_64 && MITIGATION_PAGE_TABLE_ISOLATION && NO_HZ_FULL
+       help
+	 TLB flushes for kernel addresses can lead to IPIs being sent to
+	 NOHZ-full CPUs, thus kicking them out of userspace.
+
+	 This option coalesces kernel-range TLB flushes for NOHZ-full CPUs into
+	 a single flush executed at kernel entry, right after switching to the
+	 kernel page table. Note that this flush is unconditional, even if no
+	 remote flush was issued during the previous userspace execution window.
+
+	 This obviously makes the user->kernel transition overhead even worse.
+
+	 If unsure, say N.
+
 config HOTPLUG_CPU
	def_bool y
	depends on SMP
--
2.51.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (30 preceding siblings ...)
  2025-11-14 15:14 ` [RFC PATCH v7 31/31] x86/entry: Add an option to coalesce TLB flushes Valentin Schneider
@ 2025-11-14 16:20 ` Andy Lutomirski
  2025-11-14 17:22   ` Andy Lutomirski
  31 siblings, 1 reply; 48+ messages in thread
From: Andy Lutomirski @ 2025-11-14 16:20 UTC (permalink / raw)
  To: Valentin Schneider, Linux Kernel Mailing List, linux-mm, rcu,
	the arch/x86 maintainers, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra (Intel), Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik, Shrikanth Hegde



On Fri, Nov 14, 2025, at 7:01 AM, Valentin Schneider wrote:
> Context
> =======
>
> We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
> pure-userspace application get regularly interrupted by IPIs sent from
> housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
> leading to various on_each_cpu() calls, e.g.:
>

> The heart of this series is the thought that while we cannot remove NOHZ_FULL
> CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
> the callbacks immediately. Anything that only affects kernelspace can wait
> until the next user->kernel transition, providing it can be executed "early
> enough" in the entry code.
>

I want to point out that there's another option here, although anyone trying to implement it would be fighting against quite a lot of history.

Logically, each CPU is in one of a handful of states: user mode, idle, normal kernel mode (possibly subdivided into IRQ, etc), and a handful of very narrow windows, hopefully uninstrumented and not accessing any PTEs that might be invalid, in the entry and exit paths where any state in memory could be out of sync with actual CPU state.  (The latter includes right after the CPU switches to kernel mode, for example.)  And NMI and MCE and whatever weird "security" entry types that Intel and AMD love to add.

The way the kernel *currently* deals with this has two big historical oddities:

1. The entry and exit code cares about ti_flags, which is per-*task*, which means that atomically poking it from other CPUs involves the runqueue lock or other shenanigans (see the idle nr_polling code for example), and also that it's not accessible from the user page tables if PTI is on.

2. The actual heavyweight atomic part (context tracking) was built for RCU, and it's sort of bolted on, and, as you've observed in this series, it's really quite awkward to do things that aren't RCU using context tracking.

If this were a greenfield project, I think there's a straightforward approach that's much nicer: stick everything into a single percpu flags structure.  Imagine we have cpu_flags, which tracks both the current state of the CPU and what work needs to be done on state changes.  On exit to user mode, we would atomically set the mode to USER and make sure we don't touch anything like vmalloc space after that.  On entry back to kernel mode, we would avoid vmalloc space, etc, then atomically switch to kernel mode and read out whatever deferred work is needed.  As an optimization, if nothing in the current configuration needs atomic state tracking, the state could be left at USER_OR_KERNEL and the overhead of an extra atomic op at entry and exit could be avoided.

And RCU would hook into *that* instead of having its own separate set of hooks.

I think that actually doing this would be a big improvement and would also be a serious project.  There's a lot of code that would get touched, and the existing context tracking code is subtle and confusing.  And, as mentioned, ti_flags has the wrong scope.

It's *possible* that one could avoid making ti_flags percpu either by extensive use of the runqueue locks or by borrowing a kludge from the idle code.  For the latter, right now, the reason that the wake-from-idle code works is that the optimized path only happens if the idle thread/cpu is "polling", and it's impossible for the idle ti_flags to be polling while the CPU isn't actually idle.  We could similarly observe that, if a ti_flags says it's in USER mode *and* is on, say, cpu 3, then cpu 3 is most definitely in USER mode.  So someone could try shoving the CPU number into ti_flags :-p   (USER means actually user or in the late exit / early entry path.)

Anyway, benefits of this whole approach would include considerably (IMO) increased comprehensibility compared to the current tangled ct code and much more straightforward addition of new things that happen to a target CPU conditionally depending on its mode.  And, if the flags word was actually per cpu, it could be mapped such that SWITCH_TO_KERNEL_CR3 would use it -- there could be a single CR3 write (and maybe CR4/invpcid depending on whether a zapped mapping is global) and the flush bit could depend on whether a flush is needed.  And there would be basically no chance that a bug that accessed invalidated-but-not-flushed kernel data could be undetected -- in PTI mode, any such access would page fault!  Similarly, if kernel text pokes deferred the flush and serialization, the only code that could execute before noticing the deferred flush would be the user-CR3 code.

Oh, and another primitive would be possible: one CPU could plausibly execute another CPU's interrupts or soft-irqs or whatever by taking a special lock that would effectively pin the remote CPU in user mode -- you'd set a flag in the target cpu_flags saying "pin in USER mode" and the transition on that CPU to kernel mode would then spin on entry to kernel mode and wait for the lock to be released. This could plausibly get a lot of the on_each_cpu callers to switch over in one fell swoop: anything that needs to synchronize to the remote CPU but does not need to poke its actual architectural state could be executed locally while the remote CPU is pinned.
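
A minimal sketch of what that pinning handshake could look like
(hypothetical names, not an existing API; one pinner at a time):

struct cpu_flags {
	atomic_t state;		/* KERNEL / USER, plus a PINNED bit */
};

#define CPU_STATE_KERNEL	0
#define CPU_STATE_USER		1
#define CPU_STATE_PINNED	2

/* Housekeeping CPU: pin the target in user mode, then do its work locally */
static bool try_pin_remote_user(struct cpu_flags *cf)
{
	int old = CPU_STATE_USER;

	/* Only succeeds while the target really is in user mode */
	return atomic_try_cmpxchg(&cf->state, &old,
				  CPU_STATE_USER | CPU_STATE_PINNED);
}

static void unpin_remote_user(struct cpu_flags *cf)
{
	atomic_andnot(CPU_STATE_PINNED, &cf->state);
}

/* Target CPU, early kernel entry: only proceed once nobody holds the pin */
static void cpu_flags_enter_kernel(struct cpu_flags *cf)
{
	int old = CPU_STATE_USER;

	while (!atomic_try_cmpxchg(&cf->state, &old, CPU_STATE_KERNEL)) {
		cpu_relax();
		old = CPU_STATE_USER;
	}
}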

--Andy

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-11-14 16:20 ` [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Andy Lutomirski
@ 2025-11-14 17:22   ` Andy Lutomirski
  2025-11-14 18:14     ` Paul E. McKenney
  0 siblings, 1 reply; 48+ messages in thread
From: Andy Lutomirski @ 2025-11-14 17:22 UTC (permalink / raw)
  To: Valentin Schneider, Linux Kernel Mailing List, linux-mm, rcu,
	the arch/x86 maintainers, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra (Intel), Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik, Shrikanth Hegde



On Fri, Nov 14, 2025, at 8:20 AM, Andy Lutomirski wrote:
> On Fri, Nov 14, 2025, at 7:01 AM, Valentin Schneider wrote:
>> Context
>> =======
>>
>> We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
>> pure-userspace application get regularly interrupted by IPIs sent from
>> housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
>> leading to various on_each_cpu() calls, e.g.:
>>
>
>> The heart of this series is the thought that while we cannot remove NOHZ_FULL
>> CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
>> the callbacks immediately. Anything that only affects kernelspace can wait
>> until the next user->kernel transition, providing it can be executed "early
>> enough" in the entry code.
>>
>
> I want to point out that there's another option here, although anyone 
> trying to implement it would be fighting against quite a lot of history.
>
> Logically, each CPU is in one of a handful of states: user mode, idle, 
> normal kernel mode (possibly subdivided into IRQ, etc), and a handful 
> of very narrow windows, hopefully uninstrumented and not accessing any 
> PTEs that might be invalid, in the entry and exit paths where any state 
> in memory could be out of sync with actual CPU state.  (The latter 
> includes right after the CPU switches to kernel mode, for example.)  
> And NMI and MCE and whatever weird "security" entry types that Intel 
> and AMD love to add.
>
> The way the kernel *currently* deals with this has two big historical oddities:
>
> 1. The entry and exit code cares about ti_flags, which is per-*task*, 
> which means that atomically poking it from other CPUs involves the 
> runqueue lock or other shenanigans (see the idle nr_polling code for 
> example), and also that it's not accessible from the user page tables 
> if PTI is on.
>
> 2. The actual heavyweight atomic part (context tracking) was built for 
> RCU, and it's sort or bolted on, and, as you've observed in this 
> series, it's really quite awkward to do things that aren't RCU using 
> context tracking.
>
> If this were a greenfield project, I think there's a straightforward 
> approach that's much nicer: stick everything into a single percpu flags 
> structure.  Imagine we have cpu_flags, which tracks both the current 
> state of the CPU and what work needs to be done on state changes.  On 
> exit to user mode, we would atomically set the mode to USER and make 
> sure we don't touch anything like vmalloc space after that.  On entry 
> back to kernel mode, we would avoid vmalloc space, etc, then atomically 
> switch to kernel mode and read out whatever deferred work is needed.  
> As an optimization, if nothing in the current configuration needs 
> atomic state tracking, the state could be left at USER_OR_KERNEL and 
> the overhead of an extra atomic op at entry and exit could be avoided.
>
> And RCU would hook into *that* instead of having its own separate set of hooks.
>
> I think that actually doing this would be a big improvement and would 
> also be a serious project.  There's a lot of code that would get 
> touched, and the existing context tracking code is subtle and 
> confusing.  And, as mentioned, ti_flags has the wrong scope.
>
> It's *possible* that one could avoid making ti_flags percpu either by 
> extensive use of the runqueue locks or by borrowing a kludge from the 
> idle code.  For the latter, right now, the reason that the 
> wake-from-idle code works is that the optimized path only happens if 
> the idle thread/cpu is "polling", and it's impossible for the idle 
> ti_flags to be polling while the CPU isn't actually idle.  We could 
> similarly observe that, if a ti_flags says it's in USER mode *and* is 
> on, say, cpu 3, then cpu 3 is most definitely in USER mode.  So someone 
> could try shoving the CPU number into ti_flags :-p   (USER means 
> actually user or in the late exit / early entry path.)
>
> Anyway, benefits of this whole approach would include considerably 
> (IMO) increased comprehensibility compared to the current tangled ct 
> code and much more straightforward addition of new things that happen 
> to a target CPU conditionally depending on its mode.  And, if the flags 
> word was actually per cpu, it could be mapped such that 
> SWITCH_TO_KERNEL_CR3 would use it -- there could be a single CR3 write 
> (and maybe CR4/invpcid depending on whether a zapped mapping is global) 
> and the flush bit could depend on whether a flush is needed.  And there 
> would be basically no chance that a bug that accessed 
> invalidated-but-not-flushed kernel data could be undetected -- in PTI 
> mode, any such access would page fault!  Similarly, if kernel text 
> pokes deferred the flush and serialization, the only code that could 
> execute before noticing the deferred flush would be the user-CR3 code.
>
> Oh, and another primitive would be possible: one CPU could plausibly 
> execute another CPU's interrupts or soft-irqs or whatever by taking a 
> special lock that would effectively pin the remote CPU in user mode -- 
> you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
> the transition on that CPU to kernel mode would then spin on entry to 
> kernel mode and wait for the lock to be released.  This could plausibly 
> get a lot of the on_each_cpu callers to switch over in one fell swoop: 
> anything that needs to synchronize to the remote CPU but does not need 
> to poke its actual architectural state could be executed locally while 
> the remote CPU is pinned.

Following up, I think that x86 can do this all with a single atomic (in the common case) per usermode round trip.  Imagine:

struct fancy_cpu_state {
  u32 work; // <-- writable by any CPU
  u32 status; // <-- readable anywhere; writable locally
};

status includes KERNEL, USER, and maybe INDETERMINATE.  (INDETERMINATE means USER but we're not committing to doing work.)

Exit to user mode:

atomic_set(&my_state->status, USER);

(or, in the lazy case, set to INDETERMINATE instead.)

Entry from user mode, with IRQs off, before switching to kernel CR3:

if (my_state->status == INDETERMINATE) {
  // we were lazy and we never promised to do work atomically.
  atomic_set(&my_state->status, KERNEL);
  this_entry_work = 0;
} else {
  // we were not lazy and we promised we would do work atomically
  atomic exchange the entire state to { .work = 0, .status = KERNEL }
  this_entry_work = (whatever we just read);
}

if (PTI) {
  switch to kernel CR3 *and flush if this_entry_work says to flush*
} else {
  flush if this_entry_work says to flush;
}

do the rest of the work;
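
(The remote side of this - setting a deferred-work bit - would presumably be
a cmpxchg loop over the combined word, much like ct_set_cpu_work() in this
series.  Hypothetical sketch, assuming work and status share one 64-bit word
with status in the upper half:)

static bool set_remote_work(atomic64_t *state, u32 work)
{
	s64 old = atomic64_read(state);
	s64 new;

	do {
		/* Only defer if the target committed to doing work atomically */
		if (((u64)old >> 32) != USER)
			return false;	/* caller falls back to sending an IPI */
		new = old | work;
	} while (!atomic64_try_cmpxchg(state, &old, new));

	return true;
}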



I suppose that a lot of the stuff in ti_flags could merge into here, but it could be done one bit at a time when people feel like doing so.  And I imagine, but I'm very far from confident, that RCU could use this instead of the current context tracking code.


The idea behind INDETERMINATE is that there are plenty of workloads that frequently switch between user and kernel mode and that would rather accept a few IPIs to avoid the heavyweight atomic operation on user -> kernel transitions.  So the default behavior could be to do KERNEL -> INDETERMINATE instead of KERNEL -> USER, but code that wants to be in user mode for a long time could go all the way to USER.  We could make it sort of automatic by noticing that we're returning from an IRQ without a context switch and go to USER (so we would get at most one unneeded IPI per normal user entry), and we could have some nice API for a program that intends to hang out in user mode for a very long time (cpu isolation users, for example) to tell the kernel to go immediately into USER mode.  (Don't we already have something that could be used for this purpose?)

Hmm, now I wonder if it would make sense for the default behavior of Linux to be like that.  We could call it ONEHZ.  It's like NOHZ_FULL except that user threads that don't do syscalls get one single timer tick instead of many or none.


Anyway, I think my proposal is pretty good *if* RCU could be made to use it -- the existing context tracking code is fairly expensive, and I don't think we want to invent a new context-tracking-like mechanism if we still need to do the existing thing.

--Andy

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-11-14 17:22   ` Andy Lutomirski
@ 2025-11-14 18:14     ` Paul E. McKenney
  2025-11-14 18:45       ` Andy Lutomirski
  0 siblings, 1 reply; 48+ messages in thread
From: Paul E. McKenney @ 2025-11-14 18:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Valentin Schneider, Linux Kernel Mailing List, linux-mm, rcu,
	the arch/x86 maintainers, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra (Intel), Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

On Fri, Nov 14, 2025 at 09:22:35AM -0800, Andy Lutomirski wrote:
> 
> 
> On Fri, Nov 14, 2025, at 8:20 AM, Andy Lutomirski wrote:
> > On Fri, Nov 14, 2025, at 7:01 AM, Valentin Schneider wrote:
> >> Context
> >> =======
> >>
> >> We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
> >> pure-userspace application get regularly interrupted by IPIs sent from
> >> housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
> >> leading to various on_each_cpu() calls, e.g.:
> >>
> >
> >> The heart of this series is the thought that while we cannot remove NOHZ_FULL
> >> CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
> >> the callbacks immediately. Anything that only affects kernelspace can wait
> >> until the next user->kernel transition, providing it can be executed "early
> >> enough" in the entry code.
> >>
> >
> > I want to point out that there's another option here, although anyone 
> > trying to implement it would be fighting against quite a lot of history.
> >
> > Logically, each CPU is in one of a handful of states: user mode, idle, 
> > normal kernel mode (possibly subdivided into IRQ, etc), and a handful 
> > of very narrow windows, hopefully uninstrumented and not accessing any 
> > PTEs that might be invalid, in the entry and exit paths where any state 
> > in memory could be out of sync with actual CPU state.  (The latter 
> > includes right after the CPU switches to kernel mode, for example.)  
> > And NMI and MCE and whatever weird "security" entry types that Intel 
> > and AMD love to add.
> >
> > The way the kernel *currently* deals with this has two big historical oddities:
> >
> > 1. The entry and exit code cares about ti_flags, which is per-*task*, 
> > which means that atomically poking it from other CPUs involves the 
> > runqueue lock or other shenanigans (see the idle nr_polling code for 
> > example), and also that it's not accessible from the user page tables 
> > if PTI is on.
> >
> > 2. The actual heavyweight atomic part (context tracking) was built for 
> > RCU, and it's sort or bolted on, and, as you've observed in this 
> > series, it's really quite awkward to do things that aren't RCU using 
> > context tracking.
> >
> > If this were a greenfield project, I think there's a straightforward 
> > approach that's much nicer: stick everything into a single percpu flags 
> > structure.  Imagine we have cpu_flags, which tracks both the current 
> > state of the CPU and what work needs to be done on state changes.  On 
> > exit to user mode, we would atomically set the mode to USER and make 
> > sure we don't touch anything like vmalloc space after that.  On entry 
> > back to kernel mode, we would avoid vmalloc space, etc, then atomically 
> > switch to kernel mode and read out whatever deferred work is needed.  
> > As an optimization, if nothing in the current configuration needs 
> > atomic state tracking, the state could be left at USER_OR_KERNEL and 
> > the overhead of an extra atomic op at entry and exit could be avoided.
> >
> > And RCU would hook into *that* instead of having its own separate set of hooks.

Please note that RCU needs to sample a given CPU's idle state from other
CPUs, and to have pretty heavy-duty ordering guarantees.  This is needed
to avoid RCU needing to wake up idle CPUs on the one hand or relying on
scheduling-clock interrupts waking up idle CPUs on the other.

Or am I missing the point of your suggestion?

> > I think that actually doing this would be a big improvement and would 
> > also be a serious project.  There's a lot of code that would get 
> > touched, and the existing context tracking code is subtle and 
> > confusing.  And, as mentioned, ti_flags has the wrong scope.

Serious care would certainly be needed!  ;-)

> > It's *possible* that one could avoid making ti_flags percpu either by 
> > extensive use of the runqueue locks or by borrowing a kludge from the 
> > idle code.  For the latter, right now, the reason that the 
> > wake-from-idle code works is that the optimized path only happens if 
> > the idle thread/cpu is "polling", and it's impossible for the idle 
> > ti_flags to be polling while the CPU isn't actually idle.  We could 
> > similarly observe that, if a ti_flags says it's in USER mode *and* is 
> > on, say, cpu 3, then cpu 3 is most definitely in USER mode.  So someone 
> > could try shoving the CPU number into ti_flags :-p   (USER means 
> > actually user or in the late exit / early entry path.)
> >
> > Anyway, benefits of this whole approach would include considerably 
> > (IMO) increased comprehensibility compared to the current tangled ct 
> > code and much more straightforward addition of new things that happen 
> > to a target CPU conditionally depending on its mode.  And, if the flags 
> > word was actually per cpu, it could be mapped such that 
> > SWITCH_TO_KERNEL_CR3 would use it -- there could be a single CR3 write 
> > (and maybe CR4/invpcid depending on whether a zapped mapping is global) 
> > and the flush bit could depend on whether a flush is needed.  And there 
> > would be basically no chance that a bug that accessed 
> > invalidated-but-not-flushed kernel data could be undetected -- in PTI 
> > mode, any such access would page fault!  Similarly, if kernel text 
> > pokes deferred the flush and serialization, the only code that could 
> > execute before noticing the deferred flush would be the user-CR3 code.
> >
> > Oh, any another primitive would be possible: one CPU could plausibly 
> > execute another CPU's interrupts or soft-irqs or whatever by taking a 
> > special lock that would effectively pin the remote CPU in user mode -- 
> > you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
> > the transition on that CPU to kernel mode would then spin on entry to 
> > kernel mode and wait for the lock to be released.  This could plausibly 
> > get a lot of the on_each_cpu callers to switch over in one fell swoop: 
> > anything that needs to synchronize to the remote CPU but does not need 
> > to poke its actual architectural state could be executed locally while 
> > the remote CPU is pinned.

It would be necessary to arrange for the remote CPU to remain pinned
while the local CPU executed on its behalf.  Does the above approach
make that happen without re-introducing our current context-tracking
overhead and complexity?

> Following up, I think that x86 can do this all with a single atomic (in the common case) per usermode round trip.  Imagine:
> 
> struct fancy_cpu_state {
>   u32 work; // <-- writable by any CPU
>   u32 status; // <-- readable anywhere; writable locally
> };
> 
> status includes KERNEL, USER, and maybe INDETERMINATE.  (INDETERMINATE means USER but we're not committing to doing work.)
> 
> Exit to user mode:
> 
> atomic_set(&my_state->status, USER);

We need ordering in the RCU nohz_full case.  If the grace-period kthread
sees the status as USER, all the preceding KERNEL code's effects must
be visible to the grace-period kthread.

> (or, in the lazy case, set to INDETERMINATE instead.)
> 
> Entry from user mode, with IRQs off, before switching to kernel CR3:
> 
> if (my_state->status == INDETERMINATE) {
>   // we were lazy and we never promised to do work atomically.
>   atomic_set(&my_state->status, KERNEL);
>   this_entry_work = 0;
> } else {
>   // we were not lazy and we promised we would do work atomically
>   atomic exchange the entire state to { .work = 0, .status = KERNEL }
>   this_entry_work = (whatever we just read);
> }

If this atomic exchange is fully ordered (as opposed to, say, _relaxed),
then this works in that if the grace-period kthread sees USER, its prior
references are guaranteed not to see later kernel-mode references from
that CPU.

> if (PTI) {
>   switch to kernel CR3 *and flush if this_entry_work says to flush*
> } else {
>   flush if this_entry_work says to flush;
> }
> 
> do the rest of the work;
> 
> 
> 
> I suppose that a lot of the stuff in ti_flags could merge into here, but it could be done one bit at a time when people feel like doing so.  And I imagine, but I'm very far from confident, that RCU could use this instead of the current context tracking code.

RCU currently needs pretty heavy-duty ordering to reliably detect the
other CPUs' quiescent states without needing to wake them from idle, or,
in the nohz_full case, interrupt their userspace execution.  Not saying
it is impossible, but it will need extreme care.

> The idea behind INDETERMINATE is that there are plenty of workloads that frequently switch between user and kernel mode and that would rather accept a few IPIs to avoid the heavyweight atomic operation on user -> kernel transitions.  So the default behavior could be to do KERNEL -> INDETERMINATE instead of KERNEL -> USER, but code that wants to be in user mode for a long time could go all the way to USER.  We could make it sort of automatic by noticing that we're returning from an IRQ without a context switch and go to USER (so we would get at most one unneeded IPI per normal user entry), and we could have some nice API for a program that intends to hang out in user mode for a very long time (cpu isolation users, for example) to tell the kernel to go immediately into USER mode.  (Don't we already have something that could be used for this purpose?)

RCU *could* do an smp_call_function_single() when the CPU failed
to respond, perhaps in a manner similar to how it already forces a
given CPU out of nohz_full state if that CPU has been executing in the
kernel for too long.  The real-time guys might not be amused, though.
Especially those real-time guys hitting sub-microsecond latencies.

> Hmm, now I wonder if it would make sense for the default behavior of Linux to be like that.  We could call it ONEHZ.  It's like NOHZ_FULL except that user threads that don't do syscalls get one single timer tick instead of many or none.
> 
> 
> Anyway, I think my proposal is pretty good *if* RCU could be made to use it -- the existing context tracking code is fairly expensive, and I don't think we want to invent a new context-tracking-like mechanism if we still need to do the existing thing.

If you build with CONFIG_NO_HZ_FULL=n, do you still get the heavyweight
operations when transitioning between kernel and user execution?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-11-14 18:14     ` Paul E. McKenney
@ 2025-11-14 18:45       ` Andy Lutomirski
  2025-11-14 20:03         ` Paul E. McKenney
  2025-11-14 20:06         ` Thomas Gleixner
  0 siblings, 2 replies; 48+ messages in thread
From: Andy Lutomirski @ 2025-11-14 18:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Valentin Schneider, Linux Kernel Mailing List, linux-mm, rcu,
	the arch/x86 maintainers, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra (Intel), Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde



On Fri, Nov 14, 2025, at 10:14 AM, Paul E. McKenney wrote:
> On Fri, Nov 14, 2025 at 09:22:35AM -0800, Andy Lutomirski wrote:
>> 

>> > Oh, any another primitive would be possible: one CPU could plausibly 
>> > execute another CPU's interrupts or soft-irqs or whatever by taking a 
>> > special lock that would effectively pin the remote CPU in user mode -- 
>> > you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
>> > the transition on that CPU to kernel mode would then spin on entry to 
>> > kernel mode and wait for the lock to be released.  This could plausibly 
>> > get a lot of the on_each_cpu callers to switch over in one fell swoop: 
>> > anything that needs to synchronize to the remote CPU but does not need 
>> > to poke its actual architectural state could be executed locally while 
>> > the remote CPU is pinned.
>
> It would be necessary to arrange for the remote CPU to remain pinned
> while the local CPU executed on its behalf.  Does the above approach
> make that happen without re-introducing our current context-tracking
> overhead and complexity?

Using the pseudo-implementation farther down, I think this would be like:

if (my_state->status == INDETERMINATE) {
   // we were lazy and we never promised to do work atomically.
   atomic_set(&my_state->status, KERNEL);
   this_entry_work = 0;
   /* we are definitely not pinned in this path */
} else {
   // we were not lazy and we promised we would do work atomically
   atomic exchange the entire state to { .work = 0, .status = KERNEL }
   this_entry_work = (whatever we just read);
   if (this_entry_work & PINNED) {
     u32 this_cpu_pin_count = this_cpu_ptr(pin_count);
     while (atomic_read(&this_cpu_pin_count)) {
       cpu_relax();
     }
   }
 }

and we'd have something like:

bool try_pin_remote_cpu(int cpu)
{
    u32 *remote_pin_count = ...;
    struct fancy_cpu_state *remote_state = ...;
    atomic_inc(remote_pin_count);  // optimistic

    // Hmm, we do not want that read to get reordered with the inc, so we probably
    // need a full barrier or seq_cst.  How does Linux spell that?  C++ has atomic::load
    // with seq_cst and maybe the optimizer can do the right thing.  Maybe it's:
    smp_mb__after_atomic();

    if (atomic_read(&remote_state->status) == USER) {
      // Okay, it's genuinely pinned.
      return true;

      // egads, if this is some arch with very weak ordering,
      // do we need to be concerned that we just took a lock but we
      // just did a relaxed read and therefore a subsequent access
      // that thinks it's locked might appear to precede the load and therefore
      // somehow get surprisingly seen out of order by the target cpu?
      // maybe we wanted atomic_read_acquire above instead?
    } else {
      // We might not have successfully pinned it
      atomic_dec(remote_pin_count);
    }
}

void unpin_remote_cpu(int cpu)
{
    atomic_dec(remote_pin_count);
}

and we'd use it like:

if (try_pin_remote_cpu(cpu)) {
  // do something useful
} else {
  send IPI;
}

but we'd really accumulate the set of CPUs that need the IPIs and do them all at once.

I ran the theorem prover that lives inside my head on this code using the assumption that the machine is a well-behaved x86 system and it said "yeah, looks like it might be correct".  I trust an actual formalized system or someone like you who is genuinely very good at this stuff much more than I trust my initial impression :)
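
Tidied up with Linux per-CPU and atomic helpers, and reusing the fancy_cpu_state sketch from earlier (still purely illustrative -- none of these symbols are existing kernel APIs -- but note that smp_mb__after_atomic() after a non-value-returning atomic like atomic_inc() is indeed the Linux spelling of that full barrier):

static DEFINE_PER_CPU(atomic_t, fancy_pin_count);
static DEFINE_PER_CPU(struct fancy_cpu_state, fancy_state);

bool try_pin_remote_cpu(int cpu)
{
	atomic_t *pin = per_cpu_ptr(&fancy_pin_count, cpu);
	struct fancy_cpu_state *state = per_cpu_ptr(&fancy_state, cpu);

	atomic_inc(pin);	/* optimistic */
	/*
	 * Non-value-returning atomics are unordered; this provides the full
	 * barrier wanted between the inc and the status read.
	 */
	smp_mb__after_atomic();

	if (READ_ONCE(state->status) == USER)
		return true;	/* genuinely pinned: it will spin on kernel entry */

	/* It may already be (or be entering) the kernel: undo, fall back to an IPI. */
	atomic_dec(pin);
	return false;
}

void unpin_remote_cpu(int cpu)
{
	atomic_dec(per_cpu_ptr(&fancy_pin_count, cpu));
}

/* Kernel-entry side, called when the exchanged work word had the PINNED bit set: */
static void wait_for_pinners(void)
{
	while (atomic_read(this_cpu_ptr(&fancy_pin_count)))
		cpu_relax();
}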

>
>> Following up, I think that x86 can do this all with a single atomic (in the common case) per usermode round trip.  Imagine:
>> 
>> struct fancy_cpu_state {
>>   u32 work; // <-- writable by any CPU
>>   u32 status; // <-- readable anywhere; writable locally
>> };
>> 
>> status includes KERNEL, USER, and maybe INDETERMINATE.  (INDETERMINATE means USER but we're not committing to doing work.)
>> 
>> Exit to user mode:
>> 
>> atomic_set(&my_state->status, USER);
>
> We need ordering in the RCU nohz_full case.  If the grace-period kthread
> sees the status as USER, all the preceding KERNEL code's effects must
> be visible to the grace-period kthread.

Sorry, I'm speaking lazy x86 programmer here.  Maybe I mean atomic_set_release.  I want, roughly, the property that anyone who remotely observes USER can rely on the target cpu subsequently going through the atomic exchange path above.  I think even relaxed ought to be good enough for that on most architectures, but there are some potentially nasty complications involving the fact that this mixes operations on a double word and a single word that's part of the double word.

>
>> (or, in the lazy case, set to INDETERMINATE instead.)
>> 
>> Entry from user mode, with IRQs off, before switching to kernel CR3:
>> 
>> if (my_state->status == INDETERMINATE) {
>>   // we were lazy and we never promised to do work atomically.
>>   atomic_set(&my_state->status, KERNEL);
>>   this_entry_work = 0;
>> } else {
>>   // we were not lazy and we promised we would do work atomically
>>   atomic exchange the entire state to { .work = 0, .status = KERNEL }
>>   this_entry_work = (whatever we just read);
>> }
>
> If this atomic exchange is fully ordered (as opposed to, say, _relaxed),
> then this works in that if the grace-period kthread sees USER, its prior
> references are guaranteed not to see later kernel-mode references from
> that CPU.

Yep, that's the intent of my pseudocode.  On x86 this would be a plain 64-bit lock xchg -- I don't think cmpxchg is needed.  (I like to write lock even when it's implicit to avoid needing to trust myself to remember precisely which instructions imply it.)
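
Spelled out for x86-64, a sketch of that entry path (the union layout, constants and names are assumptions, not the real code; remote writers of ->work would need their own 32-bit atomics, which is exactly the mixed-size wrinkle above):

union fancy_state_word {
	struct fancy_cpu_state s;	/* { u32 work; u32 status; } from the earlier mail */
	u64 full;
};

static DEFINE_PER_CPU(union fancy_state_word, fancy_state_word);

/* Entry from user mode, IRQs off: */
static u32 fancy_enter_kernel(void)
{
	union fancy_state_word *st = this_cpu_ptr(&fancy_state_word);
	union fancy_state_word newv = { .s = { .work = 0, .status = KERNEL } };
	union fancy_state_word old;

	if (READ_ONCE(st->s.status) == INDETERMINATE) {
		/* Lazy path: we never promised to pick up deferred work atomically. */
		WRITE_ONCE(st->s.status, KERNEL);
		return 0;
	}

	/* Fully ordered 64-bit exchange -- a single lock xchg on x86. */
	old.full = xchg(&st->full, newv.full);
	return old.s.work;	/* deferred-work bits to process */
}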

>
>> if (PTI) {
>>   switch to kernel CR3 *and flush if this_entry_work says to flush*
>> } else {
>>   flush if this_entry_work says to flush;
>> }
>> 
>> do the rest of the work;
>> 
>> 
>> 
>> I suppose that a lot of the stuff in ti_flags could merge into here, but it could be done one bit at a time when people feel like doing so.  And I imagine, but I'm very far from confident, that RCU could use this instead of the current context tracking code.
>
> RCU currently needs pretty heavy-duty ordering to reliably detect the
> other CPUs' quiescent states without needing to wake them from idle, or,
> in the nohz_full case, interrupt their userspace execution.  Not saying
> it is impossible, but it will need extreme care.

Is the thingy above heavy-duty enough?  Perhaps more relevantly, could RCU do this *instead* of the current CT hooks on architectures that implement it, and/or could RCU arrange for the ct hooks to be cheap no-ops on architectures that support the thingy above?  By "could" I mean "could be done without absolutely massive refactoring and without the resulting code being an unmaintainable disaster?"  (I'm sure any sufficiently motivated human or LLM could pull off the unmaintainable disaster version.  I bet I could even do it myself!)

>
>> The idea behind INDETERMINATE is that there are plenty of workloads that frequently switch between user and kernel mode and that would rather accept a few IPIs to avoid the heavyweight atomic operation on user -> kernel transitions.  So the default behavior could be to do KERNEL -> INDETERMINATE instead of KERNEL -> USER, but code that wants to be in user mode for a long time could go all the way to USER.  We could make it sort of automatic by noticing that we're returning from an IRQ without a context switch and go to USER (so we would get at most one unneeded IPI per normal user entry), and we could have some nice API for a program that intends to hang out in user mode for a very long time (cpu isolation users, for example) to tell the kernel to go immediately into USER mode.  (Don't we already have something that could be used for this purpose?)
>
> RCU *could* do an smp_call_function_single() when the CPU failed
> to respond, perhaps in a manner similar to how it already forces a
> given CPU out of nohz_full state if that CPU has been executing in the
> kernel for too long.  The real-time guys might not be amused, though.
> Especially those real-time guys hitting sub-microsecond latencies.

It wouldn't be outrageous to have real-time imply the full USER transition.

>
>> Hmm, now I wonder if it would make sense for the default behavior of Linux to be like that.  We could call it ONEHZ.  It's like NOHZ_FULL except that user threads that don't do syscalls get one single timer tick instead of many or none.
>> 
>> 
>> Anyway, I think my proposal is pretty good *if* RCU could be made to use it -- the existing context tracking code is fairly expensive, and I don't think we want to invent a new context-tracking-like mechanism if we still need to do the existing thing.
>
> If you build with CONFIG_NO_HZ_FULL=n, do you still get the heavyweight
> operations when transitioning between kernel and user execution?

No.  And I even tested approximately this a couple weeks ago for unrelated reasons.  The only unnecessary heavy-weight thing we're doing in a syscall loop with mitigations off is the RDTSC for stack randomization.

But I did that test on a machine that would absolutely benefit from the IPI suppression that the OP is talking about here, and I think it would be really quite nice if a default distro kernel with a more or less default distribution could be easily convinced to run a user thread without interrupting it.  It's a little bit had that NO_HZ_FULL is still an exotic non-default thing, IMO.

--Andy

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-11-14 18:45       ` Andy Lutomirski
@ 2025-11-14 20:03         ` Paul E. McKenney
  2025-11-15  0:29           ` Andy Lutomirski
  2025-11-14 20:06         ` Thomas Gleixner
  1 sibling, 1 reply; 48+ messages in thread
From: Paul E. McKenney @ 2025-11-14 20:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Valentin Schneider, Linux Kernel Mailing List, linux-mm, rcu,
	the arch/x86 maintainers, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra (Intel), Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

On Fri, Nov 14, 2025 at 10:45:08AM -0800, Andy Lutomirski wrote:
> 
> 
> On Fri, Nov 14, 2025, at 10:14 AM, Paul E. McKenney wrote:
> > On Fri, Nov 14, 2025 at 09:22:35AM -0800, Andy Lutomirski wrote:
> >> 
> 
> >> > Oh, any another primitive would be possible: one CPU could plausibly 
> >> > execute another CPU's interrupts or soft-irqs or whatever by taking a 
> >> > special lock that would effectively pin the remote CPU in user mode -- 
> >> > you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
> >> > the transition on that CPU to kernel mode would then spin on entry to 
> >> > kernel mode and wait for the lock to be released.  This could plausibly 
> >> > get a lot of the on_each_cpu callers to switch over in one fell swoop: 
> >> > anything that needs to synchronize to the remote CPU but does not need 
> >> > to poke its actual architectural state could be executed locally while 
> >> > the remote CPU is pinned.
> >
> > It would be necessary to arrange for the remote CPU to remain pinned
> > while the local CPU executed on its behalf.  Does the above approach
> > make that happen without re-introducing our current context-tracking
> > overhead and complexity?
> 
> Using the pseudo-implementation farther down, I think this would be like:
> 
> if (my_state->status == INDETERMINATE) {
>    // we were lazy and we never promised to do work atomically.
>    atomic_set(&my_state->status, KERNEL);
>    this_entry_work = 0;
>    /* we are definitely not pinned in this path */
> } else {
>    // we were not lazy and we promised we would do work atomically
>    atomic exchange the entire state to { .work = 0, .status = KERNEL }
>    this_entry_work = (whatever we just read);
>    if (this_entry_work & PINNED) {
>      u32 this_cpu_pin_count = this_cpu_ptr(pin_count);
>      while (atomic_read(&this_cpu_pin_count)) {
>        cpu_relax();
>      }
>    }
>  }
> 
> and we'd have something like:
> 
> bool try_pin_remote_cpu(int cpu)
> {
>     u32 *remote_pin_count = ...;
>     struct fancy_cpu_state *remote_state = ...;
>     atomic_inc(remote_pin_count);  // optimistic
> 
>     // Hmm, we do not want that read to get reordered with the inc, so we probably
>     // need a full barrier or seq_cst.  How does Linux spell that?  C++ has atomic::load
>     // with seq_cst and maybe the optimizer can do the right thing.  Maybe it's:
>     smp_mb__after_atomic();
> 
>     if (atomic_read(&remote_state->status) == USER) {
>       // Okay, it's genuinely pinned.
>       return true;
> 
>       // egads, if this is some arch with very weak ordering,
>       // do we need to be concerned that we just took a lock but we
>       // just did a relaxed read and therefore a subsequent access
>       // that thinks it's locked might appear to precede the load and therefore
>       // somehow get surprisingly seen out of order by the target cpu?
>       // maybe we wanted atomic_read_acquire above instead?
>     } else {
>       // We might not have successfully pinned it
>       atomic_dec(remote_pin_count);
>     }
> }
> 
> void unpin_remote_cpu(int cpu)
> {
>     atomic_dec(remote_pin_count);
> }
> 
> and we'd use it like:
> 
> if (try_pin_remote_cpu(cpu)) {
>   // do something useful
> } else {
>   send IPI;
> }
> 
> but we'd really accumulate the set of CPUs that need the IPIs and do them all at once.
> 
> I ran the theorem prover that lives inside my head on this code using the assumption that the machine is a well-behaved x86 system and it said "yeah, looks like it might be correct".  I trust an actual formalized system or someone like you who is genuinely very good at this stuff much more than I trust my initial impression :)

Let's start with requirements, non-traditional though that might be. ;-)

An "RCU idle" CPU is either in deep idle or executing in nohz_full
userspace.

1.	If the RCU grace-period kthread sees an RCU-idle CPU, then:

	a.	Everything that this CPU did before entering RCU-idle
		state must be visible to sufficiently later code executed
		by the grace-period kthread, and:

	b.	Everything that the CPU will do after exiting RCU-idle
		state must *not* have been visible to the grace-period
		kthread sufficiently prior to having sampled this
		CPU's state.

2.	If the RCU grace-period kthread sees an RCU-nonidle CPU, then
	it depends on whether this is the same nonidle sojourn as was
	initially seen.  (If the kthread initially saw the CPU in an
	RCU-idle state, it would not have bothered resampling.)

	a.	If this is the same nonidle sojourn, then there are no
		ordering requirements.	RCU must continue to wait on
		this CPU.

	b.	Otherwise, everything that this CPU did before entering
		its last RCU-idle state must be visible to sufficiently
		later code executed by the grace-period kthread.
		Similar to (1a) above.

3.	If a given CPU quickly switches into and out of RCU-idle
	state, and it is always in RCU-nonidle state whenever the RCU
	grace-period kthread looks, RCU must still realize that this
	CPU has passed through at least one quiescent state.

	This is why we have a counter for RCU rather than just a
	simple state.

The usual way to handle (1a) and (2b) is to make the update marking entry
to the RCU-idle state have release semantics and to make the operation
that the RCU grace-period kthread uses to sample the CPU's state have
acquire semantics.

The usual way to handle (1b) is to have a full barrier after the update
marking exit from the RCU-idle state and another full barrier before the
operation that the RCU grace-period kthread uses to sample the CPU's
state.  The "sufficiently" allows some wiggle room on the placement
of both full barriers.  A full barrier can be smp_mb() or some fully
ordered atomic operation.
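
In Linux atomics, on the thread's illustrative per-CPU status word rather than RCU's real counters (and remembering that requirement 3 still demands a counter, not just a state), that pattern is roughly:

/* CPU side, entering RCU-idle (deep idle or nohz_full user): (1a)/(2b). */
static void mark_rcu_idle(atomic_t *status)
{
	/* Release: everything before this store is visible to acquire readers. */
	atomic_set_release(status, USER);
}

/* CPU side, exiting RCU-idle: (1b).  A fully ordered RMW doubles as the
 * "full barrier after the update". */
static int mark_rcu_nonidle(atomic_t *status)
{
	return atomic_xchg(status, KERNEL);
}

/* Grace-period kthread side: full barrier before the sample for (1b),
 * acquire semantics on the sample for (1a)/(2b). */
static bool cpu_looks_rcu_idle(atomic_t *status)
{
	smp_mb();
	return atomic_read_acquire(status) == USER;
}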

I will let Valentin and Frederic check the current code.  ;-)

> >> Following up, I think that x86 can do this all with a single atomic (in the common case) per usermode round trip.  Imagine:
> >> 
> >> struct fancy_cpu_state {
> >>   u32 work; // <-- writable by any CPU
> >>   u32 status; // <-- readable anywhere; writable locally
> >> };
> >> 
> >> status includes KERNEL, USER, and maybe INDETERMINATE.  (INDETERMINATE means USER but we're not committing to doing work.)
> >> 
> >> Exit to user mode:
> >> 
> >> atomic_set(&my_state->status, USER);
> >
> > We need ordering in the RCU nohz_full case.  If the grace-period kthread
> > sees the status as USER, all the preceding KERNEL code's effects must
> > be visible to the grace-period kthread.
> 
> Sorry, I'm speaking lazy x86 programmer here.  Maybe I mean atomic_set_release.  I want, roughly, the property that anyone who remotely observes USER can rely on the target cpu subsequently going through the atomic exchange path above.  I think even relaxed ought to be good enough for that on most architectures, but there are some potentially nasty complications involving the fact that this mixes operations on a double word and a single word that's part of the double word.

Please see the requirements laid out above.

> >> (or, in the lazy case, set to INDETERMINATE instead.)
> >> 
> >> Entry from user mode, with IRQs off, before switching to kernel CR3:
> >> 
> >> if (my_state->status == INDETERMINATE) {
> >>   // we were lazy and we never promised to do work atomically.
> >>   atomic_set(&my_state->status, KERNEL);
> >>   this_entry_work = 0;
> >> } else {
> >>   // we were not lazy and we promised we would do work atomically
> >>   atomic exchange the entire state to { .work = 0, .status = KERNEL }
> >>   this_entry_work = (whatever we just read);
> >> }
> >
> > If this atomic exchange is fully ordered (as opposed to, say, _relaxed),
> > then this works in that if the grace-period kthread sees USER, its prior
> > references are guaranteed not to see later kernel-mode references from
> > that CPU.
> 
> Yep, that's the intent of my pseudocode.  On x86 this would be a plain 64-bit lock xchg -- I don't think cmpxchg is needed.  (I like to write lock even when it's implicit to avoid needing to trust myself to remember precisely which instructions imply it.)

OK, then the atomic exchange provides all the ordering that the update
ever needs.

> >> if (PTI) {
> >>   switch to kernel CR3 *and flush if this_entry_work says to flush*
> >> } else {
> >>   flush if this_entry_work says to flush;
> >> }
> >> 
> >> do the rest of the work;
> >> 
> >> 
> >> 
> >> I suppose that a lot of the stuff in ti_flags could merge into here, but it could be done one bit at a time when people feel like doing so.  And I imagine, but I'm very far from confident, that RCU could use this instead of the current context tracking code.
> >
> > RCU currently needs pretty heavy-duty ordering to reliably detect the
> > other CPUs' quiescent states without needing to wake them from idle, or,
> > in the nohz_full case, interrupt their userspace execution.  Not saying
> > it is impossible, but it will need extreme care.
> 
> Is the thingy above heavy-duty enough?  Perhaps more relevantly, could RCU do this *instead* of the current CT hooks on architectures that implement it, and/or could RCU arrange for the ct hooks to be cheap no-ops on architectures that support the thingy above?  By "could" I mean "could be done without absolutely massive refactoring and without the resulting code being an unmaintainable disaster?"  (I'm sure any sufficiently motivated human or LLM could pull off the unmaintainable disaster version.  I bet I could even do it myself!)

I write broken and unmaintainable code all the time, having more than 50
years of experience doing so.  This is one reason we have rcutorture.  ;-)

RCU used to do its own CT-like hooks.  Merging them into the actual
context-tracking code reduced the entry/exit overhead significantly.

I am not seeing how the third requirement above is met, though.  I have
not verified the sampling code that the RCU grace-period kthread is
supposed to use because I am not seeing it right off-hand.

> >> The idea behind INDETERMINATE is that there are plenty of workloads that frequently switch between user and kernel mode and that would rather accept a few IPIs to avoid the heavyweight atomic operation on user -> kernel transitions.  So the default behavior could be to do KERNEL -> INDETERMINATE instead of KERNEL -> USER, but code that wants to be in user mode for a long time could go all the way to USER.  We could make it sort of automatic by noticing that we're returning from an IRQ without a context switch and go to USER (so we would get at most one unneeded IPI per normal user entry), and we could have some nice API for a program that intends to hang out in user mode for a very long time (cpu isolation users, for example) to tell the kernel to go immediately into USER mode.  (Don't we already have something that could be used for this purpose?)
> >
> > RCU *could* do an smp_call_function_single() when the CPU failed
> > to respond, perhaps in a manner similar to how it already forces a
> > given CPU out of nohz_full state if that CPU has been executing in the
> > kernel for too long.  The real-time guys might not be amused, though.
> > Especially those real-time guys hitting sub-microsecond latencies.
> 
> It wouldn't be outrageous to have real-time imply the full USER transition.

We could have special code for RT, but this of course increases the
complexity.  Which might be justified by sufficient speedup.

> >> Hmm, now I wonder if it would make sense for the default behavior of Linux to be like that.  We could call it ONEHZ.  It's like NOHZ_FULL except that user threads that don't do syscalls get one single timer tick instead of many or none.
> >> 
> >> 
> >> Anyway, I think my proposal is pretty good *if* RCU could be made to use it -- the existing context tracking code is fairly expensive, and I don't think we want to invent a new context-tracking-like mechanism if we still need to do the existing thing.
> >
> > If you build with CONFIG_NO_HZ_FULL=n, do you still get the heavyweight
> > operations when transitioning between kernel and user execution?
> 
> No.  And I even tested approximately this a couple weeks ago for unrelated reasons.  The only unnecessary heavy-weight thing we're doing in a syscall loop with mitigations off is the RDTSC for stack randomization.

OK, so why not just build your kernels with CONFIG_NO_HZ_FULL=n and be happy?

> But I did that test on a machine that would absolutely benefit from the IPI suppression that the OP is talking about here, and I think it would be really quite nice if a default distro kernel with a more or less default distribution could be easily convinced to run a user thread without interrupting it.  It's a little bit had that NO_HZ_FULL is still an exotic non-default thing, IMO.

And yes, many distros enable NO_HZ_FULL by default.  I will refrain from
suggesting additional static branches.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-11-14 18:45       ` Andy Lutomirski
  2025-11-14 20:03         ` Paul E. McKenney
@ 2025-11-14 20:06         ` Thomas Gleixner
  1 sibling, 0 replies; 48+ messages in thread
From: Thomas Gleixner @ 2025-11-14 20:06 UTC (permalink / raw)
  To: Andy Lutomirski, Paul E. McKenney
  Cc: Valentin Schneider, Linux Kernel Mailing List, linux-mm, rcu,
	the arch/x86 maintainers, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra (Intel), Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

On Fri, Nov 14 2025 at 10:45, Andy Lutomirski wrote:
> But I did that test on a machine that would absolutely benefit from
> the IPI suppression that the OP is talking about here, and I think it
> would be really quite nice if a default distro kernel with a more or
> less default distribution could be easily convinced to run a user
> thread without interrupting it.  It's a little bit had that NO_HZ_FULL
> is still an exotic non-default thing, IMO.

Feel free to help with getting it over the finish line. Feeling sad/bad
(or whatever you wanted to say with 'had') does not change anything as
you should know :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-11-14 20:03         ` Paul E. McKenney
@ 2025-11-15  0:29           ` Andy Lutomirski
  2025-11-15  2:30             ` Paul E. McKenney
  0 siblings, 1 reply; 48+ messages in thread
From: Andy Lutomirski @ 2025-11-15  0:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Valentin Schneider, Linux Kernel Mailing List, linux-mm, rcu,
	the arch/x86 maintainers, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra (Intel), Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde



On Fri, Nov 14, 2025, at 12:03 PM, Paul E. McKenney wrote:
> On Fri, Nov 14, 2025 at 10:45:08AM -0800, Andy Lutomirski wrote:
>> 
>> 
>> On Fri, Nov 14, 2025, at 10:14 AM, Paul E. McKenney wrote:
>> > On Fri, Nov 14, 2025 at 09:22:35AM -0800, Andy Lutomirski wrote:
>> >> 
>> 
>> >> > Oh, any another primitive would be possible: one CPU could plausibly 
>> >> > execute another CPU's interrupts or soft-irqs or whatever by taking a 
>> >> > special lock that would effectively pin the remote CPU in user mode -- 
>> >> > you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
>> >> > the transition on that CPU to kernel mode would then spin on entry to 
>> >> > kernel mode and wait for the lock to be released.  This could plausibly 
>> >> > get a lot of the on_each_cpu callers to switch over in one fell swoop: 
>> >> > anything that needs to synchronize to the remote CPU but does not need 
>> >> > to poke its actual architectural state could be executed locally while 
>> >> > the remote CPU is pinned.
>> >
>> > It would be necessary to arrange for the remote CPU to remain pinned
>> > while the local CPU executed on its behalf.  Does the above approach
>> > make that happen without re-introducing our current context-tracking
>> > overhead and complexity?
>> 
>> Using the pseudo-implementation farther down, I think this would be like:
>> 
>> if (my_state->status == INDETERMINATE) {
>>    // we were lazy and we never promised to do work atomically.
>>    atomic_set(&my_state->status, KERNEL);
>>    this_entry_work = 0;
>>    /* we are definitely not pinned in this path */
>> } else {
>>    // we were not lazy and we promised we would do work atomically
>>    atomic exchange the entire state to { .work = 0, .status = KERNEL }
>>    this_entry_work = (whatever we just read);
>>    if (this_entry_work & PINNED) {
>>      u32 this_cpu_pin_count = this_cpu_ptr(pin_count);
>>      while (atomic_read(&this_cpu_pin_count)) {
>>        cpu_relax();
>>      }
>>    }
>>  }
>> 
>> and we'd have something like:
>> 
>> bool try_pin_remote_cpu(int cpu)
>> {
>>     u32 *remote_pin_count = ...;
>>     struct fancy_cpu_state *remote_state = ...;
>>     atomic_inc(remote_pin_count);  // optimistic
>> 
>>     // Hmm, we do not want that read to get reordered with the inc, so we probably
>>     // need a full barrier or seq_cst.  How does Linux spell that?  C++ has atomic::load
>>     // with seq_cst and maybe the optimizer can do the right thing.  Maybe it's:
>>     smp_mb__after_atomic();
>> 
>>     if (atomic_read(&remote_state->status) == USER) {
>>       // Okay, it's genuinely pinned.
>>       return true;
>> 
>>       // egads, if this is some arch with very weak ordering,
>>       // do we need to be concerned that we just took a lock but we
>>       // just did a relaxed read and therefore a subsequent access
>>       // that thinks it's locked might appear to precede the load and therefore
>>       // somehow get surprisingly seen out of order by the target cpu?
>>       // maybe we wanted atomic_read_acquire above instead?
>>     } else {
>>       // We might not have successfully pinned it
>>       atomic_dec(remote_pin_count);
>>     }
>> }
>> 
>> void unpin_remote_cpu(int cpu)
>> {
>>     atomic_dec(remote_pin_count);
>> }
>> 
>> and we'd use it like:
>> 
>> if (try_pin_remote_cpu(cpu)) {
>>   // do something useful
>> } else {
>>   send IPI;
>> }
>> 
>> but we'd really accumulate the set of CPUs that need the IPIs and do them all at once.
>> 
>> I ran the theorem prover that lives inside my head on this code using the assumption that the machine is a well-behaved x86 system and it said "yeah, looks like it might be correct".  I trust an actual formalized system or someone like you who is genuinely very good at this stuff much more than I trust my initial impression :)
>
> Let's start with requirements, non-traditional though that might be. ;-)

That's ridiculous! :-)

>
> An "RCU idle" CPU is either in deep idle or executing in nohz_full
> userspace.
>
> 1.	If the RCU grace-period kthread sees an RCU-idle CPU, then:
>
> 	a.	Everything that this CPU did before entering RCU-idle
> 		state must be visible to sufficiently later code executed
> 		by the grace-period kthread, and:
>
> 	b.	Everything that the CPU will do after exiting RCU-idle
> 		state must *not* have been visible to the grace-period
> 		kthread sufficiently prior to having sampled this
> 		CPU's state.
>
> 2.	If the RCU grace-period kthread sees an RCU-nonidle CPU, then
> 	it depends on whether this is the same nonidle sojourn as was
> 	initially seen.  (If the kthread initially saw the CPU in an
> 	RCU-idle state, it would not have bothered resampling.)
>
> 	a.	If this is the same nonidle sojourn, then there are no
> 		ordering requirements.	RCU must continue to wait on
> 		this CPU.
>
> 	b.	Otherwise, everything that this CPU did before entering
> 		its last RCU-idle state must be visible to sufficiently
> 		later code executed by the grace-period kthread.
> 		Similar to (1a) above.
>
> 3.	If a given CPU quickly switches into and out of RCU-idle
> 	state, and it is always in RCU-nonidle state whenever the RCU
> 	grace-period kthread looks, RCU must still realize that this
> 	CPU has passed through at least one quiescent state.
>
> 	This is why we have a counter for RCU rather than just a
> 	simple state.

...

> I am not seeing how the third requirement above is met, though.  I have
> not verified the sampling code that the RCU grace-period kthread is
> supposed to use because I am not seeing it right off-hand.

This is ct_rcu_watching_cpu_acquire, right?  Lemme think.  Maybe there's even a way to do everything I'm suggesting without changing that interface.


>> It wouldn't be outrageous to have real-time imply the full USER transition.
>
> We could have special code for RT, but this of course increases the
> complexity.  Which might be justified by sufficient speedup.

I'm thinking that the code would be arranged so that going to user mode in the USER or the INDETERMINATE state would be fully valid and would just have different performance characteristics.  So the only extra complexity here would be the actual logic to choose which state to go to. and...
>
> And yes, many distros enable NO_HZ_FULL by default.  I will refrain from
> suggesting additional static branches.  ;-)

I'm not even suggesting a static branch.  I think the potential performance wins are big enough to justify a bona fide ordinary if statement or two :)  I'm also contemplating whether it could make sense to make this whole thing be unconditionally configured in if the performance in the case where no one uses it (i.e. everything is INDETERMINATE instead of USER) is good enough.

I will ponder.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-11-15  0:29           ` Andy Lutomirski
@ 2025-11-15  2:30             ` Paul E. McKenney
  0 siblings, 0 replies; 48+ messages in thread
From: Paul E. McKenney @ 2025-11-15  2:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Valentin Schneider, Linux Kernel Mailing List, linux-mm, rcu,
	the arch/x86 maintainers, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra (Intel), Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

On Fri, Nov 14, 2025 at 04:29:31PM -0800, Andy Lutomirski wrote:
> 
> 
> On Fri, Nov 14, 2025, at 12:03 PM, Paul E. McKenney wrote:
> > On Fri, Nov 14, 2025 at 10:45:08AM -0800, Andy Lutomirski wrote:
> >> 
> >> 
> >> On Fri, Nov 14, 2025, at 10:14 AM, Paul E. McKenney wrote:
> >> > On Fri, Nov 14, 2025 at 09:22:35AM -0800, Andy Lutomirski wrote:
> >> >> 
> >> 
> >> >> > Oh, any another primitive would be possible: one CPU could plausibly 
> >> >> > execute another CPU's interrupts or soft-irqs or whatever by taking a 
> >> >> > special lock that would effectively pin the remote CPU in user mode -- 
> >> >> > you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
> >> >> > the transition on that CPU to kernel mode would then spin on entry to 
> >> >> > kernel mode and wait for the lock to be released.  This could plausibly 
> >> >> > get a lot of the on_each_cpu callers to switch over in one fell swoop: 
> >> >> > anything that needs to synchronize to the remote CPU but does not need 
> >> >> > to poke its actual architectural state could be executed locally while 
> >> >> > the remote CPU is pinned.
> >> >
> >> > It would be necessary to arrange for the remote CPU to remain pinned
> >> > while the local CPU executed on its behalf.  Does the above approach
> >> > make that happen without re-introducing our current context-tracking
> >> > overhead and complexity?
> >> 
> >> Using the pseudo-implementation farther down, I think this would be like:
> >> 
> >> if (my_state->status == INDETERMINATE) {
> >>    // we were lazy and we never promised to do work atomically.
> >>    atomic_set(&my_state->status, KERNEL);
> >>    this_entry_work = 0;
> >>    /* we are definitely not pinned in this path */
> >> } else {
> >>    // we were not lazy and we promised we would do work atomically
> >>    atomic exchange the entire state to { .work = 0, .status = KERNEL }
> >>    this_entry_work = (whatever we just read);
> >>    if (this_entry_work & PINNED) {
> >>      u32 this_cpu_pin_count = this_cpu_ptr(pin_count);
> >>      while (atomic_read(&this_cpu_pin_count)) {
> >>        cpu_relax();
> >>      }
> >>    }
> >>  }
> >> 
> >> and we'd have something like:
> >> 
> >> bool try_pin_remote_cpu(int cpu)
> >> {
> >>     u32 *remote_pin_count = ...;
> >>     struct fancy_cpu_state *remote_state = ...;
> >>     atomic_inc(remote_pin_count);  // optimistic
> >> 
> >>     // Hmm, we do not want that read to get reordered with the inc, so we probably
> >>     // need a full barrier or seq_cst.  How does Linux spell that?  C++ has atomic::load
> >>     // with seq_cst and maybe the optimizer can do the right thing.  Maybe it's:
> >>     smp_mb__after_atomic();
> >> 
> >>     if (atomic_read(&remote_state->status) == USER) {
> >>       // Okay, it's genuinely pinned.
> >>       return true;
> >> 
> >>       // egads, if this is some arch with very weak ordering,
> >>       // do we need to be concerned that we just took a lock but we
> >>       // just did a relaxed read and therefore a subsequent access
> >>       // that thinks it's locked might appear to precede the load and therefore
> >>       // somehow get surprisingly seen out of order by the target cpu?
> >>       // maybe we wanted atomic_read_acquire above instead?
> >>     } else {
> >>       // We might not have successfully pinned it
> >>       atomic_dec(remote_pin_count);
> >>     }
> >> }
> >> 
> >> void unpin_remote_cpu(int cpu)
> >> {
> >>     atomic_dec(remote_pin_count);
> >> }
> >> 
> >> and we'd use it like:
> >> 
> >> if (try_pin_remote_cpu(cpu)) {
> >>   // do something useful
> >> } else {
> >>   send IPI;
> >> }
> >> 
> >> but we'd really accumulate the set of CPUs that need the IPIs and do them all at once.
> >> 
> >> I ran the theorem prover that lives inside my head on this code using the assumption that the machine is a well-behaved x86 system and it said "yeah, looks like it might be correct".  I trust an actual formalized system or someone like you who is genuinely very good at this stuff much more than I trust my initial impression :)
> >
> > Let's start with requirements, non-traditional though that might be. ;-)
> 
> That's ridiculous! :-)

;-) ;-) ;-)

> > An "RCU idle" CPU is either in deep idle or executing in nohz_full
> > userspace.
> >
> > 1.	If the RCU grace-period kthread sees an RCU-idle CPU, then:
> >
> > 	a.	Everything that this CPU did before entering RCU-idle
> > 		state must be visible to sufficiently later code executed
> > 		by the grace-period kthread, and:
> >
> > 	b.	Everything that the CPU will do after exiting RCU-idle
> > 		state must *not* have been visible to the grace-period
> > 		kthread sufficiently prior to having sampled this
> > 		CPU's state.
> >
> > 2.	If the RCU grace-period kthread sees an RCU-nonidle CPU, then
> > 	it depends on whether this is the same nonidle sojourn as was
> > 	initially seen.  (If the kthread initially saw the CPU in an
> > 	RCU-idle state, it would not have bothered resampling.)
> >
> > 	a.	If this is the same nonidle sojourn, then there are no
> > 		ordering requirements.	RCU must continue to wait on
> > 		this CPU.
> >
> > 	b.	Otherwise, everything that this CPU did before entering
> > 		its last RCU-idle state must be visible to sufficiently
> > 		later code executed by the grace-period kthread.
> > 		Similar to (1a) above.
> >
> > 3.	If a given CPU quickly switches into and out of RCU-idle
> > 	state, and it is always in RCU-nonidle state whenever the RCU
> > 	grace-period kthread looks, RCU must still realize that this
> > 	CPU has passed through at least one quiescent state.
> >
> > 	This is why we have a counter for RCU rather than just a
> > 	simple state.

Sigh...

4.	Entering and exiting either idle or nohz_full usermode execution
	at task level can race with interrupts and NMIs attempting to
	make this same transition.  (And you had much to do with the
	current code that mediates the NMI races!)

> ...
> 
> > I am not seeing how the third requirement above is met, though.  I have
> > not verified the sampling code that the RCU grace-period kthread is
> > supposed to use because I am not seeing it right off-hand.
> 
> This is ct_rcu_watching_cpu_acquire, right?  Lemme think.  Maybe there's even a way to do everything I'm suggesting without changing that interface.

Yes.

Also, the ordering can sometimes be implemented at a distance, but it
really does have to be there.

> >> It wouldn't be outrageous to have real-time imply the full USER transition.
> >
> > We could have special code for RT, but this of course increases the
> > complexity.  Which might be justified by sufficient speedup.
> 
> I'm thinking that the code would be arranged so that going to user mode in the USER or the INDETERMINATE state would be fully valid and would just have different performance characteristics.  So the only extra complexity here would be the actual logic to choose which state to go to. and...
> >
> > And yes, many distros enable NO_HZ_FULL by default.  I will refrain from
> > suggesting additional static branches.  ;-)
> 
> I'm not even suggesting a static branch.  I think the potential performance wins are big enough to justify a bona fide ordinary if statement or two :)  I'm also contemplating whether it could make sense to make this whole thing be unconditionally configured in if the performance in the case where no one uses it (i.e. everything is INDETERMINATE instead of USER) is good enough.

Both sound like they would be nice to have...

> I will ponder.

...give or take feasibility, of course!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
  2025-11-14 15:14 ` [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3 Valentin Schneider
@ 2025-11-19 14:31   ` Andy Lutomirski
  2025-11-19 15:44     ` Valentin Schneider
  0 siblings, 1 reply; 48+ messages in thread
From: Andy Lutomirski @ 2025-11-19 14:31 UTC (permalink / raw)
  To: Valentin Schneider, Linux Kernel Mailing List, linux-mm, rcu,
	the arch/x86 maintainers, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra (Intel), Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik, Shrikanth Hegde



On Fri, Nov 14, 2025, at 7:14 AM, Valentin Schneider wrote:
> Deferring kernel range TLB flushes requires the guarantee that upon
> entering the kernel, no stale entry may be accessed. The simplest way to
> provide such a guarantee is to issue an unconditional flush upon switching
> to the kernel CR3, as this is the pivoting point where such stale entries
> may be accessed.
>

Doing this together with the PTI CR3 switch has no actual benefit: MOV CR3 doesn’t flush global pages. And doing this in asm is pretty gross.  We don’t even get a free sync_core() out of it because INVPCID is not documented as being serializing.

Why can’t we do it in C?  What’s the actual risk?  In order to trip over a stale TLB entry, we would need to dereference a pointer to newly allocated kernel virtual memory that was not valid prior to our entry into user mode. I can imagine BPF doing this, but plain noinstr C in the entry path?  Especially noinstr C *that has RCU disabled*?  We already can’t follow an RCU pointer, and ISTM the only style of kernel code that might do this would use RCU to protect the pointer, and we are already doomed if we follow an RCU pointer to any sort of memory.

We do need to watch out for NMI/MCE hitting before we flush.
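
Purely as an illustration of the C variant (not the series' code: the flag name is made up, and this glosses over the NMI/MCE window and the noinstr/objtool constraints being discussed), the shape would be something like:

static DEFINE_PER_CPU(bool, kernel_tlb_flush_deferred);	/* assumed flag, set by the deferring CPU */

static __always_inline void consume_deferred_kernel_tlb_flush(void)
{
	/* Runs early on entry, before anything that might touch vmalloc'd memory. */
	if (this_cpu_xchg(kernel_tlb_flush_deferred, false))
		__flush_tlb_all();	/* local flush, including global/kernel mappings */
}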

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
  2025-11-19 14:31   ` Andy Lutomirski
@ 2025-11-19 15:44     ` Valentin Schneider
  2025-11-19 17:31       ` Andy Lutomirski
  0 siblings, 1 reply; 48+ messages in thread
From: Valentin Schneider @ 2025-11-19 15:44 UTC (permalink / raw)
  To: Andy Lutomirski, Linux Kernel Mailing List, linux-mm, rcu,
	the arch/x86 maintainers, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra (Intel), Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik, Shrikanth Hegde

On 19/11/25 06:31, Andy Lutomirski wrote:
> On Fri, Nov 14, 2025, at 7:14 AM, Valentin Schneider wrote:
>> Deferring kernel range TLB flushes requires the guarantee that upon
>> entering the kernel, no stale entry may be accessed. The simplest way to
>> provide such a guarantee is to issue an unconditional flush upon switching
>> to the kernel CR3, as this is the pivoting point where such stale entries
>> may be accessed.
>>
>
> Doing this together with the PTI CR3 switch has no actual benefit: MOV CR3 doesn’t flush global pages. And doing this in asm is pretty gross.  We don’t even get a free sync_core() out of it because INVPCID is not documented as being serializing.
>
> Why can’t we do it in C?  What’s the actual risk?  In order to trip over a stale TLB entry, we would need to dereference a pointer to newly allocated kernel virtual memory that was not valid prior to our entry into user mode. I can imagine BPF doing this, but plain noinstr C in the entry path?  Especially noinstr C *that has RCU disabled*?  We already can’t follow an RCU pointer, and ISTM the only style of kernel code that might do this would use RCU to protect the pointer, and we are already doomed if we follow an RCU pointer to any sort of memory.
>

So v4 and earlier had the TLB flush faff done in C in the context_tracking entry
just like sync_core().
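(For illustration, the shape of that was a deferred-work callback run from the
context tracking entry code - names below are illustrative, not the exact v4
symbols:

        /* Run early on the user->kernel transition, once per deferral. */
        static noinstr void ct_deferred_work(unsigned long work)
        {
                /* Deferred text-patching serialization, as per text_poke. */
                if (work & CT_WORK_SYNC)
                        sync_core();

                /* Deferred kernel-range TLB flush; the ranges aren't
                 * recorded, so flush everything, globals included. */
                if (work & CT_WORK_TLB_FLUSH)
                        __flush_tlb_all();
        }

with the sender setting the relevant work bit instead of IPI'ing the CPU.)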

My biggest issue with it was that I couldn't figure out a way to instrument
memory accesses such that I would get an idea of where vmalloc'd accesses
happen - even with a hackish thing just to survey the landscape. So while I
agree with your reasoning wrt entry noinstr code, I don't have any way to
prove it.
That's unlike the text_poke sync_core() deferral for which I have all of
that nice objtool instrumentation.

Dave also pointed out that the whole stale entry flush deferral is a risky
move, and that the sanest thing would be to execute the deferred flush just
after switching to the kernel CR3.

See the thread surrounding:
  https://lore.kernel.org/lkml/20250114175143.81438-30-vschneid@redhat.com/

mainly Dave's reply and subthread:
  https://lore.kernel.org/lkml/352317e3-c7dc-43b4-b4cb-9644489318d0@intel.com/

> We do need to watch out for NMI/MCE hitting before we flush.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
  2025-11-19 15:44     ` Valentin Schneider
@ 2025-11-19 17:31       ` Andy Lutomirski
  2025-11-21 10:12         ` Valentin Schneider
  0 siblings, 1 reply; 48+ messages in thread
From: Andy Lutomirski @ 2025-11-19 17:31 UTC (permalink / raw)
  To: Valentin Schneider, Linux Kernel Mailing List, linux-mm, rcu,
	the arch/x86 maintainers, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra (Intel), Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik, Shrikanth Hegde


On Wed, Nov 19, 2025, at 7:44 AM, Valentin Schneider wrote:
> On 19/11/25 06:31, Andy Lutomirski wrote:
>> On Fri, Nov 14, 2025, at 7:14 AM, Valentin Schneider wrote:
>>> Deferring kernel range TLB flushes requires the guarantee that upon
>>> entering the kernel, no stale entry may be accessed. The simplest way to
>>> provide such a guarantee is to issue an unconditional flush upon switching
>>> to the kernel CR3, as this is the pivoting point where such stale entries
>>> may be accessed.
>>>
>>
>> Doing this together with the PTI CR3 switch has no actual benefit: MOV CR3 doesn’t flush global pages. And doing this in asm is pretty gross.  We don’t even get a free sync_core() out of it because INVPCID is not documented as being serializing.
>>
>> Why can’t we do it in C?  What’s the actual risk?  In order to trip over a stale TLB entry, we would need to dereference a pointer to newly allocated kernel virtual memory that was not valid prior to our entry into user mode. I can imagine BPF doing this, but plain noinstr C in the entry path?  Especially noinstr C *that has RCU disabled*?  We already can’t follow an RCU pointer, and ISTM the only style of kernel code that might do this would use RCU to protect the pointer, and we are already doomed if we follow an RCU pointer to any sort of memory.
>>
>
> So v4 and earlier had the TLB flush faff done in C in the context_tracking entry
> just like sync_core().
>
> My biggest issue with it was that I couldn't figure out a way to instrument
> memory accesses such that I would get an idea of where vmalloc'd accesses
> happen - even with a hackish thing just to survey the landscape. So while I
> agree with your reasoning wrt entry noinstr code, I don't have any way to
> prove it.
> That's unlike the text_poke sync_core() deferral for which I have all of
> that nice objtool instrumentation.
>
> Dave also pointed out that the whole stale entry flush deferral is a risky
> move, and that the sanest thing would be to execute the deferred flush just
> after switching to the kernel CR3.
>
> See the thread surrounding:
>   https://lore.kernel.org/lkml/20250114175143.81438-30-vschneid@redhat.com/
>
> mainly Dave's reply and subthread:
>   https://lore.kernel.org/lkml/352317e3-c7dc-43b4-b4cb-9644489318d0@intel.com/
>
>> We do need to watch out for NMI/MCE hitting before we flush.

I read a decent fraction of that thread.

Let's consider what we're worried about:

1. Architectural access to a kernel virtual address that has been unmapped, in asm or early C.  If it hasn't been remapped, then we oops anyway.  If it has, then that means we're accessing a pointer where either the pointer has changed or the pointee has been remapped while we're in user mode, and that's a very strange thing to do for anything that the asm points to or that early C points to, unless RCU is involved.  But RCU is already disallowed in the entry paths that might be in extended quiescent states, so I think this is mostly a nonissue.

2. Non-speculative access via GDT access, etc.  We can't control this at all, but we're not about to move the GDT, IDT, LDT etc of a running task while that task is in user mode.  We do move the LDT, but that's quite thoroughly synchronized via IPI.  (Should probably be double checked.  I wrote that code, but that doesn't mean I remember it exactly.)

3. Speculative TLB fills.  We can't control this at all.  We have had actual machine checks, on AMD IIRC, due to messing this up.  This is why we can't defer a flush after freeing a page table.

4. Speculative or other nonarchitectural loads.  One would hope that these are not dangerous.  For example, an early version of TDX would machine check if we did a speculative load from TDX memory, but that was fixed.  I don't see why this would be materially different between actual userspace execution (without LASS, anyway), kernel asm, and kernel C.

5. Writes to page table dirty bits.  I don't think we use these.

In any case, the current implementation in your series is really, really, utterly horrifically slow.  It's probably fine for a task that genuinely sits in usermode forever, but I don't think it's likely to be something that we'd be willing to enable for normal kernels and normal tasks.  And it would be really nice for the don't-interrupt-user-code stuff to move toward being always available rather than further from it.


I admit that I'm kind of with dhansen: Zen 3+ can use INVLPGB and doesn't need any of this.  Some Intel CPUs support RAR and will eventually be able to use RAR, possibly even for sync_core().

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH v7 30/31] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y
  2025-11-14 15:14 ` [RFC PATCH v7 30/31] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y Valentin Schneider
@ 2025-11-19 18:31   ` Dave Hansen
  2025-11-19 18:33     ` Andy Lutomirski
  2025-11-21 17:37     ` Valentin Schneider
  0 siblings, 2 replies; 48+ messages in thread
From: Dave Hansen @ 2025-11-19 18:31 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel, linux-mm, rcu, x86,
	linux-arm-kernel, loongarch, linux-riscv, linux-arch,
	linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

On 11/14/25 07:14, Valentin Schneider wrote:
> +static bool flush_tlb_kernel_cond(int cpu, void *info)
> +{
> +	return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
> +	       per_cpu(kernel_cr3_loaded, cpu);
> +}

Is it OK that 'kernel_cr3_loaded' can be stale? Since it's not part
of the instruction that actually sets CR3, there's a window between when
'kernel_cr3_loaded' is set (or cleared) and CR3 is actually written.

Is that OK?

It seems like it could lead to both unnecessary IPIs being sent and for
IPIs to be missed.
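(For context, that callback has the smp_cond_func_t shape taken by
on_each_cpu_cond()/on_each_cpu_cond_mask() to filter IPI targets; the actual
call site isn't quoted here, so the pairing below is an assumption, with
do_kernel_tlb_flush() as a made-up stand-in for the real IPI callback:

        static void do_kernel_tlb_flush(void *info)
        {
                __flush_tlb_all();
        }

        /* CPUs for which the filter returns false get no IPI... */
        on_each_cpu_cond(flush_tlb_kernel_cond, do_kernel_tlb_flush,
                         NULL, true);

...and a CPU whose 'kernel_cr3_loaded' reads 0 gets skipped on the assumption
that it will flush for itself on its next kernel entry.)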

I still _really_ wish folks would be willing to get newer CPUs to get
this behavior rather than going through all this complexity. RAR in
particular was *specifically* designed to keep TLB flushing IPIs from
blipping userspace for too long.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH v7 30/31] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y
  2025-11-19 18:31   ` Dave Hansen
@ 2025-11-19 18:33     ` Andy Lutomirski
  2025-11-21 17:37     ` Valentin Schneider
  1 sibling, 0 replies; 48+ messages in thread
From: Andy Lutomirski @ 2025-11-19 18:33 UTC (permalink / raw)
  To: Dave Hansen, Valentin Schneider, Linux Kernel Mailing List,
	linux-mm, rcu, the arch/x86 maintainers, linux-arm-kernel,
	loongarch, linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra (Intel), Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik, Shrikanth Hegde



On Wed, Nov 19, 2025, at 10:31 AM, Dave Hansen wrote:
> On 11/14/25 07:14, Valentin Schneider wrote:
>> +static bool flush_tlb_kernel_cond(int cpu, void *info)
>> +{
>> +	return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
>> +	       per_cpu(kernel_cr3_loaded, cpu);
>> +}
>
> Is it OK that 'kernel_cr3_loaded' can be stale? Since it's not part
> of the instruction that actually sets CR3, there's a window between when
> 'kernel_cr3_loaded' is set (or cleared) and CR3 is actually written.
>
> Is that OK?
>
> It seems like it could lead to both unnecessary IPIs being sent and for
> IPIs to be missed.

I read the code earlier today and I *think* it’s maybe okay. It’s quite confusing that this thing is split among multiple patches, and the memory ordering issues need comments.

The fact that the big flush is basically unconditional at this point helps. The fact that it’s tangled up with CR3 even though the current implementation has nothing to do with CR3 does not help.

I’m kind of with dhansen though — the fact that the implementation is so nasty coupled with the fact that modern CPUs can do this in hardware makes the whole thing kind of unpalatable.

>
> I still _really_ wish folks would be willing to get newer CPUs to get
> this behavior rather than going through all this complexity. RAR in
> particular was *specifically* designed to keep TLB flushing IPIs from
> blipping userspace for too long.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
  2025-11-19 17:31       ` Andy Lutomirski
@ 2025-11-21 10:12         ` Valentin Schneider
  0 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-11-21 10:12 UTC (permalink / raw)
  To: Andy Lutomirski, Linux Kernel Mailing List, linux-mm, rcu,
	the arch/x86 maintainers, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra (Intel), Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik, Shrikanth Hegde

On 19/11/25 09:31, Andy Lutomirski wrote:
> Let's consider what we're worried about:
>
> 1. Architectural access to a kernel virtual address that has been unmapped, in asm or early C.  If it hasn't been remapped, then we oops anyway.  If it has, then that means we're accessing a pointer where either the pointer has changed or the pointee has been remapped while we're in user mode, and that's a very strange thing to do for anything that the asm points to or that early C points to, unless RCU is involved.  But RCU is already disallowed in the entry paths that might be in extended quiescent states, so I think this is mostly a nonissue.
>
> 2. Non-speculative access via GDT access, etc.  We can't control this at all, but we're not about to move the GDT, IDT, LDT etc of a running task while that task is in user mode.  We do move the LDT, but that's quite thoroughly synchronized via IPI.  (Should probably be double checked.  I wrote that code, but that doesn't mean I remember it exactly.)
>
> 3. Speculative TLB fills.  We can't control this at all.  We have had actual machine checks, on AMD IIRC, due to messing this up.  This is why we can't defer a flush after freeing a page table.
>
> 4. Speculative or other nonarchitectural loads.  One would hope that these are not dangerous.  For example, an early version of TDX would machine check if we did a speculative load from TDX memory, but that was fixed.  I don't see why this would be materially different between actual userspace execution (without LASS, anyway), kernel asm, and kernel C.
>
> 5. Writes to page table dirty bits.  I don't think we use these.
>
> In any case, the current implementation in your series is really, really,
> utterly horrifically slow.

Quite so :-)

> It's probably fine for a task that genuinely sits in usermode forever,
> but I don't think it's likely to be something that we'd be willing to
> enable for normal kernels and normal tasks.  And it would be really nice
> for the don't-interrupt-user-code stuff to move toward being always
> available rather than further from it.
>

Well, following Frederic's suggestion of using the "is NOHZ_FULL actually in
use" static key in the ASM bits, none of the ugly bits get involved unless
you do have 'nohz_full=' on the cmdline - not perfect, but it's something.
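(Conceptually the asm gate is the counterpart of the existing C-side check -
sketch only, the exact key/macro the asm ends up testing is whatever the
series wires up:

        /* nohz_full= absent: static branch patched out, nothing to do. */
        if (!context_tracking_enabled())
                return;
        /* ...only then bother with the deferred-flush bits... */
)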

RHEL kernels ship with NO_HZ_FULL=y [1], so we do care about that not impacting
performance too much if it's just compiled-in and not actually used.

[1]: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-10/-/blob/main/redhat/configs/common/generic/CONFIG_NO_HZ_FULL
>
> I admit that I'm kind of with dhansen: Zen 3+ can use INVLPGB and doesn't
> need any of this.  Some Intel CPUs support RAR and will eventually be
> able to use RAR, possibly even for sync_core().

Yeah that INVLPGB thing looks really nice, and AFAICT arm64 is similarly
covered with TLBI VMALLE1IS.

My goal here is to poke around and find out what's the minimal amount of
ugly we can get away with to suppress those IPIs on existing fleets, but
there's still too much ugly :/


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH v7 30/31] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y
  2025-11-19 18:31   ` Dave Hansen
  2025-11-19 18:33     ` Andy Lutomirski
@ 2025-11-21 17:37     ` Valentin Schneider
  2025-11-21 17:50       ` Dave Hansen
  1 sibling, 1 reply; 48+ messages in thread
From: Valentin Schneider @ 2025-11-21 17:37 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, linux-mm, rcu, x86, linux-arm-kernel,
	loongarch, linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

On 19/11/25 10:31, Dave Hansen wrote:
> On 11/14/25 07:14, Valentin Schneider wrote:
>> +static bool flush_tlb_kernel_cond(int cpu, void *info)
>> +{
>> +	return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
>> +	       per_cpu(kernel_cr3_loaded, cpu);
>> +}
>
> Is it OK that 'kernel_cr3_loaded' can be stale? Since it's not part
> of the instruction that actually sets CR3, there's a window between when
> 'kernel_cr3_loaded' is set (or cleared) and CR3 is actually written.
>
> Is that OK?
>
> It seems like it could lead to both unnecessary IPIs being sent and for
> IPIs to be missed.
>

So the pattern is

  SWITCH_TO_KERNEL_CR3
  FLUSH
  KERNEL_CR3_LOADED := 1

  KERNEL_CR3_LOADED := 0
  SWITCH_TO_USER_CR3


The 0 -> 1 transition has a window between the unconditional flush and the
write to 1 where a remote flush IPI may be omitted. Given that the write is
immediately following the unconditional flush, that would really be just
two flushes racing with each other, but I could punt the kernel_cr3_loaded
write above the unconditional flush.

The 1 -> 0 transition is less problematic, worst case a remote flush races
with the CPU returning to userspace and it'll get interrupted back to
kernelspace.
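
(For the 0 -> 1 side that reordering would look something like:

  SWITCH_TO_KERNEL_CR3
  KERNEL_CR3_LOADED := 1
  <barrier>
  FLUSH

the intent being that any sender which still read 0 made its page table
update early enough for the unconditional flush to observe it - the exact
ordering needed for that is the subtle part.)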


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH v7 30/31] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y
  2025-11-21 17:37     ` Valentin Schneider
@ 2025-11-21 17:50       ` Dave Hansen
  0 siblings, 0 replies; 48+ messages in thread
From: Dave Hansen @ 2025-11-21 17:50 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel, linux-mm, rcu, x86,
	linux-arm-kernel, loongarch, linux-riscv, linux-arch,
	linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde

On 11/21/25 09:37, Valentin Schneider wrote:
> On 19/11/25 10:31, Dave Hansen wrote:
>> On 11/14/25 07:14, Valentin Schneider wrote:
>>> +static bool flush_tlb_kernel_cond(int cpu, void *info)
>>> +{
>>> +	return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
>>> +	       per_cpu(kernel_cr3_loaded, cpu);
>>> +}
>>
>> Is it OK that 'kernel_cr3_loaded' can be stale? Since it's not part
>> of the instruction that actually sets CR3, there's a window between when
>> 'kernel_cr3_loaded' is set (or cleared) and CR3 is actually written.
>>
>> Is that OK?
>>
>> It seems like it could lead to both unnecessary IPIs being sent and for
>> IPIs to be missed.
>>
> 
> So the pattern is
> 
>   SWITCH_TO_KERNEL_CR3
>   FLUSH
>   KERNEL_CR3_LOADED := 1
> 
>   KERNEL_CR3_LOADED := 0
>   SWITCH_TO_USER_CR3
> 
> 
> The 0 -> 1 transition has a window between the unconditional flush and the
> write to 1 where a remote flush IPI may be omitted. Given that the write is
> immediately following the unconditional flush, that would really be just
> two flushes racing with each other,

Let me fix that for you. When you wrote "a remote flush IPI may be
omitted" you meant to write: "there's a bug." ;)

In the end, KERNEL_CR3_LOADED==0 means, "you don't need to send this CPU
flushing IPIs because it will flush the TLB itself before touching
memory that needs a flush".

   SWITCH_TO_KERNEL_CR3
   FLUSH
   // On kernel CR3, *AND* not getting IPIs
   KERNEL_CR3_LOADED := 1

> but I could punt the kernel_cr3_loaded
> write above the unconditional flush.

Yes, that would eliminate the window, as long as the memory ordering is
right. You not only need to have the KERNEL_CR3_LOADED:=1 CPU set that
variable, you need to ensure that it has seen the page table update.

> The 1 -> 0 transition is less problematic, worst case a remote flush races
> with the CPU returning to userspace and it'll get interrupted back to
> kernelspace.

It's also not just "returning to userspace". It could well be *in*
userspace by the time the IPI shows up. It's not the end of the world,
and the window isn't infinitely long. But there certainly is still a
possibility of getting spurious interrupts for the precious NOHZ_FULL
task while it's in userspace.
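
(In C-ish terms the requirement amounts to a store-then-flush on the entry
side pairing with an update-then-test on the sender side - a sketch only,
with smp_mb() standing in for whatever barrier/serialization turns out to be
sufficient:

        /* NOHZ_FULL CPU, kernel entry, right after SWITCH_TO_KERNEL_CR3: */
        this_cpu_write(kernel_cr3_loaded, 1);
        smp_mb();               /* flag store ordered before the flush      */
        __flush_tlb_all();      /* must observe any PTE change whose sender
                                 * sampled kernel_cr3_loaded == 0           */

        /* Housekeeping CPU, after changing kernel page tables: */
        smp_mb();               /* publish the PTE change before sampling
                                 * the per-CPU flags                        */
        /* ...then IPI only the CPUs whose flag reads 1 */
)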

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2025-11-21 17:50 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 01/31] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 02/31] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 03/31] rcu: Add a small-width RCU watching counter debug option Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 04/31] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 05/31] jump_label: Add annotations for validating noinstr usage Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 06/31] static_call: Add read-only-after-init static calls Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 07/31] x86/paravirt: Mark pv_sched_clock static call as __ro_after_init Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 08/31] x86/idle: Mark x86_idle " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 09/31] x86/paravirt: Mark pv_steal_clock " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 10/31] riscv/paravirt: " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 11/31] loongarch/paravirt: " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 12/31] arm64/paravirt: " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 13/31] arm/paravirt: " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 14/31] perf/x86/amd: Mark perf_lopwr_cb " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 15/31] sched/clock: Mark sched_clock_running key " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 16/31] KVM: VMX: Mark __kvm_is_using_evmcs static " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 17/31] x86/bugs: Mark cpu_buf_vm_clear key as allowed in .noinstr Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 18/31] x86/speculation/mds: Mark cpu_buf_idle_clear " Valentin Schneider
2025-11-14 15:10 ` [PATCH v7 19/31] sched/clock, x86: Mark __sched_clock_stable " Valentin Schneider
2025-11-14 15:10 ` [PATCH v7 20/31] KVM: VMX: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys " Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 21/31] stackleack: Mark stack_erasing_bypass key " Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 22/31] objtool: Add noinstr validation for static branches/calls Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 23/31] module: Add MOD_NOINSTR_TEXT mem_type Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 24/31] context-tracking: Introduce work deferral infrastructure Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 25/31] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 26/31] x86/jump_label: Add ASM support for static_branch_likely() Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 27/31] x86/mm: Make INVPCID type macros available to assembly Valentin Schneider
2025-11-14 15:14 ` [RFC PATCH v7 28/31] x86/mm/pti: Introduce a kernel/user CR3 software signal Valentin Schneider
2025-11-14 15:14 ` [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3 Valentin Schneider
2025-11-19 14:31   ` Andy Lutomirski
2025-11-19 15:44     ` Valentin Schneider
2025-11-19 17:31       ` Andy Lutomirski
2025-11-21 10:12         ` Valentin Schneider
2025-11-14 15:14 ` [RFC PATCH v7 30/31] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y Valentin Schneider
2025-11-19 18:31   ` Dave Hansen
2025-11-19 18:33     ` Andy Lutomirski
2025-11-21 17:37     ` Valentin Schneider
2025-11-21 17:50       ` Dave Hansen
2025-11-14 15:14 ` [RFC PATCH v7 31/31] x86/entry: Add an option to coalesce TLB flushes Valentin Schneider
2025-11-14 16:20 ` [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Andy Lutomirski
2025-11-14 17:22   ` Andy Lutomirski
2025-11-14 18:14     ` Paul E. McKenney
2025-11-14 18:45       ` Andy Lutomirski
2025-11-14 20:03         ` Paul E. McKenney
2025-11-15  0:29           ` Andy Lutomirski
2025-11-15  2:30             ` Paul E. McKenney
2025-11-14 20:06         ` Thomas Gleixner
