* [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

Context
=======

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

  64359.052209596    NetworkManager       0    1405     smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
    smp_call_function_many_cond+0x1
    smp_call_function+0x39
    on_each_cpu+0x2a
    flush_tlb_kernel_range+0x7b
    __purge_vmap_area_lazy+0x70
    _vm_unmap_aliases.part.42+0xdf
    change_page_attr_set_clr+0x16a
    set_memory_ro+0x26
    bpf_int_jit_compile+0x2f9
    bpf_prog_select_runtime+0xc6
    bpf_prepare_filter+0x523
    sk_attach_filter+0x13
    sock_setsockopt+0x92c
    __sys_setsockopt+0x16a
    __x64_sys_setsockopt+0x20
    do_syscall_64+0x87
    entry_SYSCALL_64_after_hwframe+0x65

The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, provided it can be executed "early
enough" in the entry code.

The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.

Deferral approach
=================

Storing each and every callback, like a secondary call_single_queue, turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible, so no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
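
To make this concrete, here is a minimal sketch of the coalescing scheme, with
simplified made-up names (the series actually packs the pending-work bits into
context_tracking.state; see the context_tracking patches for the real
interface and its ordering guarantees):

  enum ct_work {
          CT_WORK_SYNC_CORE,      /* deferred kernel text synchronization */
          CT_WORK_TLB_FLUSH,      /* deferred kernel range TLB flush */
          CT_WORK_MAX,
  };

  static DEFINE_PER_CPU(unsigned long, ct_work_pending);

  /* Housekeeping CPU: record the IPI type instead of sending the IPI. */
  static void ct_work_defer(int cpu, enum ct_work work)
  {
          set_bit(work, per_cpu_ptr(&ct_work_pending, cpu));
  }

  /* Target CPU: run each deferred work type once on the next kernel entry. */
  static void ct_work_flush(void)
  {
          unsigned long *pending = this_cpu_ptr(&ct_work_pending);

          if (test_and_clear_bit(CT_WORK_SYNC_CORE, pending))
                  sync_core();
          if (test_and_clear_bit(CT_WORK_TLB_FLUSH, pending))
                  __flush_tlb_all();
  }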

Kernel entry vs execution of the deferred operation
===================================================

This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].

There is a non-zero length of code that is executed upon kernel entry before
the deferred operation can itself be executed (before we start getting into
context_tracking.c proper), i.e.:

  idtentry_func_foo()                <--- we're in the kernel
    irqentry_enter()
      irqentry_enter_from_user_mode()
	enter_from_user_mode()
	  [...]
	    ct_kernel_enter_state()
	      ct_work_flush()        <--- deferred operation is executed here

This means one must take extra care with what can happen in the early entry
code, and ensure that <bad things> cannot happen. For instance, we really don't
want to hit instructions that have been modified by a remote text_poke() while
we're on our way to execute a deferred sync_core(). Patches doing the actual
deferral have more detail on this.
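
For illustration, the text patching side then takes a shape along these lines
(pseudo-helpers, not the actual functions from the patches):

  /* Send the sync_core() IPIs right away, or defer them per target CPU. */
  static void text_poke_sync_deferrable(void)
  {
          /*
           * Patched noinstr text can run before ct_work_flush() on kernel
           * entry, so deferral is not safe there: IPI every CPU right away.
           */
          if (patching_noinstr_text())
                  on_each_cpu(do_sync_core, NULL, 1);
          else
                  /* CPUs skipped by the condition get a deferred-work bit
                     set instead (elided here). */
                  on_each_cpu_cond(cpu_in_kernel, do_sync_core, NULL, 1);
  }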

The annoying one: TLB flush deferral
====================================

While leveraging the context_tracking subsystem works for deferring things like
kernel text synchronization, it falls apart when it comes to kernel range TLB
flushes. Consider the following execution flow:

  <userspace>
  
  !interrupt!

  SWITCH_TO_KERNEL_CR3        <--- vmalloc range becomes accessible

  idtentry_func_foo()
    irqentry_enter()
      irqentry_enter_from_user_mode()
	enter_from_user_mode()
	  [...]
	    ct_kernel_enter_state()
	      ct_work_flush() <--- deferred flush would be done here


Since there is no sane way to assert that no stale entry is accessed during
kernel entry, any code executed between SWITCH_TO_KERNEL_CR3 and
ct_work_flush() is at risk of accessing a stale entry.

Dave had suggested hacking up something within SWITCH_TO_KERNEL_CR3 itself,
which is what has been implemented in the new RFC patches.
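
In C terms, the new assembly boils down to roughly the following (simplified
names; the real thing is a handful of instructions in SWITCH_TO_KERNEL_CR3,
flushing via INVPCID or the open-coded CR4 write mentioned below):

  /* Pseudo-C rendition of the entry-side assembly. */
  noinstr void switch_to_kernel_cr3(unsigned long kernel_cr3)
  {
          native_write_cr3(kernel_cr3);
          this_cpu_write(kernel_cr3_loaded, 1);

          /* A remote CPU may have deferred a kernel range TLB flush for us. */
          if (this_cpu_read(tlb_flush_deferred)) {
                  this_cpu_write(tlb_flush_deferred, 0);
                  invpcid_flush_all();    /* flush everything, globals included */
          }
  }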

How bad is it?
==============

Code
++++

I'm happy that the COALESCE_TLBI asm code fits in ~half a screen,
although it open-codes native_write_cr4() without the pinning logic.

I hate the kernel_cr3_loaded signal; it's a kludgy duplicate of
context_tracking.state, but I need *some* sort of signal to drive the TLB flush
deferral, and the context_tracking.state one is set too late in kernel entry. I
couldn't find any fitting existing signal for this.
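
Roughly, the flush path consumes that signal like so (simplified, with
ordering and atomicity concerns elided):

  /* Only IPI CPUs that are actually running with the kernel CR3 loaded. */
  static bool cpu_needs_flush_ipi(int cpu)
  {
          if (per_cpu(kernel_cr3_loaded, cpu))
                  return true;

          /* Userspace CPU: let SWITCH_TO_KERNEL_CR3 do the flush instead. */
          per_cpu(tlb_flush_deferred, cpu) = true;
          return false;
  }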

I'm also unhappy about introducing two different IPI deferral mechanisms. I
tried shoving the text_poke_sync() into SWITCH_TO_KERNEL_CR3, but it got
ugly(er) really fast.

Performance
+++++++++++

Tested by measuring the duration of 10M `syscall(SYS_getpid)` calls on
NOHZ_FULL CPUs, with rteval (hackbench + kernel compilation) running on the
housekeeping CPUs:

o Xeon E5-2699:   base avg 770ns,  patched avg 1340ns (74% increase)
o Xeon E7-8890:   base avg 1040ns, patched avg 1320ns (27% increase)
o Xeon Gold 6248: base avg 270ns,  patched avg 273ns  (1.1% increase)

I don't get that last one; I did spend a ridiculous amount of time making sure
the flush was being executed, and AFAICT it was. What I take away from this is
that this can be a pretty massive increase in the entry overhead (for
NOHZ_FULL CPUs), and that's something I want to hear thoughts on.
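
For reference, the measurement loop is essentially the following (a
reconstruction of the idea, not the exact test harness):

  #include <stdio.h>
  #include <sys/syscall.h>
  #include <time.h>
  #include <unistd.h>

  int main(void)
  {
          const long iters = 10 * 1000 * 1000;
          struct timespec t0, t1;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (long i = 0; i < iters; i++)
                  syscall(SYS_getpid);
          clock_gettime(CLOCK_MONOTONIC, &t1);

          printf("avg %.1f ns/syscall\n",
                 ((t1.tv_sec - t0.tv_sec) * 1e9 +
                  (t1.tv_nsec - t0.tv_nsec)) / iters);
          return 0;
  }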

Noise
+++++

Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL10 userspace.

Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:

$ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
	           -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
		   rteval --onlyload --loads-cpulist=$HK_CPUS \
		   --hackbench-runlowmem=True --duration=$DURATION

This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 6 hours.

v6.17
o ~5400 IPIs received, so about ~200 interfering IPIs per isolated CPU
o About one interfering IPI just shy of every 2 minutes

v6.17 + patches
o Zilch!

Patches
=======

o Patches 1-2 are standalone objtool cleanups.

o Patches 3-4 add an RCU testing feature.

o Patches 5-6 add infrastructure for annotating static keys and static calls
  that may be used in noinstr code (courtesy of Josh).
o Patches 7-20 use said annotations on relevant keys / calls.
o Patch 21 enforces proper usage of said annotations (courtesy of Josh).

o Patch 22 deals with detecting NOINSTR text in modules.

o Patches 23-24 deal with kernel text sync IPIs.

o Patches 25-29 deal with kernel range TLB flush IPIs.

Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v6

Acknowledgements
================

Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o Dave Hansen for patiently educating me about mm
o All of the folks who attended various (too many?) talks about this and
  provided precious feedback.  

Links
=====

[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: http://lore.kernel.org/r/eef09bdc-7546-462b-9ac0-661a44d2ceae@intel.com
[6]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/

Revisions
=========

v5 -> v6
++++++++

o Rebased onto v6.17
o Small conflict fixes due to the cpu_buf_idle_clear and smp_text_poke() renames

o Added the TLB flush craziness

v4 -> v5
++++++++

o Rebased onto v6.15-rc3
o Collected Reviewed-by

o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such as
  KVM early entry (Sean Christopherson)

o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or
  CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI
  entry from idle (thanks to Frederic!)

o Ditched the vmap TLB flush deferral (for now)  
  

RFCv3 -> v4
+++++++++++

o Rebased onto v6.13-rc6

o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)

o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups

o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ

RFCv2 -> RFCv3
++++++++++++++

o Rebased onto v6.12-rc6

o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral


RFCv1 -> RFCv2
++++++++++++++

o Rebased onto v6.5-rc1

o Updated the trace filter patches (Steven)

o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
  
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
  rcutorture case for a low-size counter (Paul) 

o Fixed flush_tlb_kernel_range_deferrable() definition

Josh Poimboeuf (3):
  jump_label: Add annotations for validating noinstr usage
  static_call: Add read-only-after-init static calls
  objtool: Add noinstr validation for static branches/calls

Valentin Schneider (26):
  objtool: Make validate_call() recognize indirect calls to pv_ops[]
  objtool: Flesh out warning related to pv_ops[] calls
  rcu: Add a small-width RCU watching counter debug option
  rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
  x86/idle: Mark x86_idle static call as __ro_after_init
  x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
  riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
  loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
  perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
  sched/clock: Mark sched_clock_running key as __ro_after_init
  KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
  x86/speculation/mds: Mark cpu_buf_idle_clear key as allowed in
    .noinstr
  sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
  KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as
    allowed in .noinstr
  stackleak: Mark stack_erasing_bypass key as allowed in .noinstr
  module: Add MOD_NOINSTR_TEXT mem_type
  context-tracking: Introduce work deferral infrastructure
  context_tracking,x86: Defer kernel text patching IPIs
  x86/mm: Make INVPCID type macros available to assembly
  x86/mm/pti: Introduce a kernel/user CR3 software signal
  x86/mm/pti: Implement a TLB flush immediately after a switch to kernel
    CR3
  x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under
    CONFIG_COALESCE_TLBI=y
  x86/entry: Add an option to coalesce TLB flushes

 arch/Kconfig                                  |   9 ++
 arch/arm/kernel/paravirt.c                    |   2 +-
 arch/arm64/kernel/paravirt.c                  |   2 +-
 arch/loongarch/kernel/paravirt.c              |   2 +-
 arch/riscv/kernel/paravirt.c                  |   2 +-
 arch/x86/Kconfig                              |  18 +++
 arch/x86/entry/calling.h                      |  36 ++++++
 arch/x86/entry/syscall_64.c                   |   4 +
 arch/x86/events/amd/brs.c                     |   2 +-
 arch/x86/include/asm/context_tracking_work.h  |  18 +++
 arch/x86/include/asm/invpcid.h                |  14 ++-
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/include/asm/tlbflush.h               |   6 +
 arch/x86/kernel/alternative.c                 |  39 ++++++-
 arch/x86/kernel/asm-offsets.c                 |   1 +
 arch/x86/kernel/cpu/bugs.c                    |   2 +-
 arch/x86/kernel/kprobes/core.c                |   4 +-
 arch/x86/kernel/kprobes/opt.c                 |   4 +-
 arch/x86/kernel/module.c                      |   2 +-
 arch/x86/kernel/paravirt.c                    |   4 +-
 arch/x86/kernel/process.c                     |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  11 +-
 arch/x86/kvm/vmx/vmx_onhyperv.c               |   2 +-
 arch/x86/mm/tlb.c                             |  34 ++++--
 include/asm-generic/sections.h                |  15 +++
 include/linux/context_tracking.h              |  21 ++++
 include/linux/context_tracking_state.h        |  54 +++++++--
 include/linux/context_tracking_work.h         |  26 +++++
 include/linux/jump_label.h                    |  30 ++++-
 include/linux/module.h                        |   6 +-
 include/linux/objtool.h                       |   7 ++
 include/linux/static_call.h                   |  19 ++++
 kernel/context_tracking.c                     |  69 +++++++++++-
 kernel/kprobes.c                              |   8 +-
 kernel/kstack_erase.c                         |   6 +-
 kernel/module/main.c                          |  76 ++++++++++---
 kernel/rcu/Kconfig.debug                      |  15 +++
 kernel/sched/clock.c                          |   7 +-
 kernel/time/Kconfig                           |   5 +
 mm/vmalloc.c                                  |  34 +++++-
 tools/objtool/Documentation/objtool.txt       |  34 ++++++
 tools/objtool/check.c                         | 106 +++++++++++++++---
 tools/objtool/include/objtool/check.h         |   1 +
 tools/objtool/include/objtool/elf.h           |   1 +
 tools/objtool/include/objtool/special.h       |   1 +
 tools/objtool/special.c                       |  15 ++-
 .../selftests/rcutorture/configs/rcu/TREE04   |   1 +
 47 files changed, 682 insertions(+), 96 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

--
2.51.0



* [PATCH v6 01/29] objtool: Make validate_call() recognize indirect calls to pv_ops[]
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

call_dest_name() does not get passed the file pointer of validate_call(),
which means its invocation of insn_reloc() will always return NULL. Make it
take a file pointer.

While at it, make sure call_dest_name() uses arch_dest_reloc_offset(),
otherwise it gets the pv_ops[] offset wrong.

Fabricating an intentional warning shows the change; previously:

  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section

now:

  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[1]() leaves .noinstr.text section

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 tools/objtool/check.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index d14f20ef1db13..5f574ab163cb2 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3323,7 +3323,7 @@ static inline bool func_uaccess_safe(struct symbol *func)
 	return false;
 }
 
-static inline const char *call_dest_name(struct instruction *insn)
+static inline const char *call_dest_name(struct objtool_file *file, struct instruction *insn)
 {
 	static char pvname[19];
 	struct reloc *reloc;
@@ -3332,9 +3332,9 @@ static inline const char *call_dest_name(struct instruction *insn)
 	if (insn_call_dest(insn))
 		return insn_call_dest(insn)->name;
 
-	reloc = insn_reloc(NULL, insn);
+	reloc = insn_reloc(file, insn);
 	if (reloc && !strcmp(reloc->sym->name, "pv_ops")) {
-		idx = (reloc_addend(reloc) / sizeof(void *));
+		idx = (arch_dest_reloc_offset(reloc_addend(reloc)) / sizeof(void *));
 		snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx);
 		return pvname;
 	}
@@ -3413,17 +3413,19 @@ static int validate_call(struct objtool_file *file,
 {
 	if (state->noinstr && state->instr <= 0 &&
 	    !noinstr_call_dest(file, insn, insn_call_dest(insn))) {
-		WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn));
 		return 1;
 	}
 
 	if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) {
-		WARN_INSN(insn, "call to %s() with UACCESS enabled", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() with UACCESS enabled",
+			  call_dest_name(file, insn));
 		return 1;
 	}
 
 	if (state->df) {
-		WARN_INSN(insn, "call to %s() with DF set", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() with DF set",
+			  call_dest_name(file, insn));
 		return 1;
 	}
 
-- 
2.51.0



* [PATCH v6 02/29] objtool: Flesh out warning related to pv_ops[] calls
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

I had to look into objtool itself to understand what this warning was
about; make it more explicit.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 tools/objtool/check.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 5f574ab163cb2..fa9e64a38b2b6 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3361,7 +3361,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
 
 	list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
 		if (!target->sec->noinstr) {
-			WARN("pv_ops[%d]: %s", idx, target->name);
+			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
 			file->pv_ops[idx].clean = false;
 		}
 	}
-- 
2.51.0



* [PATCH v6 03/29] rcu: Add a small-width RCU watching counter debug option
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Paul E. McKenney, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

A later commit will reduce the size of the RCU watching counter to free up
some bits for another purpose. Paul suggested adding a config option to
test the extreme case where the counter is reduced to its minimum usable
width for rcutorture to poke at, so do that.

Make it only configurable under RCU_EXPERT. While at it, add a comment to
explain the layout of context_tracking->state.

Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/context_tracking_state.h | 44 ++++++++++++++++++++++----
 kernel/rcu/Kconfig.debug               | 15 +++++++++
 2 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 7b8433d5a8efe..0b81248aa03e2 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -18,12 +18,6 @@ enum ctx_state {
 	CT_STATE_MAX		= 4,
 };
 
-/* Odd value for watching, else even. */
-#define CT_RCU_WATCHING CT_STATE_MAX
-
-#define CT_STATE_MASK (CT_STATE_MAX - 1)
-#define CT_RCU_WATCHING_MASK (~CT_STATE_MASK)
-
 struct context_tracking {
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 	/*
@@ -44,9 +38,45 @@ struct context_tracking {
 #endif
 };
 
+/*
+ * We cram two different things within the same atomic variable:
+ *
+ *                     CT_RCU_WATCHING_START  CT_STATE_START
+ *                                |                |
+ *                                v                v
+ *     MSB [ RCU watching counter ][ context_state ] LSB
+ *         ^                       ^
+ *         |                       |
+ * CT_RCU_WATCHING_END        CT_STATE_END
+ *
+ * Bits are used from the LSB upwards, so unused bits (if any) will always be in
+ * upper bits of the variable.
+ */
 #ifdef CONFIG_CONTEXT_TRACKING
+#define CT_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
+
+#define CT_STATE_WIDTH bits_per(CT_STATE_MAX - 1)
+#define CT_STATE_START 0
+#define CT_STATE_END   (CT_STATE_START + CT_STATE_WIDTH - 1)
+
+#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH)
+#define CT_RCU_WATCHING_WIDTH     (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH)
+#define CT_RCU_WATCHING_START     (CT_STATE_END + 1)
+#define CT_RCU_WATCHING_END       (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1)
+#define CT_RCU_WATCHING           BIT(CT_RCU_WATCHING_START)
+
+#define CT_STATE_MASK        GENMASK(CT_STATE_END,        CT_STATE_START)
+#define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
+
+#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
+
+static_assert(CT_STATE_WIDTH        +
+	      CT_RCU_WATCHING_WIDTH +
+	      CT_UNUSED_WIDTH       ==
+	      CT_SIZE);
+
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
-#endif
+#endif	/* CONFIG_CONTEXT_TRACKING */
 
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 static __always_inline int __ct_state(void)
diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 12e4c64ebae15..625d75392647b 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -213,4 +213,19 @@ config RCU_STRICT_GRACE_PERIOD
 	  when looking for certain types of RCU usage bugs, for example,
 	  too-short RCU read-side critical sections.
 
+
+config RCU_DYNTICKS_TORTURE
+	bool "Minimize RCU dynticks counter size"
+	depends on RCU_EXPERT && !COMPILE_TEST
+	default n
+	help
+	  This option sets the width of the dynticks counter to its
+	  minimum usable value.  This minimum width greatly increases
+	  the probability of flushing out bugs involving counter wrap,
+	  but it also increases the probability of extending grace period
+	  durations.  This Kconfig option should therefore be avoided in
+	  production due to the consequent increased probability of OOMs.
+
+	  This has no value for production and is only for testing.
+
 endmenu # "RCU Debugging"
-- 
2.51.0



* [PATCH v6 04/29] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Paul E. McKenney, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

We now have an RCU_EXPERT config for testing small-sized RCU dynticks
counter:  CONFIG_RCU_DYNTICKS_TORTURE.

Modify scenario TREE04 to use this config in order to test a ridiculously
small counter (2 bits).

Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
 tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3ad..67caf4276bb01 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
 CONFIG_RCU_EQS_DEBUG=y
 CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_TORTURE=y
-- 
2.51.0



* [PATCH v6 05/29] jump_label: Add annotations for validating noinstr usage
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

From: Josh Poimboeuf <jpoimboe@kernel.org>

Deferring a code patching IPI is unsafe if the patched code is in a
noinstr region.  In that case the text poke code must trigger an
immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ
CPU running in userspace.

Some noinstr static branches may really need to be patched at runtime,
despite the resulting disruption.  Add DEFINE_STATIC_KEY_*_NOINSTR()
variants for those.  They don't do anything special yet; that will come
later.
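
For illustration, usage would look something like this (hypothetical example,
not part of this patch):

  /* Flipped at runtime (e.g. via sysfs), read from noinstr entry code. */
  DEFINE_STATIC_KEY_FALSE_NOINSTR(mitigation_enabled);

  noinstr void entry_mitigation(void)
  {
          if (static_branch_unlikely(&mitigation_enabled))
                  native_wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
  }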

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/jump_label.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index fdb79dd1ebd8c..c4f6240ff4d95 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -388,6 +388,23 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_FALSE_RO(name)	\
 	struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT
 
+/*
+ * The _NOINSTR variants are used to tell objtool the static key is allowed to
+ * be used in noinstr code.
+ *
+ * They should almost never be used, as they prevent code patching IPIs from
+ * being deferred, which can be problematic for isolated NOHZ_FULL CPUs running
+ * in pure userspace.
+ *
+ * If using one of these _NOINSTR variants, please add a comment above the
+ * definition with the rationale.
+ */
+#define DEFINE_STATIC_KEY_TRUE_NOINSTR(name)					\
+	DEFINE_STATIC_KEY_TRUE(name)
+
+#define DEFINE_STATIC_KEY_FALSE_NOINSTR(name)					\
+	DEFINE_STATIC_KEY_FALSE(name)
+
 #define DECLARE_STATIC_KEY_FALSE(name)	\
 	extern struct static_key_false name
 
-- 
2.51.0



* [PATCH v6 06/29] static_call: Add read-only-after-init static calls
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

From: Josh Poimboeuf <jpoimboe@kernel.org>

Deferring a code patching IPI is unsafe if the patched code is in a
noinstr region.  In that case the text poke code must trigger an
immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ
CPU running in userspace.

If a noinstr static call only needs to be patched during boot, its key
can be made ro-after-init to ensure it will never be patched at runtime.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/static_call.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 78a77a4ae0ea8..ea6ca57e2a829 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -192,6 +192,14 @@ extern long __static_call_return0(void);
 	};								\
 	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
 
+#define DEFINE_STATIC_CALL_RO(name, _func)				\
+	DECLARE_STATIC_CALL(name, _func);				\
+	struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
+		.func = _func,						\
+		.type = 1,						\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
+
 #define DEFINE_STATIC_CALL_NULL(name, _func)				\
 	DECLARE_STATIC_CALL(name, _func);				\
 	struct static_call_key STATIC_CALL_KEY(name) = {		\
@@ -200,6 +208,14 @@ extern long __static_call_return0(void);
 	};								\
 	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
 
+#define DEFINE_STATIC_CALL_NULL_RO(name, _func)				\
+	DECLARE_STATIC_CALL(name, _func);				\
+	struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
+		.func = NULL,						\
+		.type = 1,						\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
+
 #define DEFINE_STATIC_CALL_RET0(name, _func)				\
 	DECLARE_STATIC_CALL(name, _func);				\
 	struct static_call_key STATIC_CALL_KEY(name) = {		\
-- 
2.51.0



* [PATCH v6 07/29] x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

Later commits will cause objtool to warn about static calls being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

pv_sched_clock is updated in:
o __init vmware_paravirt_ops_setup()
o __init xen_init_time_common()
o kvm_sched_clock_init() <- __init kvmclock_init()
o hv_setup_sched_clock() <- __init hv_init_tsc_clocksource()

IOW purely init context, and can thus be marked as __ro_after_init.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ab3e172dcc693..34b6fa3fcc045 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -69,7 +69,7 @@ static u64 native_steal_clock(int cpu)
 }
 
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
-DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
+DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
 
 void paravirt_set_sched_clock(u64 (*func)(void))
 {
-- 
2.51.0



* [PATCH v6 08/29] x86/idle: Mark x86_idle static call as __ro_after_init
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

Later commits will cause objtool to warn about static calls being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

x86_idle is updated in:
o xen_set_default_idle() <- __init xen_arch_setup()
o __init select_idle_routine()

IOW purely init context, and can thus be marked as __ro_after_init.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/process.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1b7960cf6eb0c..1c2669d9b0396 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -761,7 +761,7 @@ void __cpuidle default_idle(void)
 EXPORT_SYMBOL(default_idle);
 #endif
 
-DEFINE_STATIC_CALL_NULL(x86_idle, default_idle);
+DEFINE_STATIC_CALL_NULL_RO(x86_idle, default_idle);
 
 static bool x86_idle_set(void)
 {
-- 
2.51.0



* [PATCH v6 09/29] x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

The static call is only ever updated in

  __init pv_time_init()
  __init xen_init_time_common()
  __init vmware_paravirt_ops_setup()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 34b6fa3fcc045..f320a9617b1d6 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -68,7 +68,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
 
 void paravirt_set_sched_clock(u64 (*func)(void))
-- 
2.51.0



* [PATCH v6 10/29] riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Andrew Jones, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

The static call is only ever updated in:

  __init pv_time_init()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
---
 arch/riscv/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/kernel/paravirt.c b/arch/riscv/kernel/paravirt.c
index fa6b0339a65de..dfe8808016fd8 100644
--- a/arch/riscv/kernel/paravirt.c
+++ b/arch/riscv/kernel/paravirt.c
@@ -30,7 +30,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 static bool steal_acc = true;
 static int __init parse_no_stealacc(char *arg)
-- 
2.51.0



* [PATCH v6 11/29] loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

The static call is only ever updated in

  __init pv_time_init()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/loongarch/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c
index b1b51f920b231..9ec3f5c31fdab 100644
--- a/arch/loongarch/kernel/paravirt.c
+++ b/arch/loongarch/kernel/paravirt.c
@@ -19,7 +19,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 static bool steal_acc = true;
 
-- 
2.51.0



* [PATCH v6 12/29] arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

The static call is only ever updated in

  __init pv_time_init()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/arm64/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c
index aa718d6a9274a..ad28fa23c9228 100644
--- a/arch/arm64/kernel/paravirt.c
+++ b/arch/arm64/kernel/paravirt.c
@@ -32,7 +32,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 struct pv_time_stolen_time_region {
 	struct pvclock_vcpu_stolen_time __rcu *kaddr;
-- 
2.51.0



* [PATCH v6 13/29] arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

The static call is only ever updated in

  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/arm/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kernel/paravirt.c b/arch/arm/kernel/paravirt.c
index 7dd9806369fb0..632d8d5e06db3 100644
--- a/arch/arm/kernel/paravirt.c
+++ b/arch/arm/kernel/paravirt.c
@@ -20,4 +20,4 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
-- 
2.51.0



* [PATCH v6 14/29] perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

Later commits will cause objtool to warn about static calls being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

perf_lopwr_cb is used in .noinstr code, but is only ever updated in __init
amd_brs_lopwr_init(), and can thus be marked as __ro_after_init.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/events/amd/brs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
index 06f35a6b58a5b..71d7ba774a063 100644
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -423,7 +423,7 @@ void noinstr perf_amd_brs_lopwr_cb(bool lopwr_in)
 	}
 }
 
-DEFINE_STATIC_CALL_NULL(perf_lopwr_cb, perf_amd_brs_lopwr_cb);
+DEFINE_STATIC_CALL_NULL_RO(perf_lopwr_cb, perf_amd_brs_lopwr_cb);
 EXPORT_STATIC_CALL_TRAMP_GPL(perf_lopwr_cb);
 
 void __init amd_brs_lopwr_init(void)
-- 
2.51.0



* [PATCH v6 15/29] sched/clock: Mark sched_clock_running key as __ro_after_init
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

sched_clock_running is only ever enabled in the __init functions
sched_clock_init() and sched_clock_init_late(), and is never disabled. Mark
it __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/clock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index f5e6dd6a6b3af..c1a028e99d2cd 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -69,7 +69,7 @@ notrace unsigned long long __weak sched_clock(void)
 }
 EXPORT_SYMBOL_GPL(sched_clock);
 
-static DEFINE_STATIC_KEY_FALSE(sched_clock_running);
+static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 /*
-- 
2.51.0



* [PATCH v6 16/29] KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Lutomirski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik

The static key is only ever enabled in

  __init hv_init_evmcs()

so mark it appropriately as __ro_after_init.

Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kvm/vmx/vmx_onhyperv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx_onhyperv.c b/arch/x86/kvm/vmx/vmx_onhyperv.c
index b9a8b91166d02..ff3d80c9565bb 100644
--- a/arch/x86/kvm/vmx/vmx_onhyperv.c
+++ b/arch/x86/kvm/vmx/vmx_onhyperv.c
@@ -3,7 +3,7 @@
 #include "capabilities.h"
 #include "vmx_onhyperv.h"
 
-DEFINE_STATIC_KEY_FALSE(__kvm_is_using_evmcs);
+DEFINE_STATIC_KEY_FALSE_RO(__kvm_is_using_evmcs);
 
 /*
  * KVM on Hyper-V always uses the latest known eVMCSv1 revision, the assumption
-- 
2.51.0


* [PATCH v6 17/29] x86/speculation/mds: Mark cpu_buf_idle_clear key as allowed in .noinstr
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (15 preceding siblings ...)
  2025-10-10 15:38 ` [PATCH v6 16/29] KVM: VMX: Mark __kvm_is_using_evmcs static " Valentin Schneider
@ 2025-10-10 15:38 ` Valentin Schneider
  2025-10-10 15:38 ` [PATCH v6 18/29] sched/clock, x86: Mark __sched_clock_stable " Valentin Schneider
                   ` (13 subsequent siblings)
  30 siblings, 0 replies; 54+ messages in thread
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

cpu_buf_idle_clear is used in .noinstr code, and can be modified at
runtime (SMT hotplug). Suppressing the text_poke_sync() IPI has little
benefit for this key, as hotplug implies eventually going through
takedown_cpu() -> stop_machine_cpuslocked(), which is going to cause
interference on all online CPUs anyway.
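
For reference, the hotplug path in question (a call-chain sketch; SMT
hotplug already perturbs every online CPU regardless of this key):

  takedown_cpu()
    stop_machine_cpuslocked()  // runs the stopper thread on all online CPUs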

Mark it to let objtool know not to warn about it.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/cpu/bugs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 36dcfc5105be9..3aca329a853b5 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -182,7 +182,7 @@ DEFINE_STATIC_KEY_FALSE(switch_vcpu_ibpb);
 EXPORT_SYMBOL_GPL(switch_vcpu_ibpb);
 
 /* Control CPU buffer clear before idling (halt, mwait) */
-DEFINE_STATIC_KEY_FALSE(cpu_buf_idle_clear);
+DEFINE_STATIC_KEY_FALSE_NOINSTR(cpu_buf_idle_clear);
 EXPORT_SYMBOL_GPL(cpu_buf_idle_clear);
 
 /*
-- 
2.51.0


* [PATCH v6 18/29] sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (16 preceding siblings ...)
  2025-10-10 15:38 ` [PATCH v6 17/29] x86/speculation/mds: Mark cpu_buf_idle_clear key as allowed in .noinstr Valentin Schneider
@ 2025-10-10 15:38 ` Valentin Schneider
  2025-10-10 15:38 ` [PATCH v6 19/29] KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys " Valentin Schneider
                   ` (12 subsequent siblings)
  30 siblings, 0 replies; 54+ messages in thread
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

__sched_clock_stable is used in .noinstr code, and can be modified at
runtime (e.g. time_cpufreq_notifier()). Suppressing the text_poke_sync()
IPI has little benefit for this key, as NOHZ_FULL is incompatible with an
unstable TSC anyway.

Mark it to let objtool know not to warn about it.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/clock.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index c1a028e99d2cd..e6ef71d74ac95 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -78,8 +78,11 @@ static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
  *
  * Similarly we start with __sched_clock_stable_early, thereby assuming we
  * will become stable, such that there's only a single 1 -> 0 transition.
+ *
+ * Allowed in .noinstr as an unstable TSC is incompatible with NOHZ_FULL,
+ * thus the text patching IPI would be the least of our concerns.
  */
-static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(__sched_clock_stable);
 static int __sched_clock_stable_early = 1;
 
 /*
-- 
2.51.0


* [PATCH v6 19/29] KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as allowed in .noinstr
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (17 preceding siblings ...)
  2025-10-10 15:38 ` [PATCH v6 18/29] sched/clock, x86: Mark __sched_clock_stable " Valentin Schneider
@ 2025-10-10 15:38 ` Valentin Schneider
  2025-10-14  0:01   ` Sean Christopherson
  2025-10-10 15:38 ` [PATCH v6 20/29] stackleak: Mark stack_erasing_bypass key " Valentin Schneider
                   ` (11 subsequent siblings)
  30 siblings, 1 reply; 54+ messages in thread
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

These keys are used in .noinstr code, and can be modified at runtime (via
the vmentry_l1d_flush module parameter). However, it is not expected that
they will be flipped during latency-sensitive operations, and thus they
shouldn't be a source of interference wrt the text patching IPI.

Mark them to let objtool know not to warn about them.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index aa157fe5b7b31..dce2bd7375ec8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -204,8 +204,15 @@ module_param(pt_mode, int, S_IRUGO);
 
 struct x86_pmu_lbr __ro_after_init vmx_lbr_caps;
 
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
+/*
+ * Both of these static keys end up being used in .noinstr sections, however
+ * they are only modified:
+ * - at init
+ * - from a vmentry_l1d_flush module parameter write
+ * thus during latency-sensitive operations they should remain stable.
+ */
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_should_flush);
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_flush_cond);
 static DEFINE_MUTEX(vmx_l1d_flush_mutex);
 
 /* Storage for pre module init parameter parsing */
-- 
2.51.0


* [PATCH v6 20/29] stackleak: Mark stack_erasing_bypass key as allowed in .noinstr
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (18 preceding siblings ...)
  2025-10-10 15:38 ` [PATCH v6 19/29] KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys " Valentin Schneider
@ 2025-10-10 15:38 ` Valentin Schneider
  2025-10-10 15:38 ` [PATCH v6 21/29] objtool: Add noinstr validation for static branches/calls Valentin Schneider
                   ` (10 subsequent siblings)
  30 siblings, 0 replies; 54+ messages in thread
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

stack_erasing_bypass is used in .noinstr code, and can be modified at runtime
(/proc/sys/kernel/stack_erasing write). However, it is not expected that it
will be flipped during latency-sensitive operations, and thus it shouldn't be
a source of interference wrt the text patching IPI.

Mark it to let objtool know not to warn about it.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/kstack_erase.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/kstack_erase.c b/kernel/kstack_erase.c
index e49bb88b4f0a3..1abc94a38ce14 100644
--- a/kernel/kstack_erase.c
+++ b/kernel/kstack_erase.c
@@ -19,7 +19,11 @@
 #include <linux/sysctl.h>
 #include <linux/init.h>
 
-static DEFINE_STATIC_KEY_FALSE(stack_erasing_bypass);
+/*
+ * This static key can only be modified via its sysctl interface. It is
+ * expected it will remain stable during latency-sensitive operations.
+ */
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(stack_erasing_bypass);
 
 #ifdef CONFIG_SYSCTL
 static int stack_erasing_sysctl(const struct ctl_table *table, int write,
-- 
2.51.0


* [PATCH v6 21/29] objtool: Add noinstr validation for static branches/calls
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (19 preceding siblings ...)
  2025-10-10 15:38 ` [PATCH v6 20/29] stackleak: Mark stack_erasing_bypass key " Valentin Schneider
@ 2025-10-10 15:38 ` Valentin Schneider
  2025-10-10 15:38 ` [PATCH v6 22/29] module: Add MOD_NOINSTR_TEXT mem_type Valentin Schneider
                   ` (9 subsequent siblings)
  30 siblings, 0 replies; 54+ messages in thread
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

From: Josh Poimboeuf <jpoimboe@kernel.org>

Warn about static branches/calls in noinstr regions, unless the
corresponding key is RO-after-init or has been manually whitelisted with
DEFINE_STATIC_KEY_*_NOINSTR().
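
As an illustration, the two accepted ways of defining a key that is used
in .noinstr code (key names here are made up):

  /* Init-only key: __ro_after_init, implicitly allowed in .noinstr */
  static DEFINE_STATIC_KEY_FALSE_RO(my_boot_key);

  /*
   * Runtime-flipped key: must be explicitly whitelisted, with a comment
   * justifying why eating the text patching IPI is acceptable.
   */
  static DEFINE_STATIC_KEY_FALSE_NOINSTR(my_runtime_key);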

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
[Added NULL check for insn_call_dest() return value]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/jump_label.h              | 17 +++--
 include/linux/objtool.h                 |  7 ++
 include/linux/static_call.h             |  3 +
 tools/objtool/Documentation/objtool.txt | 34 +++++++++
 tools/objtool/check.c                   | 92 ++++++++++++++++++++++---
 tools/objtool/include/objtool/check.h   |  1 +
 tools/objtool/include/objtool/elf.h     |  1 +
 tools/objtool/include/objtool/special.h |  1 +
 tools/objtool/special.c                 | 15 +++-
 9 files changed, 155 insertions(+), 16 deletions(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index c4f6240ff4d95..0ea203ebbc493 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -76,6 +76,7 @@
 #include <linux/types.h>
 #include <linux/compiler.h>
 #include <linux/cleanup.h>
+#include <linux/objtool.h>
 
 extern bool static_key_initialized;
 
@@ -376,8 +377,9 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_TRUE(name)	\
 	struct static_key_true name = STATIC_KEY_TRUE_INIT
 
-#define DEFINE_STATIC_KEY_TRUE_RO(name)	\
-	struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT
+#define DEFINE_STATIC_KEY_TRUE_RO(name)						\
+	struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT;	\
+	ANNOTATE_NOINSTR_ALLOWED(name)
 
 #define DECLARE_STATIC_KEY_TRUE(name)	\
 	extern struct static_key_true name
@@ -385,8 +387,9 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_FALSE(name)	\
 	struct static_key_false name = STATIC_KEY_FALSE_INIT
 
-#define DEFINE_STATIC_KEY_FALSE_RO(name)	\
-	struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT
+#define DEFINE_STATIC_KEY_FALSE_RO(name)					\
+	struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT;	\
+	ANNOTATE_NOINSTR_ALLOWED(name)
 
 /*
  * The _NOINSTR variants are used to tell objtool the static key is allowed to
@@ -400,10 +403,12 @@ struct static_key_false {
  * definition with the rationale.
  */
 #define DEFINE_STATIC_KEY_TRUE_NOINSTR(name)					\
-	DEFINE_STATIC_KEY_TRUE(name)
+	DEFINE_STATIC_KEY_TRUE(name);						\
+	ANNOTATE_NOINSTR_ALLOWED(name)
 
 #define DEFINE_STATIC_KEY_FALSE_NOINSTR(name)					\
-	DEFINE_STATIC_KEY_FALSE(name)
+	DEFINE_STATIC_KEY_FALSE(name);						\
+	ANNOTATE_NOINSTR_ALLOWED(name)
 
 #define DECLARE_STATIC_KEY_FALSE(name)	\
 	extern struct static_key_false name
diff --git a/include/linux/objtool.h b/include/linux/objtool.h
index 366ad004d794b..2d3661de4cf95 100644
--- a/include/linux/objtool.h
+++ b/include/linux/objtool.h
@@ -34,6 +34,12 @@
 	static void __used __section(".discard.func_stack_frame_non_standard") \
 		*__func_stack_frame_non_standard_##func = func
 
+#define __ANNOTATE_NOINSTR_ALLOWED(key) \
+	static void __used __section(".discard.noinstr_allowed") \
+		*__annotate_noinstr_allowed_##key = &key
+
+#define ANNOTATE_NOINSTR_ALLOWED(key) __ANNOTATE_NOINSTR_ALLOWED(key)
+
 /*
  * STACK_FRAME_NON_STANDARD_FP() is a frame-pointer-specific function ignore
  * for the case where a function is intentionally missing frame pointer setup,
@@ -130,6 +136,7 @@
 #define STACK_FRAME_NON_STANDARD_FP(func)
 #define __ASM_ANNOTATE(label, type) ""
 #define ASM_ANNOTATE(type)
+#define ANNOTATE_NOINSTR_ALLOWED(key)
 #else
 .macro UNWIND_HINT type:req sp_reg=0 sp_offset=0 signal=0
 .endm
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index ea6ca57e2a829..0d4b16d348501 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -133,6 +133,7 @@
 
 #include <linux/types.h>
 #include <linux/cpu.h>
+#include <linux/objtool.h>
 #include <linux/static_call_types.h>
 
 #ifdef CONFIG_HAVE_STATIC_CALL
@@ -198,6 +199,7 @@ extern long __static_call_return0(void);
 		.func = _func,						\
 		.type = 1,						\
 	};								\
+	ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name));		\
 	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
 
 #define DEFINE_STATIC_CALL_NULL(name, _func)				\
@@ -214,6 +216,7 @@ extern long __static_call_return0(void);
 		.func = NULL,						\
 		.type = 1,						\
 	};								\
+	ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name));		\
 	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
 
 #define DEFINE_STATIC_CALL_RET0(name, _func)				\
diff --git a/tools/objtool/Documentation/objtool.txt b/tools/objtool/Documentation/objtool.txt
index 9e97fc25b2d8a..991e085e10d95 100644
--- a/tools/objtool/Documentation/objtool.txt
+++ b/tools/objtool/Documentation/objtool.txt
@@ -456,6 +456,40 @@ the objtool maintainers.
     these special names and does not use module_init() / module_exit()
     macros to create them.
 
+13. file.o: warning: func()+0x2a: key: non-RO static key usage in noinstr code
+    file.o: warning: func()+0x2a: key: non-RO static call usage in noinstr code
+
+  This means that noinstr function func() uses a static key or
+  static call named 'key' which can be modified at runtime.  This is
+  discouraged because it prevents code patching IPIs from being
+  deferred.
+
+  You have the following options:
+
+  1) Check whether the static key/call in question is only modified
+     during init.  If so, define it as read-only-after-init with
+     DEFINE_STATIC_KEY_*_RO() or DEFINE_STATIC_CALL_RO().
+
+  2) Avoid the runtime patching.  For static keys this can be done by
+     using static_key_enabled() or by getting rid of the static key
+     altogether if performance is not a concern.
+
+     For static calls, something like the following could be done:
+
+       target = static_call_query(foo);
+       if (target == func1)
+               func1();
+       else if (target == func2)
+               func2();
+       ...
+
+  3) Silence the warning by defining the static key/call with
+     DEFINE_STATIC_*_NOINSTR().  This decision should not
+     be taken lightly as it may result in code patching IPIs getting
+     sent to isolated NOHZ_FULL CPUs running in pure userspace.  A
+     comment should be added above the definition explaining the
+     rationale for the decision.
+
 
 If the error doesn't seem to make sense, it could be a bug in objtool.
 Feel free to ask objtool maintainers for help.
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index fa9e64a38b2b6..b749f13251a6f 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -981,6 +981,45 @@ static int create_direct_call_sections(struct objtool_file *file)
 	return 0;
 }
 
+static int read_noinstr_allowed(struct objtool_file *file)
+{
+	struct section *rsec;
+	struct symbol *sym;
+	struct reloc *reloc;
+
+	rsec = find_section_by_name(file->elf, ".rela.discard.noinstr_allowed");
+	if (!rsec)
+		return 0;
+
+	for_each_reloc(rsec, reloc) {
+		switch (reloc->sym->type) {
+		case STT_OBJECT:
+		case STT_FUNC:
+			sym = reloc->sym;
+			break;
+
+		case STT_SECTION:
+			sym = find_symbol_by_offset(reloc->sym->sec,
+						    reloc_addend(reloc));
+			if (!sym) {
+				WARN_FUNC(reloc->sym->sec, reloc_addend(reloc),
+					  "can't find static key/call symbol");
+				return -1;
+			}
+			break;
+
+		default:
+			WARN("unexpected relocation symbol type in %s: %d",
+			     rsec->name, reloc->sym->type);
+			return -1;
+		}
+
+		sym->noinstr_allowed = 1;
+	}
+
+	return 0;
+}
+
 /*
  * Warnings shouldn't be reported for ignored functions.
  */
@@ -1867,6 +1906,8 @@ static int handle_jump_alt(struct objtool_file *file,
 		return -1;
 	}
 
+	orig_insn->key = special_alt->key;
+
 	if (opts.hack_jump_label && special_alt->key_addend & 2) {
 		struct reloc *reloc = insn_reloc(file, orig_insn);
 
@@ -2600,6 +2641,10 @@ static int decode_sections(struct objtool_file *file)
 	if (ret)
 		return ret;
 
+	ret = read_noinstr_allowed(file);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 
@@ -3369,9 +3414,9 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
 	return file->pv_ops[idx].clean;
 }
 
-static inline bool noinstr_call_dest(struct objtool_file *file,
-				     struct instruction *insn,
-				     struct symbol *func)
+static inline bool noinstr_call_allowed(struct objtool_file *file,
+					struct instruction *insn,
+					struct symbol *func)
 {
 	/*
 	 * We can't deal with indirect function calls at present;
@@ -3391,10 +3436,10 @@ static inline bool noinstr_call_dest(struct objtool_file *file,
 		return true;
 
 	/*
-	 * If the symbol is a static_call trampoline, we can't tell.
+	 * Only DEFINE_STATIC_CALL_*_RO allowed.
 	 */
 	if (func->static_call_tramp)
-		return true;
+		return func->noinstr_allowed;
 
 	/*
 	 * The __ubsan_handle_*() calls are like WARN(), they only happen when
@@ -3407,14 +3452,29 @@ static inline bool noinstr_call_dest(struct objtool_file *file,
 	return false;
 }
 
+static char *static_call_name(struct symbol *func)
+{
+	return func->name + strlen("__SCT__");
+}
+
 static int validate_call(struct objtool_file *file,
 			 struct instruction *insn,
 			 struct insn_state *state)
 {
-	if (state->noinstr && state->instr <= 0 &&
-	    !noinstr_call_dest(file, insn, insn_call_dest(insn))) {
-		WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn));
-		return 1;
+	if (state->noinstr && state->instr <= 0) {
+		struct symbol *dest = insn_call_dest(insn);
+
+		if (dest && dest->static_call_tramp) {
+			if (!dest->noinstr_allowed) {
+				WARN_INSN(insn, "%s: non-RO static call usage in noinstr",
+					  static_call_name(dest));
+			}
+
+		} else if (dest && !noinstr_call_allowed(file, insn, dest)) {
+			WARN_INSN(insn, "call to %s() leaves .noinstr.text section",
+				  call_dest_name(file, insn));
+			return 1;
+		}
 	}
 
 	if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) {
@@ -3479,6 +3539,17 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct
 	return 0;
 }
 
+static int validate_static_key(struct instruction *insn, struct insn_state *state)
+{
+	if (state->noinstr && state->instr <= 0 && !insn->key->noinstr_allowed) {
+		WARN_INSN(insn, "%s: non-RO static key usage in noinstr",
+			  insn->key->name);
+		return 1;
+	}
+
+	return 0;
+}
+
 static struct instruction *next_insn_to_validate(struct objtool_file *file,
 						 struct instruction *insn)
 {
@@ -3666,6 +3737,9 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
 		if (handle_insn_ops(insn, next_insn, &state))
 			return 1;
 
+		if (insn->key)
+			validate_static_key(insn, &state);
+
 		switch (insn->type) {
 
 		case INSN_RETURN:
diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h
index 00fb745e72339..d79b08f55bcbc 100644
--- a/tools/objtool/include/objtool/check.h
+++ b/tools/objtool/include/objtool/check.h
@@ -81,6 +81,7 @@ struct instruction {
 	struct symbol *sym;
 	struct stack_op *stack_ops;
 	struct cfi_state *cfi;
+	struct symbol *key;
 };
 
 static inline struct symbol *insn_func(struct instruction *insn)
diff --git a/tools/objtool/include/objtool/elf.h b/tools/objtool/include/objtool/elf.h
index 0a2fa3ac00793..acd610ad26f17 100644
--- a/tools/objtool/include/objtool/elf.h
+++ b/tools/objtool/include/objtool/elf.h
@@ -70,6 +70,7 @@ struct symbol {
 	u8 local_label       : 1;
 	u8 frame_pointer     : 1;
 	u8 ignore	     : 1;
+	u8 noinstr_allowed   : 1;
 	struct list_head pv_target;
 	struct reloc *relocs;
 	struct section *group_sec;
diff --git a/tools/objtool/include/objtool/special.h b/tools/objtool/include/objtool/special.h
index 72d09c0adf1a1..e84d704f3f20e 100644
--- a/tools/objtool/include/objtool/special.h
+++ b/tools/objtool/include/objtool/special.h
@@ -18,6 +18,7 @@ struct special_alt {
 	bool group;
 	bool jump_or_nop;
 	u8 key_addend;
+	struct symbol *key;
 
 	struct section *orig_sec;
 	unsigned long orig_off;
diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index c80fed8a840ee..d77f3fa4bbbc9 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -110,13 +110,26 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
 
 	if (entry->key) {
 		struct reloc *key_reloc;
+		struct symbol *key;
+		s64 key_addend;
 
 		key_reloc = find_reloc_by_dest(elf, sec, offset + entry->key);
 		if (!key_reloc) {
 			ERROR_FUNC(sec, offset + entry->key, "can't find key reloc");
 			return -1;
 		}
-		alt->key_addend = reloc_addend(key_reloc);
+
+		key = key_reloc->sym;
+		key_addend = reloc_addend(key_reloc);
+
+		if (key->type == STT_SECTION)
+			key = find_symbol_by_offset(key->sec, key_addend & ~3);
+
+		/* embedded keys not supported */
+		if (key) {
+			alt->key = key;
+			alt->key_addend = key_addend;
+		}
 	}
 
 	return 0;
-- 
2.51.0


* [PATCH v6 22/29] module: Add MOD_NOINSTR_TEXT mem_type
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (20 preceding siblings ...)
  2025-10-10 15:38 ` [PATCH v6 21/29] objtool: Add noinstr validation for static branches/calls Valentin Schneider
@ 2025-10-10 15:38 ` Valentin Schneider
  2025-10-10 15:38 ` [PATCH v6 23/29] context-tracking: Introduce work deferral infrastructure Valentin Schneider
                   ` (8 subsequent siblings)
  30 siblings, 0 replies; 54+ messages in thread
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

As pointed out by Sean [1], is_kernel_noinstr_text() will return false for
an address contained within a module's .noinstr.text section. A later patch
will require checking whether a text address is noinstr, and such an address
can unfortunately live in a module - KVM being one such case.

A module's .noinstr.text section is already tracked as of commit
  66e9b0717102 ("kprobes: Prevent probes in .noinstr.text section")
for kprobe blacklisting purposes, but via an ad-hoc mechanism.

Add a MOD_NOINSTR_TEXT mem_type, and reorganize __layout_sections() so that
it maps all the sections in a single invocation.

[1]: http://lore.kernel.org/r/Z4qQL89GZ_gk0vpu@google.com
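
For illustration, the kind of combined check a later patch can then perform
(the helper name is made up for this sketch):

  static bool addr_is_noinstr_text(unsigned long addr)
  {
          /* Covers both vmlinux and module .noinstr.text */
          return is_kernel_noinstr_text(addr) ||
                 is_module_noinstr_text_address(addr);
  }
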
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/module.h |  6 ++--
 kernel/kprobes.c       |  8 ++---
 kernel/module/main.c   | 76 ++++++++++++++++++++++++++++++++----------
 3 files changed, 66 insertions(+), 24 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index 3319a5269d286..825e2a072184a 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -314,6 +314,7 @@ struct mod_tree_node {
 
 enum mod_mem_type {
 	MOD_TEXT = 0,
+	MOD_NOINSTR_TEXT,
 	MOD_DATA,
 	MOD_RODATA,
 	MOD_RO_AFTER_INIT,
@@ -484,8 +485,6 @@ struct module {
 	void __percpu *percpu;
 	unsigned int percpu_size;
 #endif
-	void *noinstr_text_start;
-	unsigned int noinstr_text_size;
 
 #ifdef CONFIG_TRACEPOINTS
 	unsigned int num_tracepoints;
@@ -614,12 +613,13 @@ static inline bool module_is_coming(struct module *mod)
         return mod->state == MODULE_STATE_COMING;
 }
 
-struct module *__module_text_address(unsigned long addr);
 struct module *__module_address(unsigned long addr);
+struct module *__module_text_address(unsigned long addr);
 bool is_module_address(unsigned long addr);
 bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr);
 bool is_module_percpu_address(unsigned long addr);
 bool is_module_text_address(unsigned long addr);
+bool is_module_noinstr_text_address(unsigned long addr);
 
 static inline bool within_module_mem_type(unsigned long addr,
 					  const struct module *mod,
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index ab8f9fc1f0d17..d60560dddec56 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2551,9 +2551,9 @@ static void add_module_kprobe_blacklist(struct module *mod)
 		kprobe_add_area_blacklist(start, end);
 	}
 
-	start = (unsigned long)mod->noinstr_text_start;
+	start = (unsigned long)mod->mem[MOD_NOINSTR_TEXT].base;
 	if (start) {
-		end = start + mod->noinstr_text_size;
+		end = start + mod->mem[MOD_NOINSTR_TEXT].size;
 		kprobe_add_area_blacklist(start, end);
 	}
 }
@@ -2574,9 +2574,9 @@ static void remove_module_kprobe_blacklist(struct module *mod)
 		kprobe_remove_area_blacklist(start, end);
 	}
 
-	start = (unsigned long)mod->noinstr_text_start;
+	start = (unsigned long)mod->mem[MOD_NOINSTR_TEXT].base;
 	if (start) {
-		end = start + mod->noinstr_text_size;
+		end = start + mod->mem[MOD_NOINSTR_TEXT].size;
 		kprobe_remove_area_blacklist(start, end);
 	}
 }
diff --git a/kernel/module/main.c b/kernel/module/main.c
index c66b261849362..1f5bfdbb956a7 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1653,7 +1653,17 @@ bool module_init_layout_section(const char *sname)
 	return module_init_section(sname);
 }
 
-static void __layout_sections(struct module *mod, struct load_info *info, bool is_init)
+static bool module_noinstr_layout_section(const char *sname)
+{
+	return strstarts(sname, ".noinstr");
+}
+
+static bool module_default_layout_section(const char *sname)
+{
+	return !module_init_layout_section(sname) && !module_noinstr_layout_section(sname);
+}
+
+static void __layout_sections(struct module *mod, struct load_info *info)
 {
 	unsigned int m, i;
 
@@ -1662,20 +1672,44 @@ static void __layout_sections(struct module *mod, struct load_info *info, bool i
 	 *   Mask of excluded section header flags }
 	 */
 	static const unsigned long masks[][2] = {
+		/* Core */
+		{ SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL },
+		{ SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL },
+		{ SHF_ALLOC, SHF_WRITE | ARCH_SHF_SMALL },
+		{ SHF_RO_AFTER_INIT | SHF_ALLOC, ARCH_SHF_SMALL },
+		{ SHF_WRITE | SHF_ALLOC, ARCH_SHF_SMALL },
+		{ ARCH_SHF_SMALL | SHF_ALLOC, 0 },
+		/* Init */
 		{ SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL },
 		{ SHF_ALLOC, SHF_WRITE | ARCH_SHF_SMALL },
 		{ SHF_RO_AFTER_INIT | SHF_ALLOC, ARCH_SHF_SMALL },
 		{ SHF_WRITE | SHF_ALLOC, ARCH_SHF_SMALL },
-		{ ARCH_SHF_SMALL | SHF_ALLOC, 0 }
+		{ ARCH_SHF_SMALL | SHF_ALLOC, 0 },
 	};
-	static const int core_m_to_mem_type[] = {
+	static bool (*const section_filter[])(const char *) = {
+		/* Core */
+		module_default_layout_section,
+		module_noinstr_layout_section,
+		module_default_layout_section,
+		module_default_layout_section,
+		module_default_layout_section,
+		module_default_layout_section,
+		/* Init */
+		module_init_layout_section,
+		module_init_layout_section,
+		module_init_layout_section,
+		module_init_layout_section,
+		module_init_layout_section,
+	};
+	static const int mem_type_map[] = {
+		/* Core */
 		MOD_TEXT,
+		MOD_NOINSTR_TEXT,
 		MOD_RODATA,
 		MOD_RO_AFTER_INIT,
 		MOD_DATA,
 		MOD_DATA,
-	};
-	static const int init_m_to_mem_type[] = {
+		/* Init */
 		MOD_INIT_TEXT,
 		MOD_INIT_RODATA,
 		MOD_INVALID,
@@ -1684,16 +1718,16 @@ static void __layout_sections(struct module *mod, struct load_info *info, bool i
 	};
 
 	for (m = 0; m < ARRAY_SIZE(masks); ++m) {
-		enum mod_mem_type type = is_init ? init_m_to_mem_type[m] : core_m_to_mem_type[m];
+		enum mod_mem_type type = mem_type_map[m];
 
 		for (i = 0; i < info->hdr->e_shnum; ++i) {
 			Elf_Shdr *s = &info->sechdrs[i];
 			const char *sname = info->secstrings + s->sh_name;
 
-			if ((s->sh_flags & masks[m][0]) != masks[m][0]
-			    || (s->sh_flags & masks[m][1])
-			    || s->sh_entsize != ~0UL
-			    || is_init != module_init_layout_section(sname))
+			if ((s->sh_flags & masks[m][0]) != masks[m][0] ||
+			    (s->sh_flags & masks[m][1])                ||
+			    s->sh_entsize != ~0UL                      ||
+			    !section_filter[m](sname))
 				continue;
 
 			if (WARN_ON_ONCE(type == MOD_INVALID))
@@ -1733,10 +1767,7 @@ static void layout_sections(struct module *mod, struct load_info *info)
 		info->sechdrs[i].sh_entsize = ~0UL;
 
 	pr_debug("Core section allocation order for %s:\n", mod->name);
-	__layout_sections(mod, info, false);
-
-	pr_debug("Init section allocation order for %s:\n", mod->name);
-	__layout_sections(mod, info, true);
+	__layout_sections(mod, info);
 }
 
 static void module_license_taint_check(struct module *mod, const char *license)
@@ -2625,9 +2656,6 @@ static int find_module_sections(struct module *mod, struct load_info *info)
 	}
 #endif
 
-	mod->noinstr_text_start = section_objs(info, ".noinstr.text", 1,
-						&mod->noinstr_text_size);
-
 #ifdef CONFIG_TRACEPOINTS
 	mod->tracepoints_ptrs = section_objs(info, "__tracepoints_ptrs",
 					     sizeof(*mod->tracepoints_ptrs),
@@ -3872,12 +3900,26 @@ struct module *__module_text_address(unsigned long addr)
 	if (mod) {
 		/* Make sure it's within the text section. */
 		if (!within_module_mem_type(addr, mod, MOD_TEXT) &&
+		    !within_module_mem_type(addr, mod, MOD_NOINSTR_TEXT) &&
 		    !within_module_mem_type(addr, mod, MOD_INIT_TEXT))
 			mod = NULL;
 	}
 	return mod;
 }
 
+bool is_module_noinstr_text_address(unsigned long addr)
+{
+	scoped_guard(preempt) {
+		struct module *mod = __module_address(addr);
+
+		/* Make sure it's within the .noinstr.text section. */
+		if (mod)
+			return within_module_mem_type(addr, mod, MOD_NOINSTR_TEXT);
+	}
+
+	return false;
+}
+
 /* Don't grab lock, we're oopsing. */
 void print_modules(void)
 {
-- 
2.51.0


* [PATCH v6 23/29] context-tracking: Introduce work deferral infrastructure
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (21 preceding siblings ...)
  2025-10-10 15:38 ` [PATCH v6 22/29] module: Add MOD_NOINSTR_TEXT mem_type Valentin Schneider
@ 2025-10-10 15:38 ` Valentin Schneider
  2025-10-28 14:00   ` Frederic Weisbecker
  2025-10-10 15:38 ` [PATCH v6 24/29] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
                   ` (7 subsequent siblings)
  30 siblings, 1 reply; 54+ messages in thread
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Nicolas Saenz Julienne, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Lutomirski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik

smp_call_function() & friends have the unfortunate habit of sending IPIs to
isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online
CPUs.

Some callsites can be bent into doing the right thing, such as done by commit:

  cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")

Unfortunately, not all SMP callbacks can be omitted in this
fashion. However, some of them only affect execution in kernelspace, which
means they don't have to be executed *immediately* if the target CPU is in
userspace: stashing the callback and executing it upon the next kernel entry
would suffice. x86 kernel instruction patching and kernel TLB invalidation
are prime examples.

Reduce the RCU dynticks counter width to free up some bits to be used as a
deferred callback bitmask. Add some build-time checks to validate that
setup.

Presence of CT_RCU_WATCHING in the ct_state prevents queuing deferred work.

Later commits introduce the bit:callback mappings.
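
For illustration, a minimal usage sketch from the IPI sender's side (the
callsite and work function are hypothetical; CT_WORK_n is the placeholder
bit introduced here):

  /* Try to defer; send an immediate IPI if the CPU is in kernelspace */
  if (!ct_set_cpu_work(cpu, CT_WORK_n))
          smp_call_function_single(cpu, do_work_fn, NULL, 1);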

Link: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/Kconfig                                 |  9 +++
 arch/x86/Kconfig                             |  1 +
 arch/x86/include/asm/context_tracking_work.h | 16 +++++
 include/linux/context_tracking.h             | 21 ++++++
 include/linux/context_tracking_state.h       | 30 ++++++---
 include/linux/context_tracking_work.h        | 26 ++++++++
 kernel/context_tracking.c                    | 69 +++++++++++++++++++-
 kernel/time/Kconfig                          |  5 ++
 8 files changed, 165 insertions(+), 12 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

diff --git a/arch/Kconfig b/arch/Kconfig
index d1b4ffd6e0856..a33229e017467 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -968,6 +968,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK
 	  - No use of instrumentation, unless instrumentation_begin() got
 	    called.
 
+config HAVE_CONTEXT_TRACKING_WORK
+	bool
+	help
+	  Architecture supports deferring work while not in kernel context.
+	  This is especially useful on setups with isolated CPUs that might
+	  want to avoid being interrupted to perform housekeeping tasks (for
+	  ex. TLB invalidation or icache invalidation). The housekeeping
+	  operations are performed upon re-entering the kernel.
+
 config HAVE_TIF_NOHZ
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 05880301212e3..3f1557b7acd8f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -222,6 +222,7 @@ config X86
 	select HAVE_CMPXCHG_LOCAL
 	select HAVE_CONTEXT_TRACKING_USER		if X86_64
 	select HAVE_CONTEXT_TRACKING_USER_OFFSTACK	if HAVE_CONTEXT_TRACKING_USER
+	select HAVE_CONTEXT_TRACKING_WORK		if X86_64
 	select HAVE_C_RECORDMCOUNT
 	select HAVE_OBJTOOL_MCOUNT		if HAVE_OBJTOOL
 	select HAVE_OBJTOOL_NOP_MCOUNT		if HAVE_OBJTOOL_MCOUNT
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
new file mode 100644
index 0000000000000..5f3b2d0977235
--- /dev/null
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
+#define _ASM_X86_CONTEXT_TRACKING_WORK_H
+
+static __always_inline void arch_context_tracking_work(enum ct_work work)
+{
+	switch (work) {
+	case CT_WORK_n:
+		// Do work...
+		break;
+	case CT_WORK_MAX:
+		WARN_ON_ONCE(true);
+	}
+}
+
+#endif
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index af9fe87a09225..0b0faa040e9b5 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -5,6 +5,7 @@
 #include <linux/sched.h>
 #include <linux/vtime.h>
 #include <linux/context_tracking_state.h>
+#include <linux/context_tracking_work.h>
 #include <linux/instrumentation.h>
 
 #include <asm/ptrace.h>
@@ -137,6 +138,26 @@ static __always_inline unsigned long ct_state_inc(int incby)
 	return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state));
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static __always_inline unsigned long ct_state_inc_clear_work(int incby)
+{
+	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
+	unsigned long new, old, state;
+
+	state = arch_atomic_read(&ct->state);
+	do {
+		old = state;
+		new = old & ~CT_WORK_MASK;
+		new += incby;
+		state = arch_atomic_cmpxchg(&ct->state, old, new);
+	} while (old != state);
+
+	return new;
+}
+#else
+#define ct_state_inc_clear_work(x) ct_state_inc(x)
+#endif
+
 static __always_inline bool warn_rcu_enter(void)
 {
 	bool ret = false;
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 0b81248aa03e2..d2c302133672f 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -5,6 +5,7 @@
 #include <linux/percpu.h>
 #include <linux/static_key.h>
 #include <linux/context_tracking_irq.h>
+#include <linux/context_tracking_work.h>
 
 /* Offset to allow distinguishing irq vs. task-based idle entry/exit. */
 #define CT_NESTING_IRQ_NONIDLE	((LONG_MAX / 2) + 1)
@@ -39,16 +40,19 @@ struct context_tracking {
 };
 
 /*
- * We cram two different things within the same atomic variable:
+ * We cram up to three different things within the same atomic variable:
  *
- *                     CT_RCU_WATCHING_START  CT_STATE_START
- *                                |                |
- *                                v                v
- *     MSB [ RCU watching counter ][ context_state ] LSB
- *         ^                       ^
- *         |                       |
- * CT_RCU_WATCHING_END        CT_STATE_END
+ *                     CT_RCU_WATCHING_START                  CT_STATE_START
+ *                                |         CT_WORK_START          |
+ *                                |               |                |
+ *                                v               v                v
+ *     MSB [ RCU watching counter ][ context work ][ context_state ] LSB
+ *         ^                       ^               ^
+ *         |                       |               |
+ *         |                  CT_WORK_END          |
+ * CT_RCU_WATCHING_END                        CT_STATE_END
  *
+ * The [ context work ] region spans 0 bits if CONFIG_CONTEXT_TRACKING_WORK=n
  * Bits are used from the LSB upwards, so unused bits (if any) will always be in
  * upper bits of the variable.
  */
@@ -59,18 +63,24 @@ struct context_tracking {
 #define CT_STATE_START 0
 #define CT_STATE_END   (CT_STATE_START + CT_STATE_WIDTH - 1)
 
-#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH)
+#define CT_WORK_WIDTH (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? CT_WORK_MAX_OFFSET : 0)
+#define CT_WORK_START (CT_STATE_END + 1)
+#define CT_WORK_END   (CT_WORK_START + CT_WORK_WIDTH - 1)
+
+#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_WORK_WIDTH - CT_STATE_WIDTH)
 #define CT_RCU_WATCHING_WIDTH     (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH)
-#define CT_RCU_WATCHING_START     (CT_STATE_END + 1)
+#define CT_RCU_WATCHING_START     (CT_WORK_END + 1)
 #define CT_RCU_WATCHING_END       (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1)
 #define CT_RCU_WATCHING           BIT(CT_RCU_WATCHING_START)
 
 #define CT_STATE_MASK        GENMASK(CT_STATE_END,        CT_STATE_START)
+#define CT_WORK_MASK         GENMASK(CT_WORK_END,         CT_WORK_START)
 #define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
 
 #define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
 
 static_assert(CT_STATE_WIDTH        +
+	      CT_WORK_WIDTH         +
 	      CT_RCU_WATCHING_WIDTH +
 	      CT_UNUSED_WIDTH       ==
 	      CT_SIZE);
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
new file mode 100644
index 0000000000000..c68245f8d77c5
--- /dev/null
+++ b/include/linux/context_tracking_work.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CONTEXT_TRACKING_WORK_H
+#define _LINUX_CONTEXT_TRACKING_WORK_H
+
+#include <linux/bitops.h>
+
+enum {
+	CT_WORK_n_OFFSET,
+	CT_WORK_MAX_OFFSET
+};
+
+enum ct_work {
+	CT_WORK_n        = BIT(CT_WORK_n_OFFSET),
+	CT_WORK_MAX      = BIT(CT_WORK_MAX_OFFSET)
+};
+
+#include <asm/context_tracking_work.h>
+
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+extern bool ct_set_cpu_work(unsigned int cpu, enum ct_work work);
+#else
+static inline bool
+ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; }
+#endif
+
+#endif
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index fb5be6e9b423f..3238bb1f41ff4 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -72,6 +72,70 @@ static __always_inline void rcu_task_trace_heavyweight_exit(void)
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static noinstr void ct_work_flush(unsigned long seq)
+{
+	int bit;
+
+	seq = (seq & CT_WORK_MASK) >> CT_WORK_START;
+
+	/*
+	 * arch_context_tracking_work() must be noinstr, non-blocking,
+	 * and NMI safe.
+	 */
+	for_each_set_bit(bit, &seq, CT_WORK_MAX)
+		arch_context_tracking_work(BIT(bit));
+}
+
+/**
+ * ct_set_cpu_work - set work to be run at next kernel context entry
+ *
+ * If @cpu is not currently executing in kernelspace, it will execute the
+ * callback mapped to @work (see arch_context_tracking_work()) at its next
+ * entry into ct_kernel_enter_state().
+ *
+ * If it is already executing in kernelspace, this will be a no-op.
+ */
+bool ct_set_cpu_work(unsigned int cpu, enum ct_work work)
+{
+	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
+	unsigned int old;
+	bool ret = false;
+
+	if (!ct->active)
+		return false;
+
+	preempt_disable();
+
+	old = atomic_read(&ct->state);
+
+	/*
+	 * The work bit must only be set if the target CPU is not executing
+	 * in kernelspace.
+	 * CT_RCU_WATCHING is used as a proxy for that - if the bit is set, we
+	 * know for sure the CPU is executing in the kernel whether that be in
+	 * NMI, IRQ or process context.
+	 * Set CT_RCU_WATCHING here and let the cmpxchg do the check for us;
+	 * the state could change between the atomic_read() and the cmpxchg().
+	 */
+	old |= CT_RCU_WATCHING;
+	/*
+	 * Try setting the work until either
+	 * - the target CPU has entered kernelspace
+	 * - the work has been set
+	 */
+	do {
+		ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START));
+	} while (!ret && !(old & CT_RCU_WATCHING));
+
+	preempt_enable();
+	return ret;
+}
+#else
+static __always_inline void ct_work_flush(unsigned long work) { }
+static __always_inline void ct_work_clear(struct context_tracking *ct) { }
+#endif
+
 /*
  * Record entry into an extended quiescent state.  This is only to be
  * called when not already in an extended quiescent state, that is,
@@ -88,7 +152,7 @@ static noinstr void ct_kernel_exit_state(int offset)
 	rcu_task_trace_heavyweight_enter();  // Before CT state update!
 	// RCU is still watching.  Better not be in extended quiescent state!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !rcu_is_watching_curr_cpu());
-	(void)ct_state_inc(offset);
+	(void)ct_state_inc_clear_work(offset);
 	// RCU is no longer watching.
 }
 
@@ -99,7 +163,7 @@ static noinstr void ct_kernel_exit_state(int offset)
  */
 static noinstr void ct_kernel_enter_state(int offset)
 {
-	int seq;
+	unsigned long seq;
 
 	/*
 	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
@@ -107,6 +171,7 @@ static noinstr void ct_kernel_enter_state(int offset)
 	 * critical section.
 	 */
 	seq = ct_state_inc(offset);
+	ct_work_flush(seq);
 	// RCU is now watching.  Better not be in an extended quiescent state!
 	rcu_task_trace_heavyweight_exit();  // After CT state update!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & CT_RCU_WATCHING));
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 7c6a52f7836ce..1a0c027aad141 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE
 	  Say N otherwise, this option brings an overhead that you
 	  don't want in production.
 
+config CONTEXT_TRACKING_WORK
+	bool
+	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+	default y
+
 config NO_HZ
 	bool "Old Idle dynticks config"
 	help
-- 
2.51.0


* [PATCH v6 24/29] context_tracking,x86: Defer kernel text patching IPIs
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (22 preceding siblings ...)
  2025-10-10 15:38 ` [PATCH v6 23/29] context-tracking: Introduce work deferral infrastructure Valentin Schneider
@ 2025-10-10 15:38 ` Valentin Schneider
  2025-10-28 14:49   ` Frederic Weisbecker
  2025-10-10 15:38 ` [PATCH v6 25/29] x86/mm: Make INVPCID type macros available to assembly Valentin Schneider
                   ` (6 subsequent siblings)
  30 siblings, 1 reply; 54+ messages in thread
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Peter Zijlstra (Intel), Nicolas Saenz Julienne, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik

text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
them vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.

As the synchronization IPIs are sent using a blocking call, returning from
text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is deferred.
In other words, to safely defer this synchronization, any kernel
instruction leading to the execution of the deferred instruction
sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.

This means we must pay attention to mutable instructions in the early entry
code:
- alternatives
- static keys
- static calls
- all sorts of probes (kprobes/ftrace/bpf/???)

The early entry code leading to ct_work_flush() is noinstr, which gets rid
of the probes.

Alternatives are safe, because it's boot-time patching (before SMP is
even brought up) which is before any IPI deferral can happen.

This leaves us with static keys and static calls.

Any static key used in early entry code should only ever be enabled at
boot time, IOW be __ro_after_init (pretty much like alternatives). Exceptions
are explicitly marked as allowed in .noinstr and will always generate an
IPI when flipped.

The same applies to static calls - they should be only updated at boot
time, or manually marked as an exception.

Objtool is now able to point at static keys/calls that don't respect this,
and all static keys/calls used in early entry code have now been verified
as behaving appropriately.

Leverage the new context_tracking infrastructure to defer sync_core() IPIs
to a target CPU's next kernel entry.
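
For illustration, the resulting flow (a call-chain sketch assembled from the
hunks below and the previous patch):

  // Housekeeping CPU:
  smp_text_poke_batch_finish()
    __smp_text_poke_sync_each_cpu(do_sync_core_defer_cond)
      ct_set_cpu_work(cpu, CT_WORK_SYNC)  // no IPI if @cpu is in userspace

  // Isolated CPU, at its next user->kernel transition:
  ct_kernel_enter_state()
    ct_work_flush(seq)                    // sees CT_WORK_SYNC -> sync_core()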

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  6 ++-
 arch/x86/include/asm/text-patching.h         |  1 +
 arch/x86/kernel/alternative.c                | 39 +++++++++++++++++---
 arch/x86/kernel/kprobes/core.c               |  4 +-
 arch/x86/kernel/kprobes/opt.c                |  4 +-
 arch/x86/kernel/module.c                     |  2 +-
 include/asm-generic/sections.h               | 15 ++++++++
 include/linux/context_tracking_work.h        |  4 +-
 8 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 5f3b2d0977235..485b32881fde5 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -2,11 +2,13 @@
 #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
+#include <asm/sync_core.h>
+
 static __always_inline void arch_context_tracking_work(enum ct_work work)
 {
 	switch (work) {
-	case CT_WORK_n:
-		// Do work...
+	case CT_WORK_SYNC:
+		sync_core();
 		break;
 	case CT_WORK_MAX:
 		WARN_ON_ONCE(true);
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 5337f1be18f6e..a33541ab210db 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -33,6 +33,7 @@ extern void text_poke_apply_relocation(u8 *buf, const u8 * const instr, size_t i
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void smp_text_poke_sync_each_cpu(void);
+extern void smp_text_poke_sync_each_cpu_deferrable(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
 #define text_poke_copy text_poke_copy
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 7bde68247b5fc..07c91f0a30eaf 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -6,6 +6,7 @@
 #include <linux/vmalloc.h>
 #include <linux/memory.h>
 #include <linux/execmem.h>
+#include <linux/context_tracking.h>
 
 #include <asm/text-patching.h>
 #include <asm/insn.h>
@@ -2648,9 +2649,24 @@ static void do_sync_core(void *info)
 	sync_core();
 }
 
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CT_WORK_SYNC);
+}
+
+static void __smp_text_poke_sync_each_cpu(smp_cond_func_t cond_func)
+{
+	on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
 void smp_text_poke_sync_each_cpu(void)
 {
-	on_each_cpu(do_sync_core, NULL, 1);
+	__smp_text_poke_sync_each_cpu(NULL);
+}
+
+void smp_text_poke_sync_each_cpu_deferrable(void)
+{
+	__smp_text_poke_sync_each_cpu(do_sync_core_defer_cond);
 }
 
 /*
@@ -2820,6 +2836,7 @@ noinstr int smp_text_poke_int3_handler(struct pt_regs *regs)
  */
 void smp_text_poke_batch_finish(void)
 {
+	smp_cond_func_t cond = do_sync_core_defer_cond;
 	unsigned char int3 = INT3_INSN_OPCODE;
 	unsigned int i;
 	int do_sync;
@@ -2856,11 +2873,21 @@ void smp_text_poke_batch_finish(void)
 	 * First step: add a INT3 trap to the address that will be patched.
 	 */
 	for (i = 0; i < text_poke_array.nr_entries; i++) {
-		text_poke_array.vec[i].old = *(u8 *)text_poke_addr(&text_poke_array.vec[i]);
-		text_poke(text_poke_addr(&text_poke_array.vec[i]), &int3, INT3_INSN_SIZE);
+		void *addr = text_poke_addr(&text_poke_array.vec[i]);
+
+		/*
+		 * There's no safe way to defer IPIs for patching text in
+		 * .noinstr; record whether there is at least one such poke.
+		 */
+		if (is_kernel_noinstr_text((unsigned long)addr) ||
+		    is_module_noinstr_text_address((unsigned long)addr))
+			cond = NULL;
+
+		text_poke_array.vec[i].old = *((u8 *)addr);
+		text_poke(addr, &int3, INT3_INSN_SIZE);
 	}
 
-	smp_text_poke_sync_each_cpu();
+	__smp_text_poke_sync_each_cpu(cond);
 
 	/*
 	 * Second step: update all but the first byte of the patched range.
@@ -2922,7 +2949,7 @@ void smp_text_poke_batch_finish(void)
 		 * not necessary and we'd be safe even without it. But
 		 * better safe than sorry (plus there's not only Intel).
 		 */
-		smp_text_poke_sync_each_cpu();
+		__smp_text_poke_sync_each_cpu(cond);
 	}
 
 	/*
@@ -2943,7 +2970,7 @@ void smp_text_poke_batch_finish(void)
 	}
 
 	if (do_sync)
-		smp_text_poke_sync_each_cpu();
+		__smp_text_poke_sync_each_cpu(cond);
 
 	/*
 	 * Remove and wait for refs to be zero.
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 6079d15dab8ca..cce08add9aa0e 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -790,7 +790,7 @@ void arch_arm_kprobe(struct kprobe *p)
 	u8 int3 = INT3_INSN_OPCODE;
 
 	text_poke(p->addr, &int3, 1);
-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();
 	perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
 }
 
@@ -800,7 +800,7 @@ void arch_disarm_kprobe(struct kprobe *p)
 
 	perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
 	text_poke(p->addr, &p->opcode, 1);
-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();
 }
 
 void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 0aabd4c4e2c4f..eada8dca1c2e8 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -513,11 +513,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
 	       JMP32_INSN_SIZE - INT3_INSN_SIZE);
 
 	text_poke(addr, new, INT3_INSN_SIZE);
-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();
 	text_poke(addr + INT3_INSN_SIZE,
 		  new + INT3_INSN_SIZE,
 		  JMP32_INSN_SIZE - INT3_INSN_SIZE);
-	smp_text_poke_sync_each_cpu();
+	smp_text_poke_sync_each_cpu_deferrable();
 
 	perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
 }
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 0ffbae902e2fe..c6c4f391eb465 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -206,7 +206,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
 				   write, apply);
 
 	if (!early) {
-		smp_text_poke_sync_each_cpu();
+		smp_text_poke_sync_each_cpu_deferrable();
 		mutex_unlock(&text_mutex);
 	}
 
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index 0755bc39b0d80..7d2403014010e 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -199,6 +199,21 @@ static inline bool is_kernel_inittext(unsigned long addr)
 	       addr < (unsigned long)_einittext;
 }
 
+
+/**
+ * is_kernel_noinstr_text - checks if the pointer address is located in the
+ *                    .noinstr section
+ *
+ * @addr: address to check
+ *
+ * Returns: true if the address is located in .noinstr, false otherwise.
+ */
+static inline bool is_kernel_noinstr_text(unsigned long addr)
+{
+	return addr >= (unsigned long)__noinstr_text_start &&
+	       addr < (unsigned long)__noinstr_text_end;
+}
+
 /**
  * __is_kernel_text - checks if the pointer address is located in the
  *                    .text section
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index c68245f8d77c5..2facc621be067 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -5,12 +5,12 @@
 #include <linux/bitops.h>
 
 enum {
-	CT_WORK_n_OFFSET,
+	CT_WORK_SYNC_OFFSET,
 	CT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
-	CT_WORK_n        = BIT(CT_WORK_n_OFFSET),
+	CT_WORK_SYNC     = BIT(CT_WORK_SYNC_OFFSET),
 	CT_WORK_MAX      = BIT(CT_WORK_MAX_OFFSET)
 };
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v6 25/29] x86/mm: Make INVPCID type macros available to assembly
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (23 preceding siblings ...)
  2025-10-10 15:38 ` [PATCH v6 24/29] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
@ 2025-10-10 15:38 ` Valentin Schneider
  2025-10-10 15:38 ` [RFC PATCH v6 26/29] x86/mm/pti: Introduce a kernel/user CR3 software signal Valentin Schneider
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 54+ messages in thread
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

A later commit will introduce a pure-assembly INVPCID invocation, so allow
assembly files to get at the type definitions.
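
The usual pattern (sketched below, mirroring the hunks that follow) is to
keep the bare macros outside the guard so .S files can #include the header,
and hide the C-only inline helpers:

  #define INVPCID_TYPE_SINGLE_CTXT	1	/* visible to .S files */

  #ifndef __ASSEMBLER__
  static inline void invpcid_flush_single_context(unsigned long pcid)
  {
          __invpcid(pcid, 0, INVPCID_TYPE_SINGLE_CTXT);
  }
  #endif /* !__ASSEMBLER__ */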

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/invpcid.h | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/invpcid.h b/arch/x86/include/asm/invpcid.h
index 734482afbf81d..27ae75c2d7fed 100644
--- a/arch/x86/include/asm/invpcid.h
+++ b/arch/x86/include/asm/invpcid.h
@@ -2,6 +2,13 @@
 #ifndef _ASM_X86_INVPCID
 #define _ASM_X86_INVPCID
 
+#define INVPCID_TYPE_INDIV_ADDR		0
+#define INVPCID_TYPE_SINGLE_CTXT	1
+#define INVPCID_TYPE_ALL_INCL_GLOBAL	2
+#define INVPCID_TYPE_ALL_NON_GLOBAL	3
+
+#ifndef __ASSEMBLER__
+
 static inline void __invpcid(unsigned long pcid, unsigned long addr,
 			     unsigned long type)
 {
@@ -17,11 +24,6 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr,
 		     :: [desc] "m" (desc), [type] "r" (type) : "memory");
 }
 
-#define INVPCID_TYPE_INDIV_ADDR		0
-#define INVPCID_TYPE_SINGLE_CTXT	1
-#define INVPCID_TYPE_ALL_INCL_GLOBAL	2
-#define INVPCID_TYPE_ALL_NON_GLOBAL	3
-
 /* Flush all mappings for a given pcid and addr, not including globals. */
 static inline void invpcid_flush_one(unsigned long pcid,
 				     unsigned long addr)
@@ -47,4 +49,6 @@ static inline void invpcid_flush_all_nonglobals(void)
 	__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
 }
 
+#endif /* __ASSEMBLER__ */
+
 #endif /* _ASM_X86_INVPCID */
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v6 26/29] x86/mm/pti: Introduce a kernel/user CR3 software signal
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (24 preceding siblings ...)
  2025-10-10 15:38 ` [PATCH v6 25/29] x86/mm: Make INVPCID type macros available to assembly Valentin Schneider
@ 2025-10-10 15:38 ` Valentin Schneider
  2025-10-10 15:38 ` [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3 Valentin Schneider
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 54+ messages in thread
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

Later commits will rely on this information to defer kernel TLB flush
IPIs. Update it when switching to and from the kernel CR3.

This will only really be useful for NOHZ_FULL CPUs, but it should be
cheaper to unconditionally update a per-CPU variable that lives in its own
cacheline (and is never read on housekeeping CPUs) than to check a shared
cpumask such as
  housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
at every entry.

Note that the COALESCE_TLBI config option is introduced in a later commit,
when the whole feature is implemented.
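
For illustration, the eventual consumer of this signal (added in a later
patch of this series) boils down to:

  /* IPI housekeeping CPUs, and CPUs that currently run with kernel CR3 */
  static bool flush_tlb_kernel_cond(int cpu, void *info)
  {
          return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
                 per_cpu(kernel_cr3_loaded, cpu);
  }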

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
Per the cover letter, I really hate this, but couldn't come up with
anything better.
---
 arch/x86/entry/calling.h        | 16 ++++++++++++++++
 arch/x86/entry/syscall_64.c     |  4 ++++
 arch/x86/include/asm/tlbflush.h |  3 +++
 3 files changed, 23 insertions(+)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 94519688b0071..813451b1ddecc 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -171,11 +171,24 @@ For 32-bit we have the following conventions - kernel is built with
	andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
 .endm

+.macro COALESCE_TLBI
+#ifdef CONFIG_COALESCE_TLBI
+	movl     $1, PER_CPU_VAR(kernel_cr3_loaded)
+#endif // CONFIG_COALESCE_TLBI
+.endm
+
+.macro NOTE_SWITCH_TO_USER_CR3
+#ifdef CONFIG_COALESCE_TLBI
+	movl     $0, PER_CPU_VAR(kernel_cr3_loaded)
+#endif // CONFIG_COALESCE_TLBI
+.endm
+
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
	mov	%cr3, \scratch_reg
	ADJUST_KERNEL_CR3 \scratch_reg
	mov	\scratch_reg, %cr3
+	COALESCE_TLBI
 .Lend_\@:
 .endm

@@ -183,6 +196,7 @@ For 32-bit we have the following conventions - kernel is built with
	PER_CPU_VAR(cpu_tlbstate + TLB_STATE_user_pcid_flush_mask)

 .macro SWITCH_TO_USER_CR3 scratch_reg:req scratch_reg2:req
+	NOTE_SWITCH_TO_USER_CR3
	mov	%cr3, \scratch_reg

	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
@@ -242,6 +256,7 @@ For 32-bit we have the following conventions - kernel is built with

	ADJUST_KERNEL_CR3 \scratch_reg
	movq	\scratch_reg, %cr3
+	COALESCE_TLBI

 .Ldone_\@:
 .endm
@@ -258,6 +273,7 @@ For 32-bit we have the following conventions - kernel is built with
	bt	$PTI_USER_PGTABLE_BIT, \save_reg
	jnc	.Lend_\@

+	NOTE_SWITCH_TO_USER_CR3
	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID

	/*
diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index b6e68ea98b839..2589d232e0ba1 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -83,6 +83,10 @@ static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
	return false;
 }

+#ifdef CONFIG_COALESCE_TLBI
+DEFINE_PER_CPU(bool, kernel_cr3_loaded) = true;
+#endif
+
 /* Returns true to return using SYSRET, or false to use IRET */
 __visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
 {
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 00daedfefc1b0..e39ae95b85072 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -17,6 +17,9 @@
 #include <asm/pgtable.h>

 DECLARE_PER_CPU(u64, tlbstate_untag_mask);
+#ifdef CONFIG_COALESCE_TLBI
+DECLARE_PER_CPU(bool, kernel_cr3_loaded);
+#endif

 void __flush_tlb_all(void);

--
2.51.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (25 preceding siblings ...)
  2025-10-10 15:38 ` [RFC PATCH v6 26/29] x86/mm/pti: Introduce a kernel/user CR3 software signal Valentin Schneider
@ 2025-10-10 15:38 ` Valentin Schneider
  2025-10-28 15:59   ` Frederic Weisbecker
  2025-10-10 15:38 ` [RFC PATCH v6 28/29] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y Valentin Schneider
                   ` (3 subsequent siblings)
  30 siblings, 1 reply; 54+ messages in thread
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

Deferring kernel range TLB flushes requires the guarantee that upon
entering the kernel, no stale entry may be accessed. The simplest way to
provide such a guarantee is to issue an unconditional flush upon switching
to the kernel CR3, as this is the pivoting point where such stale entries
may be accessed.

As this is only relevant to NOHZ_FULL, restrict the mechanism to NOHZ_FULL
CPUs.

Note that the COALESCE_TLBI config option is introduced in a later commit,
when the whole feature is implemented.
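
In C terms, the assembly added below amounts to roughly the following
(illustrative sketch only; toggle_cr4_pge() is a made-up stand-in for the
two back-to-back CR4 writes):

  /* Pseudo-C for the COALESCE_TLBI asm macro */
  if (tick_nohz_full_cpu(smp_processor_id())) {
          if (cpu_feature_enabled(X86_FEATURE_INVPCID))
                  invpcid_flush_all();    /* flush everything, incl. globals */
          else
                  toggle_cr4_pge();       /* hypothetical: CR4.PGE off, then on */
  }
  this_cpu_write(kernel_cr3_loaded, true);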

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/entry/calling.h      | 26 +++++++++++++++++++++++---
 arch/x86/kernel/asm-offsets.c |  1 +
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 813451b1ddecc..19fb6de276eac 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -9,6 +9,7 @@
 #include <asm/ptrace-abi.h>
 #include <asm/msr.h>
 #include <asm/nospec-branch.h>
+#include <asm/invpcid.h>

 /*

@@ -171,8 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
	andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
 .endm

-.macro COALESCE_TLBI
+.macro COALESCE_TLBI scratch_reg:req
 #ifdef CONFIG_COALESCE_TLBI
+	/* No point in doing this for housekeeping CPUs */
+	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
+	bt	\scratch_reg, tick_nohz_full_mask(%rip)
+	jnc	.Lend_tlbi_\@
+
+	ALTERNATIVE "jmp .Lcr4_\@", "", X86_FEATURE_INVPCID
+	movq $(INVPCID_TYPE_ALL_INCL_GLOBAL), \scratch_reg
+	/* descriptor is all zeroes, point at the zero page */
+	invpcid empty_zero_page(%rip), \scratch_reg
+	jmp .Lend_tlbi_\@
+.Lcr4_\@:
+	/* Note: this gives CR4 pinning the finger */
+	movq PER_CPU_VAR(cpu_tlbstate + TLB_STATE_cr4), \scratch_reg
+	xorq $(X86_CR4_PGE), \scratch_reg
+	movq \scratch_reg, %cr4
+	xorq $(X86_CR4_PGE), \scratch_reg
+	movq \scratch_reg, %cr4
+
+.Lend_tlbi_\@:
	movl     $1, PER_CPU_VAR(kernel_cr3_loaded)
 #endif // CONFIG_COALESCE_TLBI
 .endm
@@ -188,7 +208,7 @@ For 32-bit we have the following conventions - kernel is built with
	mov	%cr3, \scratch_reg
	ADJUST_KERNEL_CR3 \scratch_reg
	mov	\scratch_reg, %cr3
-	COALESCE_TLBI
+	COALESCE_TLBI \scratch_reg
 .Lend_\@:
 .endm

@@ -256,7 +276,7 @@ For 32-bit we have the following conventions - kernel is built with

	ADJUST_KERNEL_CR3 \scratch_reg
	movq	\scratch_reg, %cr3
-	COALESCE_TLBI
+	COALESCE_TLBI \scratch_reg

 .Ldone_\@:
 .endm
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 6259b474073bc..f5abdcbb150d9 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -105,6 +105,7 @@ static void __used common(void)

	/* TLB state for the entry code */
	OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
+	OFFSET(TLB_STATE_cr4, tlb_state, cr4);

	/* Layout info for cpu_entry_area */
	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
--
2.51.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v6 28/29] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (26 preceding siblings ...)
  2025-10-10 15:38 ` [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3 Valentin Schneider
@ 2025-10-10 15:38 ` Valentin Schneider
  2025-10-10 15:38 ` [RFC PATCH v6 29/29] x86/entry: Add an option to coalesce TLB flushes Valentin Schneider
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 54+ messages in thread
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

Previous commits have added an unconditional TLB flush right after
switching to the kernel CR3 on NOHZ_FULL CPUs, and a software signal to
determine whether a CPU has its kernel CR3 loaded.

Using these two components, we can now safely defer kernel TLB flush IPIs
targeting NOHZ_FULL CPUs executing in userspace (i.e. with the user CR3
loaded).

Note that the COALESCE_TLBI config option is introduced in a later commit,
when the whole feature is implemented.
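
The conversion of eligible call sites is mechanical; e.g. vunmap_range()
ends up as below once the vmalloc hunks of this patch are applied:

  void vunmap_range(unsigned long addr, unsigned long end)
  {
          flush_cache_vunmap(addr, end);
          vunmap_range_noflush(addr, end);
          /* May skip user-CR3 NOHZ_FULL CPUs; they flush at kernel entry */
          flush_tlb_kernel_range_deferrable(addr, end);
  }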

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/tlbflush.h |  3 +++
 arch/x86/mm/tlb.c               | 34 ++++++++++++++++++++++++++-------
 mm/vmalloc.c                    | 34 ++++++++++++++++++++++++++++-----
 3 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e39ae95b85072..6d533afd70952 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -321,6 +321,9 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
				unsigned long end, unsigned int stride_shift,
				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+#ifdef CONFIG_COALESCE_TLBI
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
+#endif

 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f17..aa3a83d5eccc2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
 #include <linux/task_work.h>
 #include <linux/mmu_notifier.h>
 #include <linux/mmu_context.h>
+#include <linux/sched/isolation.h>

 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1509,23 +1510,24 @@ static void do_kernel_range_flush(void *info)
		flush_tlb_one_kernel(addr);
 }

-static void kernel_tlb_flush_all(struct flush_tlb_info *info)
+static void kernel_tlb_flush_all(smp_cond_func_t cond, struct flush_tlb_info *info)
 {
	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
		invlpgb_flush_all();
	else
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		on_each_cpu_cond(cond, do_flush_tlb_all, NULL, 1);
 }

-static void kernel_tlb_flush_range(struct flush_tlb_info *info)
+static void kernel_tlb_flush_range(smp_cond_func_t cond, struct flush_tlb_info *info)
 {
	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
		invlpgb_kernel_range_flush(info);
	else
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		on_each_cpu_cond(cond, do_kernel_range_flush, info, 1);
 }

-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond, unsigned long start, unsigned long end)
 {
	struct flush_tlb_info *info;

@@ -1535,13 +1537,31 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
				  TLB_GENERATION_INVALID);

	if (info->end == TLB_FLUSH_ALL)
-		kernel_tlb_flush_all(info);
+		kernel_tlb_flush_all(cond, info);
	else
-		kernel_tlb_flush_range(info);
+		kernel_tlb_flush_range(cond, info);

	put_flush_tlb_info();
 }

+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(NULL, start, end);
+}
+
+#ifdef CONFIG_COALESCE_TLBI
+static bool flush_tlb_kernel_cond(int cpu, void *info)
+{
+	return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
+	       per_cpu(kernel_cr3_loaded, cpu);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(flush_tlb_kernel_cond, start, end);
+}
+#endif
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5edd536ba9d2a..c42f413a7a693 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -494,6 +494,30 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
	__vunmap_range_noflush(start, end);
 }

+#ifdef CONFIG_COALESCE_TLBI
+/*
+ * !!! BIG FAT WARNING !!!
+ *
+ * The CPU is free to cache any part of the paging hierarchy it wants at any
+ * time. It's also free to set accessed and dirty bits at any time, even for
+ * instructions that may never execute architecturally.
+ *
+ * This means that deferring a TLB flush affecting freed page-table-pages (IOW,
+ * keeping them in a CPU's paging hierarchy cache) is a recipe for disaster.
+ *
+ * This isn't a problem for deferral of TLB flushes in vmalloc, because
+ * page-table-pages used for vmap() mappings are never freed - see how
+ * __vunmap_range_noflush() walks the whole mapping but only clears the leaf PTEs.
+ * If this ever changes, TLB flush deferral will cause misery.
+ */
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+#else
+#define flush_tlb_kernel_range_deferrable(start, end) flush_tlb_kernel_range(start, end)
+#endif
+
 /**
  * vunmap_range - unmap kernel virtual addresses
  * @addr: start of the VM area to unmap
@@ -507,7 +531,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
 {
	flush_cache_vunmap(addr, end);
	vunmap_range_noflush(addr, end);
-	flush_tlb_kernel_range(addr, end);
+	flush_tlb_kernel_range_deferrable(addr, end);
 }

 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -2333,7 +2357,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,

	nr_purge_nodes = cpumask_weight(&purge_nodes);
	if (nr_purge_nodes > 0) {
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);

		/* One extra worker is per a lazy_max_pages() full set minus one. */
		nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
@@ -2436,7 +2460,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
	flush_cache_vunmap(va->va_start, va->va_end);
	vunmap_range_noflush(va->va_start, va->va_end);
	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(va->va_start, va->va_end);
+		flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);

	free_vmap_area_noflush(va);
 }
@@ -2884,7 +2908,7 @@ static void vb_free(unsigned long addr, unsigned long size)
	vunmap_range_noflush(addr, addr + size);

	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(addr, addr + size);
+		flush_tlb_kernel_range_deferrable(addr, addr + size);

	spin_lock(&vb->lock);

@@ -2949,7 +2973,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
	free_purged_blocks(&purge_list);

	if (!__purge_vmap_area_lazy(start, end, false) && flush)
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
	mutex_unlock(&vmap_purge_lock);
 }

--
2.51.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v6 29/29] x86/entry: Add an option to coalesce TLB flushes
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (27 preceding siblings ...)
  2025-10-10 15:38 ` [RFC PATCH v6 28/29] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y Valentin Schneider
@ 2025-10-10 15:38 ` Valentin Schneider
  2025-10-14 12:58 ` [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Juri Lelli
  2025-10-28 16:25 ` Frederic Weisbecker
  30 siblings, 0 replies; 54+ messages in thread
From: Valentin Schneider @ 2025-10-10 15:38 UTC (permalink / raw)
  To: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

Previous patches have introduced a mechanism to prevent kernel text updates
from inducing interference on isolated CPUs. A similar action is required
for kernel-range TLB flushes in order to silence the biggest remaining
cause of isolated CPU IPI interference.

These flushes are mostly caused by vmalloc manipulations - e.g. on x86 with
CONFIG_VMAP_STACK, spawning enough processes will easily trigger
flushes. Unfortunately, the newly added context_tracking IPI deferral
mechanism cannot be leveraged for TLB flushes, as the deferred work would
be executed too late. Consider the following execution flow:

  <userspace>

  !interrupt!

  SWITCH_TO_KERNEL_CR3 // vmalloc range becomes accessible

  idtentry_func_foo()
    irqentry_enter()
      irqentry_enter_from_user_mode()
	enter_from_user_mode()
	  [...]
	    ct_kernel_enter_state()
	      ct_work_flush() // deferred flush would be done here

Since there is no sane way to assert no stale entry is accessed during
kernel entry, any code executed between SWITCH_TO_KERNEL_CR3 and
ct_work_flush() is at risk of accessing a stale entry. Dave had suggested
hacking up something within SWITCH_TO_KERNEL_CR3 itself, which is what has
been implemented in the previous patches.

Make kernel-range TLB flush deferral available via CONFIG_COALESCE_TLBI.
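
As a sketch, a configuration exercising this could look like the following
(CPU layout assumed; only the nohz_full parameter shown for brevity):

  # .config fragment
  CONFIG_NO_HZ_FULL=y
  CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y
  CONFIG_COALESCE_TLBI=y

  # kernel command line: CPU0 housekeeping, CPUs 1-7 isolated
  nohz_full=1-7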

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/Kconfig | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3f1557b7acd8f..390e1dbe5d4de 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2188,6 +2188,23 @@ config ADDRESS_MASKING
	  The capability can be used for efficient address sanitizers (ASAN)
	  implementation and for optimizations in JITs.

+config COALESCE_TLBI
+       def_bool n
+       prompt "Coalesce kernel TLB flushes for NOHZ-full CPUs"
+       depends on X86_64 && MITIGATION_PAGE_TABLE_ISOLATION && NO_HZ_FULL
+       help
+	 TLB flushes for kernel addresses can lead to IPIs being sent to
+	 NOHZ-full CPUs, thus kicking them out of userspace.
+
+	 This option coalesces kernel-range TLB flushes for NOHZ-full CPUs into
+	 a single flush executed at kernel entry, right after switching to the
+	 kernel page table. Note that this flush is unconditional, even if no
+	 remote flush was issued during the previous userspace execution window.
+
+	 This obviously makes the user->kernel transition overhead even worse.
+
+	 If unsure, say N.
+
 config HOTPLUG_CPU
	def_bool y
	depends on SMP
--
2.51.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 19/29] KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as allowed in .noinstr
  2025-10-10 15:38 ` [PATCH v6 19/29] KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys " Valentin Schneider
@ 2025-10-14  0:01   ` Sean Christopherson
  2025-10-14 11:02     ` Valentin Schneider
  0 siblings, 1 reply; 54+ messages in thread
From: Sean Christopherson @ 2025-10-14  0:01 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Josh Poimboeuf,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On Fri, Oct 10, 2025, Valentin Schneider wrote:
> Later commits will cause objtool to warn about static keys being used in
> .noinstr sections in order to safely defer instruction patching IPIs
> targeted at NOHZ_FULL CPUs.
> 
> These keys are used in .noinstr code, and can be modified at runtime
> (/proc/kernel/vmx* write). However it is not expected that they will be
> flipped during latency-sensitive operations, and thus shouldn't be a source
> of interference wrt the text patching IPI.
>
> Mark it to let objtool know not to warn about it.

Can you elaborate in the changelog on what will happen if the key is toggled?
IIUC, smp_text_poke_batch_finish() will force IPIs if noinstr code is being
patched.  Even just a small footnote like this:

  Note, smp_text_poke_batch_finish() never defers IPIs if noinstr code is
  being patched, i.e. this is purely about silencing objtool warnings.

to make it clear that there's no bug/race being introduced.

> Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index aa157fe5b7b31..dce2bd7375ec8 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -204,8 +204,15 @@ module_param(pt_mode, int, S_IRUGO);
>  
>  struct x86_pmu_lbr __ro_after_init vmx_lbr_caps;
>  
> -static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
> -static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
> +/*
> + * Both of these static keys end up being used in .noinstr sections, however
> + * they are only modified:
> + * - at init
> + * - from a /proc/kernel/vmx* write
> + * thus during latency-sensitive operations they should remain stable.
> + */
> +static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_should_flush);
> +static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_flush_cond);
>  static DEFINE_MUTEX(vmx_l1d_flush_mutex);
>  
>  /* Storage for pre module init parameter parsing */
> -- 
> 2.51.0
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 16/29] KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
  2025-10-10 15:38 ` [PATCH v6 16/29] KVM: VMX: Mark __kvm_is_using_evmcs static " Valentin Schneider
@ 2025-10-14  0:02   ` Sean Christopherson
  2025-10-14 11:20     ` Valentin Schneider
  0 siblings, 1 reply; 54+ messages in thread
From: Sean Christopherson @ 2025-10-14  0:02 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik

On Fri, Oct 10, 2025, Valentin Schneider wrote:
> The static key is only ever enabled in
> 
>   __init hv_init_evmcs()
> 
> so mark it appropriately as __ro_after_init.
> 
> Reported-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---

Acked-by: Sean Christopherson <seanjc@google.com>

Holler if you want me to grab this for 6.19.  I assume the plan is to try and
take the whole series through tip?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 19/29] KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as allowed in .noinstr
  2025-10-14  0:01   ` Sean Christopherson
@ 2025-10-14 11:02     ` Valentin Schneider
  2025-10-14 19:06       ` Sean Christopherson
  0 siblings, 1 reply; 54+ messages in thread
From: Valentin Schneider @ 2025-10-14 11:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Josh Poimboeuf,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On 13/10/25 17:01, Sean Christopherson wrote:
> On Fri, Oct 10, 2025, Valentin Schneider wrote:
>> Later commits will cause objtool to warn about static keys being used in
>> .noinstr sections in order to safely defer instruction patching IPIs
>> targeted at NOHZ_FULL CPUs.
>>
>> These keys are used in .noinstr code, and can be modified at runtime
>> (/proc/kernel/vmx* write). However it is not expected that they will be
>> flipped during latency-sensitive operations, and thus shouldn't be a source
>> of interference wrt the text patching IPI.
>>
>> Mark it to let objtool know not to warn about it.
>
> Can you elaborate in the changelog on what will happen if the key is toggled?
> IIUC, smp_text_poke_batch_finish() will force IPIs if noinstr code is being
> patched.

Right!

> Even just a small footnote like this:
>
>   Note, smp_text_poke_batch_finish() never defers IPIs if noinstr code is
>   being patched, i.e. this is purely about silencing objtool warnings.
>
> to make it clear that there's no bug/race being introduced.

Good point. How about:

"""
Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

The VMX keys are used in .noinstr code, and can be modified at runtime
(/proc/kernel/vmx* write). However it is not expected that they will be
flipped during latency-sensitive operations, and thus shouldn't be a source
of interference for NOHZ_FULL CPUs wrt the text patching IPI.

Note, smp_text_poke_batch_finish() never defers IPIs if noinstr code is
being patched, i.e. this is purely to tell objtool we're okay with updates
to that key causing IPIs and to silence the associated objtool warning.
"""


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 16/29] KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
  2025-10-14  0:02   ` Sean Christopherson
@ 2025-10-14 11:20     ` Valentin Schneider
  0 siblings, 0 replies; 54+ messages in thread
From: Valentin Schneider @ 2025-10-14 11:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik

On 13/10/25 17:02, Sean Christopherson wrote:
> On Fri, Oct 10, 2025, Valentin Schneider wrote:
>> The static key is only ever enabled in
>>
>>   __init hv_init_evmcs()
>>
>> so mark it appropriately as __ro_after_init.
>>
>> Reported-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> ---
>
> Acked-by: Sean Christopherson <seanjc@google.com>
>
> Holler if you want me to grab this for 6.19.  I assume the plan is to try and
> take the whole series through tip?

Thanks! At the very least getting all the __ro_after_init patches in would
be good since they're standalone; I'll wait a bit to see how this goes :)


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (28 preceding siblings ...)
  2025-10-10 15:38 ` [RFC PATCH v6 29/29] x86/entry: Add an option to coalesce TLB flushes Valentin Schneider
@ 2025-10-14 12:58 ` Juri Lelli
  2025-10-14 15:26   ` Valentin Schneider
  2025-10-28 16:25 ` Frederic Weisbecker
  30 siblings, 1 reply; 54+ messages in thread
From: Juri Lelli @ 2025-10-14 12:58 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik

Hello,

On 10/10/25 17:38, Valentin Schneider wrote:

...

> Performance
> +++++++++++
> 
> Tested by measuring the duration of 10M `syscall(SYS_getpid)` calls on
> NOHZ_FULL CPUs, with rteval (hackbench + kernel compilation) running on the
> housekeeping CPUs:
> 
> o Xeon E5-2699:   base avg 770ns,  patched avg 1340ns (74% increase)
> o Xeon E7-8890:   base avg 1040ns, patched avg 1320ns (27% increase)
> o Xeon Gold 6248: base avg 270ns,  patched avg 273ns  (1.1% increase)
> 
> I don't get that last one; I did spend a ridiculous amount of time making sure
> the flush was being executed, and AFAICT yes, it was. What I take out of this is
> that it can be a pretty massive increase in the entry overhead (for NOHZ_FULL
> CPUs), and that's something I want to hear thoughts on.
> 
> Noise
> +++++
> 
> Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
> RHEL10 userspace.
> 
> Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
> and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
> 
> $ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
> 	           -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
> 		   rteval --onlyload --loads-cpulist=$HK_CPUS \
> 		   --hackbench-runlowmem=True --duration=$DURATION
> 
> This only records IPIs sent to isolated CPUs, so any event there is interference
> (with a bit of fuzz at the start/end of the workload when spawning the
> processes). All tests were done with a duration of 6 hours.
> 
> v6.17
> o ~5400 IPIs received, so about ~200 interfering IPI per isolated CPU
> o About one interfering IPI just shy of every 2 minutes
> 
> v6.17 + patches
> o Zilch!

Nice. :)

About performance, can we assume housekeeping CPUs are not affected by
the change (they don't seem to use the trick anyway) or do we want/need
to collect some numbers on them as well just in case (maybe more
throughput oriented)?

Thanks,
Juri


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-10-14 12:58 ` [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Juri Lelli
@ 2025-10-14 15:26   ` Valentin Schneider
  2025-10-15 13:16     ` Valentin Schneider
  0 siblings, 1 reply; 54+ messages in thread
From: Valentin Schneider @ 2025-10-14 15:26 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik

On 14/10/25 14:58, Juri Lelli wrote:
>> Noise
>> +++++
>>
>> Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
>> RHEL10 userspace.
>>
>> Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
>> and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
>>
>> $ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
>>                 -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
>>                 rteval --onlyload --loads-cpulist=$HK_CPUS \
>>                 --hackbench-runlowmem=True --duration=$DURATION
>>
>> This only records IPIs sent to isolated CPUs, so any event there is interference
>> (with a bit of fuzz at the start/end of the workload when spawning the
>> processes). All tests were done with a duration of 6 hours.
>>
>> v6.17
>> o ~5400 IPIs received, so about ~200 interfering IPI per isolated CPU
>> o About one interfering IPI just shy of every 2 minutes
>>
>> v6.17 + patches
>> o Zilch!
>
> Nice. :)
>
> About performance, can we assume housekeeping CPUs are not affected by
> the change (they don't seem to use the trick anyway) or do we want/need
> to collect some numbers on them as well just in case (maybe more
> throughput oriented)?
>

So for the text_poke IPI yes, because this is all done through
context_tracking, which doesn't involve housekeeping CPUs.

For the TLB flush faff the HK CPUs get two extra writes per kernel entry
cycle (one at entry and one at exit, for that stupid signal) which I expect
to be noticeable but small-ish. I can definitely go and measure that.

> Thanks,
> Juri


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 19/29] KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as allowed in .noinstr
  2025-10-14 11:02     ` Valentin Schneider
@ 2025-10-14 19:06       ` Sean Christopherson
  0 siblings, 0 replies; 54+ messages in thread
From: Sean Christopherson @ 2025-10-14 19:06 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Josh Poimboeuf,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On Tue, Oct 14, 2025, Valentin Schneider wrote:
> On 13/10/25 17:01, Sean Christopherson wrote:
> > On Fri, Oct 10, 2025, Valentin Schneider wrote:
> >> Later commits will cause objtool to warn about static keys being used in
> >> .noinstr sections in order to safely defer instruction patching IPIs
> >> targeted at NOHZ_FULL CPUs.
> >>
> >> These keys are used in .noinstr code, and can be modified at runtime
> >> (/proc/kernel/vmx* write). However it is not expected that they will be
> >> flipped during latency-sensitive operations, and thus shouldn't be a source
> >> of interference wrt the text patching IPI.
> >>
> >> Mark it to let objtool know not to warn about it.
> >
> > Can you elaborate in the changelog on what will happen if the key is toggled?
> > IIUC, smp_text_poke_batch_finish() will force IPIs if noinstr code is being
> > patched.
> 
> Right!
> 
> > Even just a small footnote like this:
> >
> >   Note, smp_text_poke_batch_finish() never defers IPIs if noinstr code is
> >   being patched, i.e. this is purely about silencing objtool warnings.
> >
> > to make it clear that there's no bug/race being introduced.
> 
> Good point. How about:
> 
> """
> Later commits will cause objtool to warn about static keys being used in
> .noinstr sections in order to safely defer instruction patching IPIs
> targeted at NOHZ_FULL CPUs.
> 
> The VMX keys are used in .noinstr code, and can be modified at runtime
> (/proc/kernel/vmx* write). However it is not expected that they will be
> flipped during latency-sensitive operations, and thus shouldn't be a source
> of interference for NOHZ_FULL CPUs wrt the text patching IPI.
> 
> Note, smp_text_poke_batch_finish() never defers IPIs if noinstr code is
> being patched, i.e. this is purely to tell objtool we're okay with updates
> to that key causing IPIs and to silence the associated objtool warning.
> """

LGTM.  With the updated changelog,

Acked-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-10-14 15:26   ` Valentin Schneider
@ 2025-10-15 13:16     ` Valentin Schneider
  2025-10-15 14:28       ` Juri Lelli
  0 siblings, 1 reply; 54+ messages in thread
From: Valentin Schneider @ 2025-10-15 13:16 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik

On 14/10/25 17:26, Valentin Schneider wrote:
> On 14/10/25 14:58, Juri Lelli wrote:
>>> Noise
>>> +++++
>>>
>>> Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
>>> RHEL10 userspace.
>>>
>>> Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
>>> and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
>>>
>>> $ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
>>>                 -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
>>>                 rteval --onlyload --loads-cpulist=$HK_CPUS \
>>>                 --hackbench-runlowmem=True --duration=$DURATION
>>>
>>> This only records IPIs sent to isolated CPUs, so any event there is interference
>>> (with a bit of fuzz at the start/end of the workload when spawning the
>>> processes). All tests were done with a duration of 6 hours.
>>>
>>> v6.17
>>> o ~5400 IPIs received, so about ~200 interfering IPI per isolated CPU
>>> o About one interfering IPI just shy of every 2 minutes
>>>
>>> v6.17 + patches
>>> o Zilch!
>>
>> Nice. :)
>>
>> About performance, can we assume housekeeping CPUs are not affected by
>> the change (they don't seem to use the trick anyway) or do we want/need
>> to collect some numbers on them as well just in case (maybe more
>> throughput oriented)?
>>
>
> So for the text_poke IPI yes, because this is all done through
> context_tracking, which doesn't involve housekeeping CPUs.
>
> For the TLB flush faff the HK CPUs get two extra writes per kernel entry
> cycle (one at entry and one at exit, for that stupid signal) which I expect
> to be noticeable but small-ish. I can definitely go and measure that.
>

On that same Xeon E5-2699 system with the same tuning, the average time
taken for 300M gettid syscalls on housekeeping CPUs is
  v6.17:          698.64ns ± 2.35ns
  v6.17 + series: 702.60ns ± 3.43ns

So noticeable (~.6% worse) but not horrible?

>> Thanks,
>> Juri


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-10-15 13:16     ` Valentin Schneider
@ 2025-10-15 14:28       ` Juri Lelli
  0 siblings, 0 replies; 54+ messages in thread
From: Juri Lelli @ 2025-10-15 14:28 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
	Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov,
	Clark Williams, Yair Podemsky, Marcelo Tosatti, Daniel Wagner,
	Petr Tesarik

On 15/10/25 15:16, Valentin Schneider wrote:
> On 14/10/25 17:26, Valentin Schneider wrote:
> > On 14/10/25 14:58, Juri Lelli wrote:
> >>> Noise
> >>> +++++
> >>>
> >>> Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
> >>> RHEL10 userspace.
> >>>
> >>> Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
> >>> and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
> >>>
> >>> $ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
> >>>                 -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
> >>>                 rteval --onlyload --loads-cpulist=$HK_CPUS \
> >>>                 --hackbench-runlowmem=True --duration=$DURATION
> >>>
> >>> This only records IPIs sent to isolated CPUs, so any event there is interference
> >>> (with a bit of fuzz at the start/end of the workload when spawning the
> >>> processes). All tests were done with a duration of 6 hours.
> >>>
> >>> v6.17
> >>> o ~5400 IPIs received, so about ~200 interfering IPI per isolated CPU
> >>> o About one interfering IPI just shy of every 2 minutes
> >>>
> >>> v6.17 + patches
> >>> o Zilch!
> >>
> >> Nice. :)
> >>
> >> About performance, can we assume housekeeping CPUs are not affected by
> >> the change (they don't seem to use the trick anyway) or do we want/need
> >> to collect some numbers on them as well just in case (maybe more
> >> throughput oriented)?
> >>
> >
> > So for the text_poke IPI yes, because this is all done through
> > context_tracking, which doesn't involve housekeeping CPUs.
> >
> > For the TLB flush faff the HK CPUs get two extra writes per kernel entry
> > cycle (one at entry and one at exit, for that stupid signal) which I expect
> > to be noticeable but small-ish. I can definitely go and measure that.
> >
> 
> On that same Xeon E5-2699 system with the same tuning, the average time
> taken for 300M gettid syscalls on housekeeping CPUs is
>   v6.17:          698.64ns ± 2.35ns
>   v6.17 + series: 702.60ns ± 3.43ns
> 
> So noticeable (~.6% worse) but not horrible?

Yeah, seems reasonable.

Thanks for collecting numbers!


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 23/29] context-tracking: Introduce work deferral infrastructure
  2025-10-10 15:38 ` [PATCH v6 23/29] context-tracking: Introduce work deferral infrastructure Valentin Schneider
@ 2025-10-28 14:00   ` Frederic Weisbecker
  2025-10-29 10:09     ` Valentin Schneider
  0 siblings, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2025-10-28 14:00 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel,
	Nicolas Saenz Julienne, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Lutomirski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On Fri, Oct 10, 2025 at 05:38:33PM +0200, Valentin Schneider wrote:
> smp_call_function() & friends have the unfortunate habit of sending IPIs to
> isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online
> CPUs.
> 
> Some callsites can be bent into doing the right, such as done by commit:
> 
>   cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")
> 
> Unfortunately, not all SMP callbacks can be omitted in this
> fashion. However, some of them only affect execution in kernelspace, which
> means they don't have to be executed *immediately* if the target CPU is in
> userspace: stashing the callback and executing it upon the next kernel entry
> would suffice. x86 kernel instruction patching or kernel TLB invalidation
> are prime examples of it.
> 
> Reduce the RCU dynticks counter width to free up some bits to be used as a
> deferred callback bitmask. Add some build-time checks to validate that
> setup.
> 
> Presence of CT_RCU_WATCHING in the ct_state prevents queuing deferred work.
> 
> Later commits introduce the bit:callback mappings.
> 
> Link: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
> Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  arch/Kconfig                                 |  9 +++
>  arch/x86/Kconfig                             |  1 +
>  arch/x86/include/asm/context_tracking_work.h | 16 +++++
>  include/linux/context_tracking.h             | 21 ++++++
>  include/linux/context_tracking_state.h       | 30 ++++++---
>  include/linux/context_tracking_work.h        | 26 ++++++++
>  kernel/context_tracking.c                    | 69 +++++++++++++++++++-
>  kernel/time/Kconfig                          |  5 ++
>  8 files changed, 165 insertions(+), 12 deletions(-)
>  create mode 100644 arch/x86/include/asm/context_tracking_work.h
>  create mode 100644 include/linux/context_tracking_work.h
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index d1b4ffd6e0856..a33229e017467 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -968,6 +968,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK
>  	  - No use of instrumentation, unless instrumentation_begin() got
>  	    called.
>  
> +config HAVE_CONTEXT_TRACKING_WORK
> +	bool
> +	help
> +	  Architecture supports deferring work while not in kernel context.
> +	  This is especially useful on setups with isolated CPUs that might
> +	  want to avoid being interrupted to perform housekeeping tasks (for
> +	  ex. TLB invalidation or icache invalidation). The housekeeping
> +	  operations are performed upon re-entering the kernel.
> +
>  config HAVE_TIF_NOHZ
>  	bool
>  	help
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 05880301212e3..3f1557b7acd8f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -222,6 +222,7 @@ config X86
>  	select HAVE_CMPXCHG_LOCAL
>  	select HAVE_CONTEXT_TRACKING_USER		if X86_64
>  	select HAVE_CONTEXT_TRACKING_USER_OFFSTACK	if HAVE_CONTEXT_TRACKING_USER
> +	select HAVE_CONTEXT_TRACKING_WORK		if X86_64
>  	select HAVE_C_RECORDMCOUNT
>  	select HAVE_OBJTOOL_MCOUNT		if HAVE_OBJTOOL
>  	select HAVE_OBJTOOL_NOP_MCOUNT		if HAVE_OBJTOOL_MCOUNT
> diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
> new file mode 100644
> index 0000000000000..5f3b2d0977235
> --- /dev/null
> +++ b/arch/x86/include/asm/context_tracking_work.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
> +#define _ASM_X86_CONTEXT_TRACKING_WORK_H
> +
> +static __always_inline void arch_context_tracking_work(enum ct_work work)
> +{
> +	switch (work) {
> +	case CT_WORK_n:
> +		// Do work...
> +		break;
> +	case CT_WORK_MAX:
> +		WARN_ON_ONCE(true);
> +	}
> +}
> +
> +#endif
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index af9fe87a09225..0b0faa040e9b5 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -5,6 +5,7 @@
>  #include <linux/sched.h>
>  #include <linux/vtime.h>
>  #include <linux/context_tracking_state.h>
> +#include <linux/context_tracking_work.h>
>  #include <linux/instrumentation.h>
>  
>  #include <asm/ptrace.h>
> @@ -137,6 +138,26 @@ static __always_inline unsigned long ct_state_inc(int incby)
>  	return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state));
>  }
>  
> +#ifdef CONFIG_CONTEXT_TRACKING_WORK
> +static __always_inline unsigned long ct_state_inc_clear_work(int incby)
> +{
> +	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
> +	unsigned long new, old, state;
> +
> +	state = arch_atomic_read(&ct->state);
> +	do {
> +		old = state;
> +		new = old & ~CT_WORK_MASK;
> +		new += incby;
> +		state = arch_atomic_cmpxchg(&ct->state, old, new);
> +	} while (old != state);
> +
> +	return new;
> +}
> +#else
> +#define ct_state_inc_clear_work(x) ct_state_inc(x)
> +#endif
> +
>  static __always_inline bool warn_rcu_enter(void)
>  {
>  	bool ret = false;
> diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
> index 0b81248aa03e2..d2c302133672f 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -5,6 +5,7 @@
>  #include <linux/percpu.h>
>  #include <linux/static_key.h>
>  #include <linux/context_tracking_irq.h>
> +#include <linux/context_tracking_work.h>
>  
>  /* Offset to allow distinguishing irq vs. task-based idle entry/exit. */
>  #define CT_NESTING_IRQ_NONIDLE	((LONG_MAX / 2) + 1)
> @@ -39,16 +40,19 @@ struct context_tracking {
>  };
>  
>  /*
> - * We cram two different things within the same atomic variable:
> + * We cram up to three different things within the same atomic variable:
>   *
> - *                     CT_RCU_WATCHING_START  CT_STATE_START
> - *                                |                |
> - *                                v                v
> - *     MSB [ RCU watching counter ][ context_state ] LSB
> - *         ^                       ^
> - *         |                       |
> - * CT_RCU_WATCHING_END        CT_STATE_END
> + *                     CT_RCU_WATCHING_START                  CT_STATE_START
> + *                                |         CT_WORK_START          |
> + *                                |               |                |
> + *                                v               v                v
> + *     MSB [ RCU watching counter ][ context work ][ context_state ] LSB
> + *         ^                       ^               ^
> + *         |                       |               |
> + *         |                  CT_WORK_END          |
> + * CT_RCU_WATCHING_END                        CT_STATE_END
>   *
> + * The [ context work ] region spans 0 bits if CONFIG_CONTEXT_TRACKING_WORK=n
>   * Bits are used from the LSB upwards, so unused bits (if any) will always be in
>   * upper bits of the variable.
>   */
> @@ -59,18 +63,24 @@ struct context_tracking {
>  #define CT_STATE_START 0
>  #define CT_STATE_END   (CT_STATE_START + CT_STATE_WIDTH - 1)
>  
> -#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH)
> +#define CT_WORK_WIDTH (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? CT_WORK_MAX_OFFSET : 0)
> +#define	CT_WORK_START (CT_STATE_END + 1)
> +#define CT_WORK_END   (CT_WORK_START + CT_WORK_WIDTH - 1)
> +
> +#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_WORK_WIDTH - CT_STATE_WIDTH)
>  #define CT_RCU_WATCHING_WIDTH     (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH)
> -#define CT_RCU_WATCHING_START     (CT_STATE_END + 1)
> +#define CT_RCU_WATCHING_START     (CT_WORK_END + 1)
>  #define CT_RCU_WATCHING_END       (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1)
>  #define CT_RCU_WATCHING           BIT(CT_RCU_WATCHING_START)
>  
>  #define CT_STATE_MASK        GENMASK(CT_STATE_END,        CT_STATE_START)
> +#define CT_WORK_MASK         GENMASK(CT_WORK_END,         CT_WORK_START)
>  #define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
>  
>  #define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
>  
>  static_assert(CT_STATE_WIDTH        +
> +	      CT_WORK_WIDTH         +
>  	      CT_RCU_WATCHING_WIDTH +
>  	      CT_UNUSED_WIDTH       ==
>  	      CT_SIZE);
> diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
> new file mode 100644
> index 0000000000000..c68245f8d77c5
> --- /dev/null
> +++ b/include/linux/context_tracking_work.h
> @@ -0,0 +1,26 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_CONTEXT_TRACKING_WORK_H
> +#define _LINUX_CONTEXT_TRACKING_WORK_H
> +
> +#include <linux/bitops.h>
> +
> +enum {
> +	CT_WORK_n_OFFSET,
> +	CT_WORK_MAX_OFFSET
> +};
> +
> +enum ct_work {
> +	CT_WORK_n        = BIT(CT_WORK_n_OFFSET),
> +	CT_WORK_MAX      = BIT(CT_WORK_MAX_OFFSET)
> +};
> +
> +#include <asm/context_tracking_work.h>
> +
> +#ifdef CONFIG_CONTEXT_TRACKING_WORK
> +extern bool ct_set_cpu_work(unsigned int cpu, enum ct_work work);
> +#else
> +static inline bool
> +ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; }
> +#endif
> +
> +#endif
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index fb5be6e9b423f..3238bb1f41ff4 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -72,6 +72,70 @@ static __always_inline void rcu_task_trace_heavyweight_exit(void)
>  #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
>  }
>  
> +#ifdef CONFIG_CONTEXT_TRACKING_WORK
> +static noinstr void ct_work_flush(unsigned long seq)
> +{
> +	int bit;
> +
> +	seq = (seq & CT_WORK_MASK) >> CT_WORK_START;
> +
> +	/*
> +	 * arch_context_tracking_work() must be noinstr, non-blocking,
> +	 * and NMI safe.
> +	 */
> +	for_each_set_bit(bit, &seq, CT_WORK_MAX)
> +		arch_context_tracking_work(BIT(bit));
> +}
> +
> +/**
> + * ct_set_cpu_work - set work to be run at next kernel context entry
> + *
> + * If @cpu is not currently executing in kernelspace, it will execute the
> + * callback mapped to @work (see arch_context_tracking_work()) at its next
> + * entry into ct_kernel_enter_state().
> + *
> + * If it is already executing in kernelspace, this will be a no-op.
> + */
> +bool ct_set_cpu_work(unsigned int cpu, enum ct_work work)
> +{
> +	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
> +	unsigned int old;
> +	bool ret = false;
> +
> +	if (!ct->active)
> +		return false;
> +
> +	preempt_disable();
> +
> +	old = atomic_read(&ct->state);
> +
> +	/*
> +	 * The work bit must only be set if the target CPU is not executing
> +	 * in kernelspace.
> +	 * CT_RCU_WATCHING is used as a proxy for that - if the bit is set, we
> +	 * know for sure the CPU is executing in the kernel whether that be in
> +	 * NMI, IRQ or process context.
> +	 * Set CT_RCU_WATCHING here and let the cmpxchg do the check for us;
> +	 * the state could change between the atomic_read() and the cmpxchg().
> +	 */
> +	old |= CT_RCU_WATCHING;

Most of the time, the task should be either idle or in userspace. I'm still not
sure why you start with a bet that the CPU is in the kernel with RCU watching.

> +	/*
> +	 * Try setting the work until either
> +	 * - the target CPU has entered kernelspace
> +	 * - the work has been set
> +	 */
> +	do {
> +		ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START));
> +	} while (!ret && !(old & CT_RCU_WATCHING));

So this applies blindly to idle as well, right? It should work but note that
idle entry code before RCU watches is also fragile.

The rest looks good.

Thanks!


> +
> +	preempt_enable();
> +	return ret;
> +}
> +#else
> +static __always_inline void ct_work_flush(unsigned long work) { }
> +static __always_inline void ct_work_clear(struct context_tracking *ct) { }
> +#endif
> +
>  /*
>   * Record entry into an extended quiescent state.  This is only to be
>   * called when not already in an extended quiescent state, that is,
> @@ -88,7 +152,7 @@ static noinstr void ct_kernel_exit_state(int offset)
>  	rcu_task_trace_heavyweight_enter();  // Before CT state update!
>  	// RCU is still watching.  Better not be in extended quiescent state!
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !rcu_is_watching_curr_cpu());
> -	(void)ct_state_inc(offset);
> +	(void)ct_state_inc_clear_work(offset);
>  	// RCU is no longer watching.
>  }
>  
> @@ -99,7 +163,7 @@ static noinstr void ct_kernel_exit_state(int offset)
>   */
>  static noinstr void ct_kernel_enter_state(int offset)
>  {
> -	int seq;
> +	unsigned long seq;
>  
>  	/*
>  	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
> @@ -107,6 +171,7 @@ static noinstr void ct_kernel_enter_state(int offset)
>  	 * critical section.
>  	 */
>  	seq = ct_state_inc(offset);
> +	ct_work_flush(seq);
>  	// RCU is now watching.  Better not be in an extended quiescent state!
>  	rcu_task_trace_heavyweight_exit();  // After CT state update!
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & CT_RCU_WATCHING));
> diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> index 7c6a52f7836ce..1a0c027aad141 100644
> --- a/kernel/time/Kconfig
> +++ b/kernel/time/Kconfig
> @@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE
>  	  Say N otherwise, this option brings an overhead that you
>  	  don't want in production.
>  
> +config CONTEXT_TRACKING_WORK
> +	bool
> +	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
> +	default y
> +
>  config NO_HZ
>  	bool "Old Idle dynticks config"
>  	help
> -- 
> 2.51.0
> 

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 24/29] context_tracking,x86: Defer kernel text patching IPIs
  2025-10-10 15:38 ` [PATCH v6 24/29] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
@ 2025-10-28 14:49   ` Frederic Weisbecker
  0 siblings, 0 replies; 54+ messages in thread
From: Frederic Weisbecker @ 2025-10-28 14:49 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel,
	Peter Zijlstra (Intel), Nicolas Saenz Julienne, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On Fri, Oct 10, 2025 at 05:38:34PM +0200, Valentin Schneider wrote:
> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
> them vs the newly patched instruction. CPUs that are executing in userspace
> do not need this synchronization to happen immediately, and this is
> actually harmful interference for NOHZ_FULL CPUs.
> 
> As the synchronization IPIs are sent using a blocking call, returning from
> text_poke_bp_batch() implies all CPUs will observe the patched
> instruction(s), and this should be preserved even if the IPI is deferred.
> In other words, to safely defer this synchronization, any kernel
> instruction leading to the execution of the deferred instruction
> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
> 
> This means we must pay attention to mutable instructions in the early entry
> code:
> - alternatives
> - static keys
> - static calls
> - all sorts of probes (kprobes/ftrace/bpf/???)
> 
> The early entry code leading to ct_work_flush() is noinstr, which gets rid
> of the probes.
> 
> Alternatives are safe, because it's boot-time patching (before SMP is
> even brought up) which is before any IPI deferral can happen.
> 
> This leaves us with static keys and static calls.
> 
> Any static key used in early entry code should be only forever-enabled at
> boot time, IOW __ro_after_init (pretty much like alternatives). Exceptions
> are explicitly marked as allowed in .noinstr and will always generate an
> IPI when flipped.
> 
> The same applies to static calls - they should be only updated at boot
> time, or manually marked as an exception.
> 
> Objtool is now able to point at static keys/calls that don't respect this,
> and all static keys/calls used in early entry code have now been verified
> as behaving appropriately.
> 
> Leverage the new context_tracking infrastructure to defer sync_core() IPIs
> to a target CPU's next kernel entry.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>

Acked-by: Frederic Weisbecker <frederic@kernel.org>

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
  2025-10-10 15:38 ` [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3 Valentin Schneider
@ 2025-10-28 15:59   ` Frederic Weisbecker
  2025-10-29 10:16     ` Valentin Schneider
  0 siblings, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2025-10-28 15:59 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Paul E. McKenney,
	Jason Baron, Steven Rostedt, Ard Biesheuvel, Sami Tolvanen,
	David S. Miller, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman,
	Andrew Morton, Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn,
	Dan Carpenter, Oleg Nesterov, Juri Lelli, Clark Williams,
	Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On Fri, Oct 10, 2025 at 05:38:37PM +0200, Valentin Schneider wrote:
> Deferring kernel range TLB flushes requires the guarantee that upon
> entering the kernel, no stale entry may be accessed. The simplest way to
> provide such a guarantee is to issue an unconditional flush upon switching
> to the kernel CR3, as this is the pivoting point where such stale entries
> may be accessed.
> 
> As this is only relevant to NOHZ_FULL, restrict the mechanism to NOHZ_FULL
> CPUs.
> 
> Note that the COALESCE_TLBI config option is introduced in a later commit,
> when the whole feature is implemented.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  arch/x86/entry/calling.h      | 26 +++++++++++++++++++++++---
>  arch/x86/kernel/asm-offsets.c |  1 +
>  2 files changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 813451b1ddecc..19fb6de276eac 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -9,6 +9,7 @@
>  #include <asm/ptrace-abi.h>
>  #include <asm/msr.h>
>  #include <asm/nospec-branch.h>
> +#include <asm/invpcid.h>
> 
>  /*
> 
> @@ -171,8 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
> 	andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
>  .endm
> 
> -.macro COALESCE_TLBI
> +.macro COALESCE_TLBI scratch_reg:req
>  #ifdef CONFIG_COALESCE_TLBI
> +	/* No point in doing this for housekeeping CPUs */
> +	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
> +	bt	\scratch_reg, tick_nohz_full_mask(%rip)
> +	jnc	.Lend_tlbi_\@

I assume it's not possible to have a static call/branch to
take care of all this?

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (29 preceding siblings ...)
  2025-10-14 12:58 ` [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Juri Lelli
@ 2025-10-28 16:25 ` Frederic Weisbecker
  2025-10-29 10:32   ` Valentin Schneider
  30 siblings, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2025-10-28 16:25 UTC (permalink / raw)
  To: Valentin Schneider, Phil Auld
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Paul E. McKenney,
	Jason Baron, Steven Rostedt, Ard Biesheuvel, Sami Tolvanen,
	David S. Miller, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman,
	Andrew Morton, Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn,
	Dan Carpenter, Oleg Nesterov, Juri Lelli, Clark Williams,
	Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik

+Cc Phil Auld

On Fri, Oct 10, 2025 at 05:38:10PM +0200, Valentin Schneider wrote:
> Patches
> =======
> 
> o Patches 1-2 are standalone objtool cleanups.

Would be nice to get these merged.

> o Patches 3-4 add an RCU testing feature.

I'm taking this one.

> 
> o Patches 5-6 add infrastructure for annotating static keys and static calls
>   that may be used in noinstr code (courtesy of Josh).
> o Patches 7-20 use said annotations on relevant keys / calls.
> o Patch 21 enforces proper usage of said annotations (courtesy of Josh).
> 
> o Patch 22 deals with detecting NOINSTR text in modules

Not sure how to route those. If we wait for each individual subsystem,
this may take a while.

> o Patches 23-24 deal with kernel text sync IPIs

I would be fine taking those (the concerns I had are just details)
but they depend on all the annotations. Alternatively I can take the whole
thing but then we'll need some acks.
 
> o Patches 25-29 deal with kernel range TLB flush IPIs

I'll leave these more time for now ;o)
And if they ever go somewhere, it should be through x86 tree.

Also, here is another candidate usecase for this deferral thing.
I remember Phil Auld complaining that stop_machine() on CPU offlining was
a big problem for nohz_full. This matters all the more as we are working on
a cpuset interface to toggle nohz_full, which will require the CPUs
to be offlined.

So my point is that when a CPU goes offline, stop_machine() puts all
CPUs into a loop with IRQs disabled. CPUs in userspace could possibly
escape that since they don't touch the kernel anyway. But as soon as
they enter the kernel, they should either acquire the final state of
stop_machine if completed or join the global loop if in the middle.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 23/29] context-tracking: Introduce work deferral infrastructure
  2025-10-28 14:00   ` Frederic Weisbecker
@ 2025-10-29 10:09     ` Valentin Schneider
  2025-10-29 14:52       ` Frederic Weisbecker
  0 siblings, 1 reply; 54+ messages in thread
From: Valentin Schneider @ 2025-10-29 10:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel,
	Nicolas Saenz Julienne, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Lutomirski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On 28/10/25 15:00, Frederic Weisbecker wrote:
> On Fri, Oct 10, 2025 at 05:38:33PM +0200, Valentin Schneider wrote:
>> +	old = atomic_read(&ct->state);
>> +
>> +	/*
>> +	 * The work bit must only be set if the target CPU is not executing
>> +	 * in kernelspace.
>> +	 * CT_RCU_WATCHING is used as a proxy for that - if the bit is set, we
>> +	 * know for sure the CPU is executing in the kernel whether that be in
>> +	 * NMI, IRQ or process context.
>> +	 * Set CT_RCU_WATCHING here and let the cmpxchg do the check for us;
>> +	 * the state could change between the atomic_read() and the cmpxchg().
>> +	 */
>> +	old |= CT_RCU_WATCHING;
>
> Most of the time, the task should be either idle or in userspace. I'm still not
> sure why you start with a bet that the CPU is in the kernel with RCU watching.
>

Right I think I got that the wrong way around when I switched to using
CT_RCU_WATCHING vs CT_STATE_KERNEL. That wants to be

  old &= ~CT_RCU_WATCHING;

i.e. bet the CPU is NOHZ-idle; if it's not, the cmpxchg fails and we don't
store the work bit.
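
i.e. something like this (untested sketch, the same loop as in the patch
with the initial bet flipped):

```
/*
 * Sketch: assume the target CPU is in an extended quiescent state
 * (RCU not watching), so the first cmpxchg attempt succeeds in the
 * common nohz_full-in-userspace case.
 */
old = atomic_read(&ct->state);
old &= ~CT_RCU_WATCHING;

do {
        ret = atomic_try_cmpxchg(&ct->state, &old,
                                 old | (work << CT_WORK_START));
} while (!ret && !(old & CT_RCU_WATCHING));
```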

>> +	/*
>> +	 * Try setting the work until either
>> +	 * - the target CPU has entered kernelspace
>> +	 * - the work has been set
>> +	 */
>> +	do {
>> +		ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START));
>> +	} while (!ret && !(old & CT_RCU_WATCHING));
>
> So this applies blindly to idle as well, right? It should work but note that
> idle entry code before RCU watches is also fragile.
>

Yeah I remember losing some hair trying to grok the idle entry situation;
we could keep this purely NOHZ_FULL and have the deferral condition be:

  (ct->state & CT_STATE_USER) && !(ct->state & CT_RCU_WATCHING)
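
Spelled out as a (hypothetical) helper, that condition would be something
like this (sketch; written with CT_STATE_MASK to make the state comparison
exact):

```
static bool ct_cpu_in_user_eqs(struct context_tracking *ct)
{
        int state = atomic_read(&ct->state);

        /* In userspace, and in an extended quiescent state */
        return (state & CT_STATE_MASK) == CT_STATE_USER &&
               !(state & CT_RCU_WATCHING);
}
```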


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
  2025-10-28 15:59   ` Frederic Weisbecker
@ 2025-10-29 10:16     ` Valentin Schneider
  2025-10-29 10:31       ` Frederic Weisbecker
  0 siblings, 1 reply; 54+ messages in thread
From: Valentin Schneider @ 2025-10-29 10:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Paul E. McKenney,
	Jason Baron, Steven Rostedt, Ard Biesheuvel, Sami Tolvanen,
	David S. Miller, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman,
	Andrew Morton, Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn,
	Dan Carpenter, Oleg Nesterov, Juri Lelli, Clark Williams,
	Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On 28/10/25 16:59, Frederic Weisbecker wrote:
> On Fri, Oct 10, 2025 at 05:38:37PM +0200, Valentin Schneider wrote:
>> @@ -171,8 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
>>      andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
>>  .endm
>>
>> -.macro COALESCE_TLBI
>> +.macro COALESCE_TLBI scratch_reg:req
>>  #ifdef CONFIG_COALESCE_TLBI
>> +	/* No point in doing this for housekeeping CPUs */
>> +	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
>> +	bt	\scratch_reg, tick_nohz_full_mask(%rip)
>> +	jnc	.Lend_tlbi_\@
>
> I assume it's not possible to have a static call/branch to
> take care of all this?
>

I think technically yes, but that would have to be a per-cpu patchable
location, which would mean something like each CPU having its own copy of
that text page... Unless there's some existing way to statically optimize

  if (cpumask_test_cpu(smp_processor_id(), mask))

where @mask is a boot-time constant (i.e. the nohz_full mask).

> Thanks.
>
> --
> Frederic Weisbecker
> SUSE Labs


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
  2025-10-29 10:16     ` Valentin Schneider
@ 2025-10-29 10:31       ` Frederic Weisbecker
  2025-10-29 14:13         ` Valentin Schneider
  0 siblings, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2025-10-29 10:31 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Paul E. McKenney,
	Jason Baron, Steven Rostedt, Ard Biesheuvel, Sami Tolvanen,
	David S. Miller, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman,
	Andrew Morton, Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn,
	Dan Carpenter, Oleg Nesterov, Juri Lelli, Clark Williams,
	Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On Wed, Oct 29, 2025 at 11:16:23AM +0100, Valentin Schneider wrote:
> On 28/10/25 16:59, Frederic Weisbecker wrote:
> > On Fri, Oct 10, 2025 at 05:38:37PM +0200, Valentin Schneider wrote:
> >> @@ -171,8 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
> >>      andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
> >>  .endm
> >>
> >> -.macro COALESCE_TLBI
> >> +.macro COALESCE_TLBI scratch_reg:req
> >>  #ifdef CONFIG_COALESCE_TLBI
> >> +	/* No point in doing this for housekeeping CPUs */
> >> +	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
> >> +	bt	\scratch_reg, tick_nohz_full_mask(%rip)
> >> +	jnc	.Lend_tlbi_\@
> >
> > I assume it's not possible to have a static call/branch to
> > take care of all this?
> >
> 
> I think technically yes, but that would have to be a per-cpu patchable
> location, which would mean something like each CPU having its own copy of
> that text page... Unless there's some existing way to statically optimize
> 
>   if (cpumask_test_cpu(smp_processor_id(), mask))
> 
> where @mask is a boot-time constant (i.e. the nohz_full mask).

Or just check the housekeeping_overridden static key before everything. It is
only enabled if nohz_full, isolcpus or a cpuset isolated partition (well,
it's on the way for the last one) is in use, but those are all niche, which
means you spare 99.999% of kernel usecases.
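
Conceptually something like this (sketch only, glossing over the fact that
the real site is early entry asm):

```
/*
 * housekeeping_cpu() tests the housekeeping_overridden static key
 * first, so non-isolated setups only pay a patched-out branch here.
 */
if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_KERNEL_NOISE))
        invpcid_flush_all_nonglobals(); /* stand-in for the real flush */
```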

Thanks.

> 
> > Thanks.
> >
> > --
> > Frederic Weisbecker
> > SUSE Labs
> 
> 

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-10-28 16:25 ` Frederic Weisbecker
@ 2025-10-29 10:32   ` Valentin Schneider
  2025-10-29 17:15     ` Frederic Weisbecker
  0 siblings, 1 reply; 54+ messages in thread
From: Valentin Schneider @ 2025-10-29 10:32 UTC (permalink / raw)
  To: Frederic Weisbecker, Phil Auld
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Paul E. McKenney,
	Jason Baron, Steven Rostedt, Ard Biesheuvel, Sami Tolvanen,
	David S. Miller, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman,
	Andrew Morton, Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn,
	Dan Carpenter, Oleg Nesterov, Juri Lelli, Clark Williams,
	Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On 28/10/25 17:25, Frederic Weisbecker wrote:
> +Cc Phil Auld
>
> Le Fri, Oct 10, 2025 at 05:38:10PM +0200, Valentin Schneider a écrit :
>> Patches
>> =======
>>
>> o Patches 1-2 are standalone objtool cleanups.
>
> Would be nice to get these merged.
>
>> o Patches 3-4 add an RCU testing feature.
>
> I'm taking this one.
>

Thanks!

>>
>> o Patches 5-6 add infrastructure for annotating static keys and static calls
>>   that may be used in noinstr code (courtesy of Josh).
>> o Patches 7-20 use said annotations on relevant keys / calls.
>> o Patch 21 enforces proper usage of said annotations (courtesy of Josh).
>>
>> o Patch 22 deals with detecting NOINSTR text in modules
>
> Not sure how to route those. If we wait for each individual subsystem,
> this may take a while.
>

At least the __ro_after_init ones could go as their own thing since they're
standalone, but yeah they're the ones touching all sorts of subsystems :/

>> o Patches 23-24 deal with kernel text sync IPIs
>
> I would be fine taking those (the concerns I had are just details)
> but they depend on all the annotations. Alternatively I can take the whole
> thing but then we'll need some acks.
>
>> o Patches 25-29 deal with kernel range TLB flush IPIs
>
> I'll leave these more time for now ;o)
> And if they ever go somewhere, it should be through x86 tree.
>
> Also, here is another candidate usecase for this deferral thing.
> I remember Phil Auld complaining that stop_machine() on CPU offlining was
> a big problem for nohz_full. Especially while we are working on
> a cpuset interface to toggle nohz_full but this will require the CPUs
> to be offline.
>

Yeah that does ring a bell...

> So my point is that when a CPU goes offline, stop_machine() puts all
> CPUs into a loop with IRQs disabled. CPUs in userspace could possibly
> escape that since they don't touch the kernel anyway. But as soon as
> they enter the kernel, they should either acquire the final state of
> stop_machine if completed or join the global loop if in the middle.
>

I need to have a think about that one; one pain point I see is that the
context tracking work has to be NMI-safe, since e.g. an NMI can take us out
of userspace. Another is that NOHZ-full CPUs need to be special-cased in
the stop_machine queueing / completion.

/me goes fetch a new notebook

> Thanks.
>
> --
> Frederic Weisbecker
> SUSE Labs


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
  2025-10-29 10:31       ` Frederic Weisbecker
@ 2025-10-29 14:13         ` Valentin Schneider
  2025-10-29 14:49           ` Frederic Weisbecker
  0 siblings, 1 reply; 54+ messages in thread
From: Valentin Schneider @ 2025-10-29 14:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Paul E. McKenney,
	Jason Baron, Steven Rostedt, Ard Biesheuvel, Sami Tolvanen,
	David S. Miller, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman,
	Andrew Morton, Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn,
	Dan Carpenter, Oleg Nesterov, Juri Lelli, Clark Williams,
	Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On 29/10/25 11:31, Frederic Weisbecker wrote:
> On Wed, Oct 29, 2025 at 11:16:23AM +0100, Valentin Schneider wrote:
> >> On 28/10/25 16:59, Frederic Weisbecker wrote:
> >> > On Fri, Oct 10, 2025 at 05:38:37PM +0200, Valentin Schneider wrote:
>> >> @@ -171,8 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
>> >>      andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
>> >>  .endm
>> >>
>> >> -.macro COALESCE_TLBI
>> >> +.macro COALESCE_TLBI scratch_reg:req
>> >>  #ifdef CONFIG_COALESCE_TLBI
>> >> +	/* No point in doing this for housekeeping CPUs */
>> >> +	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
>> >> +	bt	\scratch_reg, tick_nohz_full_mask(%rip)
>> >> +	jnc	.Lend_tlbi_\@
>> >
>> > I assume it's not possible to have a static call/branch to
>> > take care of all this ?
>> >
>>
>> I think technically yes, but that would have to be a per-cpu patchable
>> location, which would mean something like each CPU having its own copy of
>> that text page... Unless there's some existing way to statically optimize
>>
>>   if (cpumask_test_cpu(smp_processor_id(), mask))
>>
>> where @mask is a boot-time constant (i.e. the nohz_full mask).
>
> Or just check the housekeeping_overridden static key before everything. It is
> only enabled if nohz_full, isolcpus or a cpuset isolated partition (well,
> it's on the way for the last one) is in use, but those are all niche, which
> means you spare 99.999% of kernel usecases.
>

Oh right, if NOHZ_FULL is actually in use.

Yeah that housekeeping key could do since, at least for the cmdline
approach, it's set during start_kernel(). I need to have a think about the
runtime cpuset case.

Given we have ALTERNATIVEs in there, I assume something like a
boot-time-driven static key could do, but I haven't found out yet if and
how that can be shoved in an ASM file.
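
One option would be to piggyback on alternatives instead of a jump label,
with a synthetic CPU feature bit forced at boot (sketch only, and
X86_FEATURE_COALESCE_TLBI is a made-up bit):

```
/*
 * Boot-time side, sketch: alternatives are applied before SMP
 * bring-up, so the bit is set long before any IPI deferral can happen.
 */
if (housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
        setup_force_cpu_cap(X86_FEATURE_COALESCE_TLBI); /* hypothetical */
```

The asm side could then wrap the macro body in
ALTERNATIVE "jmp .Lend_tlbi_\@", "", X86_FEATURE_COALESCE_TLBI, i.e. the
same pattern calling.h already uses for X86_FEATURE_PTI.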

> Thanks.
>
>>
>> > Thanks.
>> >
>> > --
>> > Frederic Weisbecker
>> > SUSE Labs
>>
>>
>
> --
> Frederic Weisbecker
> SUSE Labs


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
  2025-10-29 14:13         ` Valentin Schneider
@ 2025-10-29 14:49           ` Frederic Weisbecker
  2025-10-31  9:55             ` Valentin Schneider
  0 siblings, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2025-10-29 14:49 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Paul E. McKenney,
	Jason Baron, Steven Rostedt, Ard Biesheuvel, Sami Tolvanen,
	David S. Miller, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman,
	Andrew Morton, Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn,
	Dan Carpenter, Oleg Nesterov, Juri Lelli, Clark Williams,
	Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On Wed, Oct 29, 2025 at 03:13:59PM +0100, Valentin Schneider wrote:
> On 29/10/25 11:31, Frederic Weisbecker wrote:
> > On Wed, Oct 29, 2025 at 11:16:23AM +0100, Valentin Schneider wrote:
> >> On 28/10/25 16:59, Frederic Weisbecker wrote:
> >> > On Fri, Oct 10, 2025 at 05:38:37PM +0200, Valentin Schneider wrote:
> >> >> @@ -171,8 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
> >> >>      andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
> >> >>  .endm
> >> >>
> >> >> -.macro COALESCE_TLBI
> >> >> +.macro COALESCE_TLBI scratch_reg:req
> >> >>  #ifdef CONFIG_COALESCE_TLBI
> >> >> +	/* No point in doing this for housekeeping CPUs */
> >> >> +	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
> >> >> +	bt	\scratch_reg, tick_nohz_full_mask(%rip)
> >> >> +	jnc	.Lend_tlbi_\@
> >> >
> >> > I assume it's not possible to have a static call/branch to
> >> > take care of all this?
> >> >
> >>
> >> I think technically yes, but that would have to be a per-cpu patchable
> >> location, which would mean something like each CPU having its own copy of
> >> that text page... Unless there's some existing way to statically optimize
> >>
> >>   if (cpumask_test_cpu(smp_processor_id(), mask))
> >>
> >> where @mask is a boot-time constant (i.e. the nohz_full mask).
> >
> > Or just check the housekeeping_overridden static key before everything. It is
> > only enabled if nohz_full, isolcpus or a cpuset isolated partition (well,
> > it's on the way for the last one) is in use, but those are all niche, which
> > means you spare 99.999% of kernel usecases.
> >
> 
> Oh right, if NOHZ_FULL is actually in use.
> 
> Yeah that housekeeping key could do since, at least for the cmdline
> approach, it's set during start_kernel(). I need to have a think about the
> runtime cpuset case.

You can ignore the runtime thing and simply check the static key before reading
the housekeeping mask. For now nohz_full is only enabled by cmdline.

> Given we have ALTERNATIVEs in there, I assume something like a
> boot-time-driven static key could do, but I haven't found out yet if and
> how that can be shoved in an ASM file.

Right, I thought I had seen static keys in ASM already but I can't find it
anymore. arch/x86/include/asm/jump_label.h is full of reusable magic
though.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 23/29] context-tracking: Introduce work deferral infrastructure
  2025-10-29 10:09     ` Valentin Schneider
@ 2025-10-29 14:52       ` Frederic Weisbecker
  0 siblings, 0 replies; 54+ messages in thread
From: Frederic Weisbecker @ 2025-10-29 14:52 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel,
	Nicolas Saenz Julienne, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Lutomirski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Paolo Bonzini, Arnd Bergmann, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On Wed, Oct 29, 2025 at 11:09:50AM +0100, Valentin Schneider wrote:
> On 28/10/25 15:00, Frederic Weisbecker wrote:
> > On Fri, Oct 10, 2025 at 05:38:33PM +0200, Valentin Schneider wrote:
> >> +	old = atomic_read(&ct->state);
> >> +
> >> +	/*
> >> +	 * The work bit must only be set if the target CPU is not executing
> >> +	 * in kernelspace.
> >> +	 * CT_RCU_WATCHING is used as a proxy for that - if the bit is set, we
> >> +	 * know for sure the CPU is executing in the kernel whether that be in
> >> +	 * NMI, IRQ or process context.
> >> +	 * Set CT_RCU_WATCHING here and let the cmpxchg do the check for us;
> >> +	 * the state could change between the atomic_read() and the cmpxchg().
> >> +	 */
> >> +	old |= CT_RCU_WATCHING;
> >
> > Most of the time, the task should be either idle or in userspace. I'm still not
> > sure why you start with a bet that the CPU is in the kernel with RCU watching.
> >
> 
> Right I think I got that the wrong way around when I switched to using
> CT_RCU_WATCHING vs CT_STATE_KERNEL. That wants to be
> 
>   old &= ~CT_RCU_WATCHING;
> 
> i.e. bet the CPU is NOHZ-idle, if it's not the cmpxchg fails and we don't
> store the work bit.

Right.

> 
> >> +	/*
> >> +	 * Try setting the work until either
> >> +	 * - the target CPU has entered kernelspace
> >> +	 * - the work has been set
> >> +	 */
> >> +	do {
> >> +		ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START));
> >> +	} while (!ret && !(old & CT_RCU_WATCHING));
> >
> > So this applies blindly to idle as well, right? It should work but note that
> > idle entry code before RCU watches is also fragile.
> >
> 
> Yeah I remember losing some hair trying to grok the idle entry situation;
> we could keep this purely NOHZ_FULL and have the deferral condition be:
> 
>   (ct->state & CT_STATE_USER) && !(ct->state & CT_RCU_WATCHING)

Well, after all what works for NOHZ_FULL should also work for idle. It's
preceded by entry code as well (or rather __cpuidle).

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2025-10-29 10:32   ` Valentin Schneider
@ 2025-10-29 17:15     ` Frederic Weisbecker
  0 siblings, 0 replies; 54+ messages in thread
From: Frederic Weisbecker @ 2025-10-29 17:15 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Phil Auld, linux-kernel, linux-mm, rcu, x86, linux-arm-kernel,
	loongarch, linux-riscv, linux-arch, linux-trace-kernel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Paul E. McKenney, Jason Baron, Steven Rostedt,
	Ard Biesheuvel, Sami Tolvanen, David S. Miller, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Mel Gorman, Andrew Morton, Masahiro Yamada,
	Han Shen, Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov,
	Juri Lelli, Clark Williams, Yair Podemsky, Marcelo Tosatti,
	Daniel Wagner, Petr Tesarik

On Wed, Oct 29, 2025 at 11:32:58AM +0100, Valentin Schneider wrote:
> I need to have a think about that one; one pain point I see is that the
> context tracking work has to be NMI-safe, since e.g. an NMI can take us out
> of userspace. Another is that NOHZ-full CPUs need to be special-cased in
> the stop_machine queueing / completion.
> 
> /me goes fetch a new notebook

Something like the below (untested)?

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 485b32881fde..2940e28ecea6 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
 #include <asm/sync_core.h>
+#include <linux/stop_machine.h>
 
 static __always_inline void arch_context_tracking_work(enum ct_work work)
 {
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(enum ct_work work)
 	case CT_WORK_SYNC:
 		sync_core();
 		break;
+	case CT_WORK_STOP_MACHINE:
+		stop_machine_poll_wait();
+		break;
 	case CT_WORK_MAX:
 		WARN_ON_ONCE(true);
 	}
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 2facc621be06..b63200bd73d6 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -6,12 +6,14 @@
 
 enum {
 	CT_WORK_SYNC_OFFSET,
+	CT_WORK_STOP_MACHINE_OFFSET,
 	CT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
-	CT_WORK_SYNC     = BIT(CT_WORK_SYNC_OFFSET),
-	CT_WORK_MAX      = BIT(CT_WORK_MAX_OFFSET)
+	CT_WORK_SYNC         = BIT(CT_WORK_SYNC_OFFSET),
+	CT_WORK_STOP_MACHINE = BIT(CT_WORK_STOP_MACHINE_OFFSET),
+	CT_WORK_MAX          = BIT(CT_WORK_MAX_OFFSET)
 };
 
 #include <asm/context_tracking_work.h>
diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 72820503514c..0efe88e84b8a 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -36,6 +36,7 @@ bool stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
 void stop_machine_park(int cpu);
 void stop_machine_unpark(int cpu);
 void stop_machine_yield(const struct cpumask *cpumask);
+void stop_machine_poll_wait(void);
 
 extern void print_stop_info(const char *log_lvl, struct task_struct *task);
 
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 3fe6b0c99f3d..8f0281b0db64 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -22,6 +22,7 @@
 #include <linux/atomic.h>
 #include <linux/nmi.h>
 #include <linux/sched/wake_q.h>
+#include <linux/sched/isolation.h>
 
 /*
  * Structure to determine completion condition and record errors.  May
@@ -176,6 +177,68 @@ struct multi_stop_data {
 	atomic_t		thread_ack;
 };
 
+static DEFINE_PER_CPU(int, stop_machine_poll);
+
+void stop_machine_poll_wait(void)
+{
+	int *poll = this_cpu_ptr(&stop_machine_poll);
+
+	while (*poll)
+		cpu_relax();
+	/* Enforce the work in stop machine to be visible */
+	smp_mb();
+}
+
+static void stop_machine_poll_start(struct multi_stop_data *msdata)
+{
+	int cpu;
+
+	if (!housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
+		return;
+
+	/* Random target can't be known in advance */
+	if (!msdata->active_cpus)
+		return;
+
+	for_each_cpu_andnot(cpu, cpu_online_mask, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)) {
+		int *poll = per_cpu_ptr(&stop_machine_poll, cpu);
+
+		if (cpumask_test_cpu(cpu, msdata->active_cpus))
+			continue;
+
+		*poll = 1;
+
+		/*
+		 * Act as a full barrier so that if the work is queued, polling is
+		 * visible.
+		 */
+		if (ct_set_cpu_work(cpu, CT_WORK_STOP_MACHINE))
+			msdata->num_threads--;
+		else
+			*poll = 0;
+	}
+}
+
+static void stop_machine_poll_complete(struct multi_stop_data *msdata)
+{
+	int cpu;
+
+	if (!housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
+		return;
+
+	for_each_cpu_andnot(cpu, cpu_online_mask, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)) {
+		int *poll = per_cpu_ptr(&stop_machine_poll, cpu);
+
+		if (cpumask_test_cpu(cpu, msdata->active_cpus))
+			continue;
+		/*
+		 * The RmW in ack_state() fully orders the work performed in stop_machine()
+		 * with polling.
+		 */
+		*poll = 0;
+	}
+}
+
 static void set_state(struct multi_stop_data *msdata,
 		      enum multi_stop_state newstate)
 {
@@ -186,10 +249,13 @@ static void set_state(struct multi_stop_data *msdata,
 }
 
 /* Last one to ack a state moves to the next state. */
-static void ack_state(struct multi_stop_data *msdata)
+static bool ack_state(struct multi_stop_data *msdata)
 {
-	if (atomic_dec_and_test(&msdata->thread_ack))
+	if (atomic_dec_and_test(&msdata->thread_ack)) {
 		set_state(msdata, msdata->state + 1);
+		return true;
+	}
+	return false;
 }
 
 notrace void __weak stop_machine_yield(const struct cpumask *cpumask)
@@ -240,7 +306,8 @@ static int multi_cpu_stop(void *data)
 			default:
 				break;
 			}
-			ack_state(msdata);
+			if (ack_state(msdata) && msdata->state == MULTI_STOP_EXIT)
+				stop_machine_poll_complete(msdata);
 		} else if (curstate > MULTI_STOP_PREPARE) {
 			/*
 			 * At this stage all other CPUs we depend on must spin
@@ -615,6 +682,8 @@ int stop_machine_cpuslocked(cpu_stop_fn_t fn, void *data,
 		return ret;
 	}
 
+	stop_machine_poll_start(&msdata);
+
 	/* Set the initial state and stop all online cpus. */
 	set_state(&msdata, MULTI_STOP_PREPARE);
 	return stop_cpus(cpu_online_mask, multi_cpu_stop, &msdata);

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 06/29] static_call: Add read-only-after-init static calls
  2025-10-10 15:38 ` [PATCH v6 06/29] static_call: Add read-only-after-init static calls Valentin Schneider
@ 2025-10-30 10:25   ` Petr Tesarik
  2025-10-31 11:52     ` Valentin Schneider
  0 siblings, 1 reply; 54+ messages in thread
From: Petr Tesarik @ 2025-10-30 10:25 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Josh Poimboeuf,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner

On Fri, 10 Oct 2025 17:38:16 +0200
Valentin Schneider <vschneid@redhat.com> wrote:

> From: Josh Poimboeuf <jpoimboe@kernel.org>
> 
> Deferring a code patching IPI is unsafe if the patched code is in a
> noinstr region.  In that case the text poke code must trigger an
> immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ
> CPU running in userspace.
> 
> If a noinstr static call only needs to be patched during boot, its key
> can be made ro-after-init to ensure it will never be patched at runtime.
> 
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
>  include/linux/static_call.h | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/include/linux/static_call.h b/include/linux/static_call.h
> index 78a77a4ae0ea8..ea6ca57e2a829 100644
> --- a/include/linux/static_call.h
> +++ b/include/linux/static_call.h
> @@ -192,6 +192,14 @@ extern long __static_call_return0(void);
>  	};								\
>  	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
>  
> +#define DEFINE_STATIC_CALL_RO(name, _func)				\
> +	DECLARE_STATIC_CALL(name, _func);				\
> +	struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
> +		.func = _func,						\
> +		.type = 1,						\
> +	};								\
> +	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
> +
>  #define DEFINE_STATIC_CALL_NULL(name, _func)				\
>  	DECLARE_STATIC_CALL(name, _func);				\
>  	struct static_call_key STATIC_CALL_KEY(name) = {		\
> @@ -200,6 +208,14 @@ extern long __static_call_return0(void);
>  	};								\
>  	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
>  
> +#define DEFINE_STATIC_CALL_NULL_RO(name, _func)				\
> +	DECLARE_STATIC_CALL(name, _func);				\
> +	struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
> +		.func = NULL,						\
> +		.type = 1,						\
> +	};								\
> +	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
> +

I think it would be a good idea to add a comment describing when these
macros are supposed to be used, similar to the explanation you wrote for
the _NOINSTR variants. Just to provide a clue for people adding a new
static key in the future, because the commit message may become a bit
hard to find if there are a few cleanup patches on top.

Just my two cents,
Petr T

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
  2025-10-29 14:49           ` Frederic Weisbecker
@ 2025-10-31  9:55             ` Valentin Schneider
  0 siblings, 0 replies; 54+ messages in thread
From: Valentin Schneider @ 2025-10-31  9:55 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann, Paul E. McKenney,
	Jason Baron, Steven Rostedt, Ard Biesheuvel, Sami Tolvanen,
	David S. Miller, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman,
	Andrew Morton, Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn,
	Dan Carpenter, Oleg Nesterov, Juri Lelli, Clark Williams,
	Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik

On 29/10/25 15:49, Frederic Weisbecker wrote:
> On Wed, Oct 29, 2025 at 03:13:59PM +0100, Valentin Schneider wrote:
>> Given we have ALTERNATIVEs in there, I assume something like a
>> boot-time-driven static key could do, but I haven't found out yet if and
>> how that can be shoved in an ASM file.
>
> Right, I thought I had seen static keys in ASM already but I can't find it
> anymore. arch/x86/include/asm/jump_label.h is full of reusable magic
> though.
>

I got something ugly that /seems/ to work, now to spend twice the time to
clean it up :-)

> Thanks.
>
> --
> Frederic Weisbecker
> SUSE Labs


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 06/29] static_call: Add read-only-after-init static calls
  2025-10-30 10:25   ` Petr Tesarik
@ 2025-10-31 11:52     ` Valentin Schneider
  0 siblings, 0 replies; 54+ messages in thread
From: Valentin Schneider @ 2025-10-31 11:52 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: linux-kernel, linux-mm, rcu, x86, linux-arm-kernel, loongarch,
	linux-riscv, linux-arch, linux-trace-kernel, Josh Poimboeuf,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Yair Podemsky,
	Marcelo Tosatti, Daniel Wagner

On 30/10/25 11:25, Petr Tesarik wrote:
> On Fri, 10 Oct 2025 17:38:16 +0200
> Valentin Schneider <vschneid@redhat.com> wrote:
>
>> From: Josh Poimboeuf <jpoimboe@kernel.org>
>>
>> Deferring a code patching IPI is unsafe if the patched code is in a
>> noinstr region.  In that case the text poke code must trigger an
>> immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ
>> CPU running in userspace.
>>
>> If a noinstr static call only needs to be patched during boot, its key
>> can be made ro-after-init to ensure it will never be patched at runtime.
>>
>> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
>> ---
>>  include/linux/static_call.h | 16 ++++++++++++++++
>>  1 file changed, 16 insertions(+)
>>
>> diff --git a/include/linux/static_call.h b/include/linux/static_call.h
>> index 78a77a4ae0ea8..ea6ca57e2a829 100644
>> --- a/include/linux/static_call.h
>> +++ b/include/linux/static_call.h
>> @@ -192,6 +192,14 @@ extern long __static_call_return0(void);
>>      };								\
>>      ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
>>
>> +#define DEFINE_STATIC_CALL_RO(name, _func)				\
>> +	DECLARE_STATIC_CALL(name, _func);				\
>> +	struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
>> +		.func = _func,						\
>> +		.type = 1,						\
>> +	};								\
>> +	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
>> +
>>  #define DEFINE_STATIC_CALL_NULL(name, _func)				\
>>      DECLARE_STATIC_CALL(name, _func);				\
>>      struct static_call_key STATIC_CALL_KEY(name) = {		\
>> @@ -200,6 +208,14 @@ extern long __static_call_return0(void);
>>      };								\
>>      ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
>>
>> +#define DEFINE_STATIC_CALL_NULL_RO(name, _func)				\
>> +	DECLARE_STATIC_CALL(name, _func);				\
>> +	struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
>> +		.func = NULL,						\
>> +		.type = 1,						\
>> +	};								\
>> +	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
>> +
>
> I think it would be a good idea to add a comment describing when these
> macros are supposed to be used, similar to the explanation you wrote for
> the _NOINSTR variants. Just to give people adding a new static call in
> the future a clue, because the commit message may become a bit hard to
> find once there are a few cleanup patches on top.
>

I was about to write such a comment, but then I had another take: the
_NOINSTR static key helpers are special and only relevant to IPI deferral,
whereas the _RO helpers actually change the backing storage for the keys
and, as a bonus, are used by the IPI deferral instrumentation.

IMO it's the same here for the static calls: it makes sense to mark the
relevant ones as _RO regardless of IPI deferral.
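
To illustrate (a sketch with made-up names, not code from the series): a
static call that only ever gets retargeted during boot would use the _RO
variant, and any post-init update attempt would trip over the
write-protected key:

```
#include <linux/init.h>
#include <linux/static_call.h>
#include <linux/types.h>

static u64 my_default_clock(void)
{
	return 0;
}

static u64 my_better_clock(void)
{
	return 42;
}

/* The key lands in .data..ro_after_init instead of .data */
DEFINE_STATIC_CALL_RO(my_clock, my_default_clock);

static int __init my_clock_setup(void)
{
	/*
	 * Fine here: .data..ro_after_init is only write-protected once
	 * init completes. A static_call_update() after that point would
	 * be a bug (and exactly the runtime patching we want to rule out).
	 */
	static_call_update(my_clock, my_better_clock);
	return 0;
}
early_initcall(my_clock_setup);
```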

I could however add a comment to ANNOTATE_NOINSTR_ALLOWED() itself,
something like:

```
/*
 * This is used to tell objtool that a given static key is safe to use
 * within .noinstr code, so that it doesn't need to generate a warning
 * about it.
 *
 * For more information, see tools/objtool/Documentation/objtool.txt,
 * "non-RO static key usage in noinstr code"
 */
#define ANNOTATE_NOINSTR_ALLOWED(key) __ANNOTATE_NOINSTR_ALLOWED(key)
```
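
with the annotation then presumably sitting next to the key definition,
e.g. for patch 17 (guessing at the exact shape):

```
/* Sketch: cpu_buf_idle_clear is read from noinstr idle code, so tell
 * objtool its noinstr usage is deliberate.
 */
DEFINE_STATIC_KEY_FALSE(cpu_buf_idle_clear);
ANNOTATE_NOINSTR_ALLOWED(cpu_buf_idle_clear);
```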

> Just my two cents,
> Petr T



end of thread (newest: 2025-10-31 11:53 UTC)

Thread overview: 54+ messages
2025-10-10 15:38 [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 01/29] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 02/29] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 03/29] rcu: Add a small-width RCU watching counter debug option Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 04/29] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 05/29] jump_label: Add annotations for validating noinstr usage Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 06/29] static_call: Add read-only-after-init static calls Valentin Schneider
2025-10-30 10:25   ` Petr Tesarik
2025-10-31 11:52     ` Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 07/29] x86/paravirt: Mark pv_sched_clock static call as __ro_after_init Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 08/29] x86/idle: Mark x86_idle " Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 09/29] x86/paravirt: Mark pv_steal_clock " Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 10/29] riscv/paravirt: " Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 11/29] loongarch/paravirt: " Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 12/29] arm64/paravirt: " Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 13/29] arm/paravirt: " Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 14/29] perf/x86/amd: Mark perf_lopwr_cb " Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 15/29] sched/clock: Mark sched_clock_running key " Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 16/29] KVM: VMX: Mark __kvm_is_using_evmcs static " Valentin Schneider
2025-10-14  0:02   ` Sean Christopherson
2025-10-14 11:20     ` Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 17/29] x86/speculation/mds: Mark cpu_buf_idle_clear key as allowed in .noinstr Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 18/29] sched/clock, x86: Mark __sched_clock_stable " Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 19/29] KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys " Valentin Schneider
2025-10-14  0:01   ` Sean Christopherson
2025-10-14 11:02     ` Valentin Schneider
2025-10-14 19:06       ` Sean Christopherson
2025-10-10 15:38 ` [PATCH v6 20/29] stackleak: Mark stack_erasing_bypass key " Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 21/29] objtool: Add noinstr validation for static branches/calls Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 22/29] module: Add MOD_NOINSTR_TEXT mem_type Valentin Schneider
2025-10-10 15:38 ` [PATCH v6 23/29] context-tracking: Introduce work deferral infrastructure Valentin Schneider
2025-10-28 14:00   ` Frederic Weisbecker
2025-10-29 10:09     ` Valentin Schneider
2025-10-29 14:52       ` Frederic Weisbecker
2025-10-10 15:38 ` [PATCH v6 24/29] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
2025-10-28 14:49   ` Frederic Weisbecker
2025-10-10 15:38 ` [PATCH v6 25/29] x86/mm: Make INVPCID type macros available to assembly Valentin Schneider
2025-10-10 15:38 ` [RFC PATCH v6 26/29] x86/mm/pti: Introduce a kernel/user CR3 software signal Valentin Schneider
2025-10-10 15:38 ` [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3 Valentin Schneider
2025-10-28 15:59   ` Frederic Weisbecker
2025-10-29 10:16     ` Valentin Schneider
2025-10-29 10:31       ` Frederic Weisbecker
2025-10-29 14:13         ` Valentin Schneider
2025-10-29 14:49           ` Frederic Weisbecker
2025-10-31  9:55             ` Valentin Schneider
2025-10-10 15:38 ` [RFC PATCH v6 28/29] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y Valentin Schneider
2025-10-10 15:38 ` [RFC PATCH v6 29/29] x86/entry: Add an option to coalesce TLB flushes Valentin Schneider
2025-10-14 12:58 ` [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition Juri Lelli
2025-10-14 15:26   ` Valentin Schneider
2025-10-15 13:16     ` Valentin Schneider
2025-10-15 14:28       ` Juri Lelli
2025-10-28 16:25 ` Frederic Weisbecker
2025-10-29 10:32   ` Valentin Schneider
2025-10-29 17:15     ` Frederic Weisbecker
