linux-perf-users.vger.kernel.org archive mirror
* [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition
@ 2025-01-14 17:51 Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 01/30] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
                   ` (29 more replies)
  0 siblings, 30 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

Context
=======

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

  64359.052209596    NetworkManager       0    1405     smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
    smp_call_function_many_cond+0x1
    smp_call_function+0x39
    on_each_cpu+0x2a
    flush_tlb_kernel_range+0x7b
    __purge_vmap_area_lazy+0x70
    _vm_unmap_aliases.part.42+0xdf
    change_page_attr_set_clr+0x16a
    set_memory_ro+0x26
    bpf_int_jit_compile+0x2f9
    bpf_prog_select_runtime+0xc6
    bpf_prepare_filter+0x523
    sk_attach_filter+0x13
    sock_setsockopt+0x92c
    __sys_setsockopt+0x16a
    __x64_sys_setsockopt+0x20
    do_syscall_64+0x87
    entry_SYSCALL_64_after_hwframe+0x65

The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, provided it can be executed "early
enough" in the entry code.

The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.

Deferral approach
=================

Storing each and every callback, like a secondary call_single_queue, turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
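
To make this a bit more concrete, here is a minimal sketch of the mechanism.
The names and the per-CPU bitmask below are purely illustrative (the series
actually stashes the work bits in the context_tracking state and uses
different helpers); it is only meant to show the enqueue/flush split:

  /* Illustrative sketch - not the series' actual data structures. */

  /* One bit per IPI "type"; set remotely instead of sending the IPI. */
  enum ct_work {
          CT_WORK_SYNC_CORE,      /* deferred text_poke() sync_core() */
          CT_WORK_TLB_FLUSH,      /* deferred kernel TLB invalidation */
          CT_WORK_MAX
  };

  static DEFINE_PER_CPU(unsigned long, ct_work_pending);

  /* Housekeeping CPU: defer the IPI by setting the matching work bit. */
  static void defer_ipi(int cpu, enum ct_work work)
  {
          set_bit(work, per_cpu_ptr(&ct_work_pending, cpu));
  }

  /* Isolated CPU: flush pending work once per user->kernel transition. */
  static void ct_work_flush(void)
  {
          unsigned long *pending = this_cpu_ptr(&ct_work_pending);

          if (test_and_clear_bit(CT_WORK_SYNC_CORE, pending))
                  sync_core();
          if (test_and_clear_bit(CT_WORK_TLB_FLUSH, pending))
                  __flush_tlb_all();
  }

Setting a bit is idempotent, so any number of deferred IPIs of the same type
collapse into a single callback execution on the next kernel entry - no
queueing, no memory to leak.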

What about IPIs whose callbacks take a parameter, you may ask?

Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.

This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
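
As a purely illustrative sketch (reusing the hypothetical defer_ipi() helper
from above; housekeeping_mask and isolated_mask are assumed cpumasks): the
ranged kernel TLB flush from the trace above can sidestep its argument
entirely by over-approximating it into a full flush for the isolated CPUs, so
the deferred operation itself remains argument-free - which is roughly the
approach the later flush_tlb_kernel_range() patch takes for NOHZ_FULL CPUs:

  /* Sketch only: hypothetical masks and defer_ipi() helper. */
  void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
  {
          struct flush_tlb_info info = { .start = start, .end = end };
          int cpu;

          /* Housekeeping CPUs get the precise ranged flush via IPI, right now. */
          on_each_cpu_mask(housekeeping_mask, do_kernel_range_flush, &info, 1);

          /* Isolated CPUs get a full flush, deferred to their next kernel entry. */
          for_each_cpu(cpu, isolated_mask)
                  defer_ipi(cpu, CT_WORK_TLB_FLUSH);
  }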

Kernel entry vs execution of the deferred operation
===================================================

This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].

There is a non-zero amount of code that is executed upon kernel entry before the
deferred operation can itself be executed (i.e. before we get into
context_tracking.c proper):

  idtentry_func_foo()                <--- we're in the kernel
    irqentry_enter()
      enter_from_user_mode()
	__ct_user_exit()
	    ct_kernel_enter_state()
	      ct_work_flush()        <--- deferred operation is executed here

This means one must take extra care about what can happen in the early entry
code, and ensure that <bad things> cannot happen. For instance, we really don't
want to hit instructions that have been modified by a remote text_poke() while
we're on our way to execute a deferred sync_core(). The patches doing the
actual deferral have more detail on this.

Patches
=======

o Patches 1-2 are standalone objtool cleanups.

o Patches 3-4 add an RCU testing feature.

o Patches 5-6 add infrastructure for annotating static keys and static calls
  that may be used in noinstr code (courtesy of Josh).
o Patches 7-19 use said annotations on relevant keys / calls.
o Patch 20 enforces proper usage of said annotations (courtesy of Josh).

o Patches 21-23 fiddle with CT_STATE* within context tracking.

o Patches 24-29 add the actual IPI deferral faff.

o Patch 30 adds a freebie: deferring IPIs for NOHZ_IDLE. Not tested that much!
  If you care about battery-powered devices and energy consumption, go give it
  a try!

Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v4

Stuff I'd like eyes and neurons on
==================================

Context-tracking vs idle. Patch 22 "improves" the situation by adding an
IDLE->KERNEL transition when getting an IRQ while idle, but it leaves the
following window:

  ~> IRQ
  ct_nmi_enter()
    state = state + CT_STATE_KERNEL - CT_STATE_IDLE

  [...]

  ct_nmi_exit()
    state = state - CT_STATE_KERNEL + CT_STATE_IDLE

  [...] /!\ CT_STATE_IDLE here while we're really in kernelspace! /!\

  ct_cpuidle_exit()
    state = state + CT_STATE_KERNEL - CT_STATE_IDLE

Said window is contained within cpu_idle_poll() and the cpuidle call within
cpuidle_enter_state(), both being noinstr (the former is __cpuidle, which is
itself noinstr). Thus objtool will consider it as early entry and will warn
accordingly about any static key / call misuse, so the damage is somewhat
contained, but it's not ideal.

I tried fiddling with this but idle polling likes being annoying, as it is
shaped like so:

	ct_cpuidle_enter();

	raw_local_irq_enable();
	while (!tif_need_resched() &&
	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
		cpu_relax();
	raw_local_irq_disable();

	ct_cpuidle_exit();

IOW, getting an IRQ that doesn't end up setting NEED_RESCHED while idle-polling
doesn't come near ct_cpuidle_exit(), which prevents me from having the outermost
ct_nmi_exit() leave the state as CT_STATE_KERNEL (rather than CT_STATE_IDLE).

Testing
=======

Xeon E5-2699 system with SMT off, NOHZ_FULL, isolated CPUs.
RHEL9 userspace.

Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:

$ trace-cmd record -e "csd_queue_cpu"    -f "cpu & CPUS{$ISOL_CPUS}" \
                   -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
                   -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
                   rteval --onlyload --loads-cpulist=$HK_CPUS \
                          --hackbench-runlowmem=True --duration=$DURATION

This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 6 hours.

v6.13-rc6
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
    531 callback=generic_smp_call_function_single_interrupt+0x0

# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
  12818 func=do_flush_tlb_all
    910 func=do_kernel_range_flush
     78 func=do_sync_core

v6.13-rc6 + patches:
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
# Zilch!

# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
# Nada!

Note that tlb_remove_table_smp_sync() showed up during testing of v3, and has
gone as mysteriously as it showed up. Yair had a series addressing this [5]
which would be worth revisiting.

Acknowledgements
================

Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o All of the folks who attended various (too many?) talks about this and
  provided precious feedback.  

Links
=====

[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/

Revisions
=========

RFCv3 -> v4
++++++++++++++

o Rebased onto v6.13-rc6

o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)

o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups

o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ

RFCv2 -> RFCv3
++++++++++++++

o Rebased onto v6.12-rc6

o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral


RFCv1 -> RFCv2
++++++++++++++

o Rebased onto v6.5-rc1

o Updated the trace filter patches (Steven)

o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
  
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
  rcutorture case for a low-size counter (Paul) 

o Fixed flush_tlb_kernel_range_deferrable() definition

Josh Poimboeuf (3):
  jump_label: Add annotations for validating noinstr usage
  static_call: Add read-only-after-init static calls
  objtool: Add noinstr validation for static branches/calls

Peter Zijlstra (1):
  x86,tlb: Make __flush_tlb_global() noinstr-compliant

Valentin Schneider (26):
  objtool: Make validate_call() recognize indirect calls to pv_ops[]
  objtool: Flesh out warning related to pv_ops[] calls
  rcu: Add a small-width RCU watching counter debug option
  rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
  x86/idle: Mark x86_idle static call as __ro_after_init
  x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
  riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
  loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
  perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
  sched/clock: Mark sched_clock_running key as __ro_after_init
  x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
  sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
  x86/kvm/vmx: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as
    allowed in .noinstr
  stackleak: Mark stack_erasing_bypass key as allowed in .noinstr
  context_tracking: Explicitly use CT_STATE_KERNEL where it is missing
  context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry
  context_tracking: Turn CT_STATE_* into bits
  context-tracking: Introduce work deferral infrastructure
  context_tracking,x86: Defer kernel text patching IPIs
  x86/tlb: Make __flush_tlb_local() noinstr-compliant
  x86/tlb: Make __flush_tlb_all() noinstr
  x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL
    CPUs
  context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE

 arch/Kconfig                                  |   9 ++
 arch/arm/kernel/paravirt.c                    |   2 +-
 arch/arm64/kernel/paravirt.c                  |   2 +-
 arch/loongarch/kernel/paravirt.c              |   2 +-
 arch/riscv/kernel/paravirt.c                  |   2 +-
 arch/x86/Kconfig                              |   1 +
 arch/x86/events/amd/brs.c                     |   2 +-
 arch/x86/include/asm/context_tracking_work.h  |  22 ++++
 arch/x86/include/asm/invpcid.h                |  13 +--
 arch/x86/include/asm/paravirt.h               |   4 +-
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/include/asm/tlbflush.h               |   3 +-
 arch/x86/include/asm/xen/hypercall.h          |  11 +-
 arch/x86/kernel/alternative.c                 |  38 ++++++-
 arch/x86/kernel/cpu/bugs.c                    |   9 +-
 arch/x86/kernel/kprobes/core.c                |   4 +-
 arch/x86/kernel/kprobes/opt.c                 |   4 +-
 arch/x86/kernel/module.c                      |   2 +-
 arch/x86/kernel/paravirt.c                    |   4 +-
 arch/x86/kernel/process.c                     |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  11 +-
 arch/x86/mm/tlb.c                             |  46 ++++++--
 arch/x86/xen/mmu_pv.c                         |  10 +-
 arch/x86/xen/xen-ops.h                        |  12 +-
 include/asm-generic/sections.h                |  15 +++
 include/linux/context_tracking.h              |  21 ++++
 include/linux/context_tracking_state.h        |  64 +++++++++--
 include/linux/context_tracking_work.h         |  28 +++++
 include/linux/jump_label.h                    |  30 ++++-
 include/linux/objtool.h                       |   7 ++
 include/linux/static_call.h                   |  19 ++++
 kernel/context_tracking.c                     |  98 ++++++++++++++--
 kernel/rcu/Kconfig.debug                      |  15 +++
 kernel/sched/clock.c                          |   7 +-
 kernel/stackleak.c                            |   6 +-
 kernel/time/Kconfig                           |  19 ++++
 mm/vmalloc.c                                  |  35 +++++-
 tools/objtool/Documentation/objtool.txt       |  34 ++++++
 tools/objtool/check.c                         | 106 +++++++++++++++---
 tools/objtool/include/objtool/check.h         |   1 +
 tools/objtool/include/objtool/elf.h           |   1 +
 tools/objtool/include/objtool/special.h       |   1 +
 tools/objtool/special.c                       |  18 ++-
 .../selftests/rcutorture/configs/rcu/TREE04   |   1 +
 44 files changed, 635 insertions(+), 107 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

--
2.43.0


* [PATCH v4 01/30] objtool: Make validate_call() recognize indirect calls to pv_ops[]
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 02/30] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
                   ` (28 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Josh Poimboeuf, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

call_dest_name() does not get passed the file pointer of validate_call(),
which means its invocation of insn_reloc() will always return NULL. Make it
take a file pointer.

While at it, make sure call_dest_name() uses arch_dest_reloc_offset(),
otherwise it gets the pv_ops[] offset wrong.

Fabricating an intentional warning shows the change; previously:

  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section

now:

  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[1]() leaves .noinstr.text section

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 tools/objtool/check.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 76060da755b5c..b35763f05a548 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3448,7 +3448,7 @@ static inline bool func_uaccess_safe(struct symbol *func)
 	return false;
 }
 
-static inline const char *call_dest_name(struct instruction *insn)
+static inline const char *call_dest_name(struct objtool_file *file, struct instruction *insn)
 {
 	static char pvname[19];
 	struct reloc *reloc;
@@ -3457,9 +3457,9 @@ static inline const char *call_dest_name(struct instruction *insn)
 	if (insn_call_dest(insn))
 		return insn_call_dest(insn)->name;
 
-	reloc = insn_reloc(NULL, insn);
+	reloc = insn_reloc(file, insn);
 	if (reloc && !strcmp(reloc->sym->name, "pv_ops")) {
-		idx = (reloc_addend(reloc) / sizeof(void *));
+		idx = (arch_dest_reloc_offset(reloc_addend(reloc)) / sizeof(void *));
 		snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx);
 		return pvname;
 	}
@@ -3538,17 +3538,19 @@ static int validate_call(struct objtool_file *file,
 {
 	if (state->noinstr && state->instr <= 0 &&
 	    !noinstr_call_dest(file, insn, insn_call_dest(insn))) {
-		WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn));
 		return 1;
 	}
 
 	if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) {
-		WARN_INSN(insn, "call to %s() with UACCESS enabled", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() with UACCESS enabled",
+			  call_dest_name(file, insn));
 		return 1;
 	}
 
 	if (state->df) {
-		WARN_INSN(insn, "call to %s() with DF set", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() with DF set",
+			  call_dest_name(file, insn));
 		return 1;
 	}
 
-- 
2.43.0


* [PATCH v4 02/30] objtool: Flesh out warning related to pv_ops[] calls
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 01/30] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 03/30] rcu: Add a small-width RCU watching counter debug option Valentin Schneider
                   ` (27 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Josh Poimboeuf, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

I had to look into objtool itself to understand what this warning was
about; make it more explicit.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 tools/objtool/check.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index b35763f05a548..6aa9259fc9940 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3486,7 +3486,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
 
 	list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
 		if (!target->sec->noinstr) {
-			WARN("pv_ops[%d]: %s", idx, target->name);
+			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
 			file->pv_ops[idx].clean = false;
 		}
 	}
-- 
2.43.0


* [PATCH v4 03/30] rcu: Add a small-width RCU watching counter debug option
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 01/30] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 02/30] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-21 13:56   ` Frederic Weisbecker
  2025-01-14 17:51 ` [PATCH v4 04/30] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE Valentin Schneider
                   ` (26 subsequent siblings)
  29 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Paul E . McKenney, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Josh Poimboeuf, Pawan Gupta, Sean Christopherson, Paolo Bonzini,
	Andy Lutomirski, Arnd Bergmann, Frederic Weisbecker, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

A later commit will reduce the size of the RCU watching counter to free up
some bits for another purpose. Paul suggested adding a config option to
test the extreme case where the counter is reduced to its minimum usable
width for rcutorture to poke at, so do that.

Make it only configurable under RCU_EXPERT. While at it, add a comment to
explain the layout of context_tracking->state.

Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
---
 include/linux/context_tracking_state.h | 44 ++++++++++++++++++++++----
 kernel/rcu/Kconfig.debug               | 15 +++++++++
 2 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 7b8433d5a8efe..0b81248aa03e2 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -18,12 +18,6 @@ enum ctx_state {
 	CT_STATE_MAX		= 4,
 };
 
-/* Odd value for watching, else even. */
-#define CT_RCU_WATCHING CT_STATE_MAX
-
-#define CT_STATE_MASK (CT_STATE_MAX - 1)
-#define CT_RCU_WATCHING_MASK (~CT_STATE_MASK)
-
 struct context_tracking {
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 	/*
@@ -44,9 +38,45 @@ struct context_tracking {
 #endif
 };
 
+/*
+ * We cram two different things within the same atomic variable:
+ *
+ *                     CT_RCU_WATCHING_START  CT_STATE_START
+ *                                |                |
+ *                                v                v
+ *     MSB [ RCU watching counter ][ context_state ] LSB
+ *         ^                       ^
+ *         |                       |
+ * CT_RCU_WATCHING_END        CT_STATE_END
+ *
+ * Bits are used from the LSB upwards, so unused bits (if any) will always be in
+ * upper bits of the variable.
+ */
 #ifdef CONFIG_CONTEXT_TRACKING
+#define CT_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
+
+#define CT_STATE_WIDTH bits_per(CT_STATE_MAX - 1)
+#define CT_STATE_START 0
+#define CT_STATE_END   (CT_STATE_START + CT_STATE_WIDTH - 1)
+
+#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH)
+#define CT_RCU_WATCHING_WIDTH     (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH)
+#define CT_RCU_WATCHING_START     (CT_STATE_END + 1)
+#define CT_RCU_WATCHING_END       (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1)
+#define CT_RCU_WATCHING           BIT(CT_RCU_WATCHING_START)
+
+#define CT_STATE_MASK        GENMASK(CT_STATE_END,        CT_STATE_START)
+#define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
+
+#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
+
+static_assert(CT_STATE_WIDTH        +
+	      CT_RCU_WATCHING_WIDTH +
+	      CT_UNUSED_WIDTH       ==
+	      CT_SIZE);
+
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
-#endif
+#endif	/* CONFIG_CONTEXT_TRACKING */
 
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 static __always_inline int __ct_state(void)
diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 9b0b52e1836fa..ea36953803a1e 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -168,4 +168,19 @@ config RCU_STRICT_GRACE_PERIOD
 	  when looking for certain types of RCU usage bugs, for example,
 	  too-short RCU read-side critical sections.
 
+
+config RCU_DYNTICKS_TORTURE
+	bool "Minimize RCU dynticks counter size"
+	depends on RCU_EXPERT && !COMPILE_TEST
+	default n
+	help
+	  This option sets the width of the dynticks counter to its
+	  minimum usable value.  This minimum width greatly increases
+	  the probability of flushing out bugs involving counter wrap,
+	  but it also increases the probability of extending grace period
+	  durations.  This Kconfig option should therefore be avoided in
+	  production due to the consequent increased probability of OOMs.
+
+	  This has no value for production and is only for testing.
+
 endmenu # "RCU Debugging"
-- 
2.43.0
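
For illustration only (not part of the patch): with CONFIG_RCU_DYNTICKS_TORTURE
enabled, and assuming ->state is a 32-bit atomic_t, the macros above work out to:

  /*
   * Worked example, assuming a 32-bit ->state:
   *
   * CT_SIZE                  = 32
   * CT_STATE_WIDTH           = bits_per(CT_STATE_MAX - 1) = bits_per(3) = 2
   * CT_STATE_MASK            = GENMASK(1, 0)  = 0x3
   * CT_RCU_WATCHING_WIDTH    = 2  (instead of the usual 30)
   * CT_RCU_WATCHING          = BIT(2)         = 0x4
   * CT_RCU_WATCHING_MASK     = GENMASK(3, 2)  = 0xc
   * CT_UNUSED_WIDTH          = 30 - 2         = 28
   *
   * i.e. the RCU watching counter is squeezed into bits 2-3 and wraps after
   * only four increments, which is exactly the sort of thing rcutorture
   * wants to hammer on.
   */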


* [PATCH v4 04/30] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (2 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 03/30] rcu: Add a small-width RCU watching counter debug option Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-21 14:00   ` Frederic Weisbecker
  2025-01-14 17:51 ` [PATCH v4 05/30] jump_label: Add annotations for validating noinstr usage Valentin Schneider
                   ` (25 subsequent siblings)
  29 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Paul E . McKenney, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Josh Poimboeuf, Pawan Gupta, Sean Christopherson, Paolo Bonzini,
	Andy Lutomirski, Arnd Bergmann, Frederic Weisbecker, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

We now have an RCU_EXPERT config for testing a small-sized RCU dynticks
counter: CONFIG_RCU_DYNTICKS_TORTURE.

Modify scenario TREE04 to use this config in order to test a ridiculously
small counter (2 bits).

Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
---
 tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3ad..67caf4276bb01 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
 CONFIG_RCU_EQS_DEBUG=y
 CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_TORTURE=y
-- 
2.43.0


* [PATCH v4 05/30] jump_label: Add annotations for validating noinstr usage
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (3 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 04/30] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 06/30] static_call: Add read-only-after-init static calls Valentin Schneider
                   ` (24 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Josh Poimboeuf, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

From: Josh Poimboeuf <jpoimboe@kernel.org>

Deferring a code patching IPI is unsafe if the patched code is in a
noinstr region.  In that case the text poke code must trigger an
immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ
CPU running in userspace.

Some noinstr static branches may really need to be patched at runtime,
despite the resulting disruption.  Add DEFINE_STATIC_KEY_*_NOINSTR()
variants for those.  They don't do anything special yet; that will come
later.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/jump_label.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index f5a2727ca4a9a..88bb6e32fdcbc 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -385,6 +385,23 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_FALSE_RO(name)	\
 	struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT
 
+/*
+ * The _NOINSTR variants are used to tell objtool the static key is allowed to
+ * be used in noinstr code.
+ *
+ * They should almost never be used, as they prevent code patching IPIs from
+ * being deferred, which can be problematic for isolated NOHZ_FULL CPUs running
+ * in pure userspace.
+ *
+ * If using one of these _NOINSTR variants, please add a comment above the
+ * definition with the rationale.
+ */
+#define DEFINE_STATIC_KEY_TRUE_NOINSTR(name)					\
+	DEFINE_STATIC_KEY_TRUE(name)
+
+#define DEFINE_STATIC_KEY_FALSE_NOINSTR(name)					\
+	DEFINE_STATIC_KEY_FALSE(name)
+
 #define DECLARE_STATIC_KEY_FALSE(name)	\
 	extern struct static_key_false name
 
-- 
2.43.0
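
For illustration only (not from the patch): a key that genuinely has to remain
runtime-patchable while being read from noinstr code would be declared with
the new variant, with a comment giving the rationale. The key and helper names
below are made up:

  /* Flipped at runtime on late microcode load; read in noinstr entry code. */
  DEFINE_STATIC_KEY_FALSE_NOINSTR(example_mitigation_key);

  static noinstr void example_entry_helper(void)
  {
          if (static_branch_unlikely(&example_mitigation_key))
                  example_do_mitigation();
  }

Keys that are only ever written during __init can instead use the existing
DEFINE_STATIC_KEY_*_RO() variants, so they never need runtime patching at all.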


* [PATCH v4 06/30] static_call: Add read-only-after-init static calls
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (4 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 05/30] jump_label: Add annotations for validating noinstr usage Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 07/30] x86/paravirt: Mark pv_sched_clock static call as __ro_after_init Valentin Schneider
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Josh Poimboeuf, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

From: Josh Poimboeuf <jpoimboe@kernel.org>

Deferring a code patching IPI is unsafe if the patched code is in a
noinstr region.  In that case the text poke code must trigger an
immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ
CPU running in userspace.

If a noinstr static call only needs to be patched during boot, its key
can be made ro-after-init to ensure it will never be patched at runtime.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/static_call.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 78a77a4ae0ea8..ea6ca57e2a829 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -192,6 +192,14 @@ extern long __static_call_return0(void);
 	};								\
 	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
 
+#define DEFINE_STATIC_CALL_RO(name, _func)				\
+	DECLARE_STATIC_CALL(name, _func);				\
+	struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
+		.func = _func,						\
+		.type = 1,						\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
+
 #define DEFINE_STATIC_CALL_NULL(name, _func)				\
 	DECLARE_STATIC_CALL(name, _func);				\
 	struct static_call_key STATIC_CALL_KEY(name) = {		\
@@ -200,6 +208,14 @@ extern long __static_call_return0(void);
 	};								\
 	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
 
+#define DEFINE_STATIC_CALL_NULL_RO(name, _func)				\
+	DECLARE_STATIC_CALL(name, _func);				\
+	struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
+		.func = NULL,						\
+		.type = 1,						\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
+
 #define DEFINE_STATIC_CALL_RET0(name, _func)				\
 	DECLARE_STATIC_CALL(name, _func);				\
 	struct static_call_key STATIC_CALL_KEY(name) = {		\
-- 
2.43.0


* [PATCH v4 07/30] x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (5 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 06/30] static_call: Add read-only-after-init static calls Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 08/30] x86/idle: Mark x86_idle " Valentin Schneider
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Josh Poimboeuf, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

Later commits will cause objtool to warn about static calls being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

pv_sched_clock is updated in:
o __init vmware_paravirt_ops_setup()
o __init xen_init_time_common()
o kvm_sched_clock_init() <- __init kvmclock_init()
o hv_setup_sched_clock() <- __init hv_init_tsc_clocksource()

IOW purely init context, and can thus be marked as __ro_after_init.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec3815335558..ae6675167877b 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -73,7 +73,7 @@ static u64 native_steal_clock(int cpu)
 }
 
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
-DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
+DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
 
 void paravirt_set_sched_clock(u64 (*func)(void))
 {
-- 
2.43.0


* [PATCH v4 08/30] x86/idle: Mark x86_idle static call as __ro_after_init
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (6 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 07/30] x86/paravirt: Mark pv_sched_clock static call as __ro_after_init Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 09/30] x86/paravirt: Mark pv_steal_clock " Valentin Schneider
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Josh Poimboeuf, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

Later commits will cause objtool to warn about static calls being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

x86_idle is updated in:
o xen_set_default_idle() <- __init xen_arch_setup()
o __init select_idle_routine()

IOW purely init context, and can thus be marked as __ro_after_init.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/process.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f63f8fd00a91f..85cd62f61d633 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -746,7 +746,7 @@ void __cpuidle default_idle(void)
 EXPORT_SYMBOL(default_idle);
 #endif
 
-DEFINE_STATIC_CALL_NULL(x86_idle, default_idle);
+DEFINE_STATIC_CALL_NULL_RO(x86_idle, default_idle);
 
 static bool x86_idle_set(void)
 {
-- 
2.43.0


* [PATCH v4 09/30] x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (7 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 08/30] x86/idle: Mark x86_idle " Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 10/30] riscv/paravirt: " Valentin Schneider
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

The static call is only ever updated in

  __init pv_time_init()
  __init xen_init_time_common()
  __init vmware_paravirt_ops_setup()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ae6675167877b..a19b0002b18d8 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -72,7 +72,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
 
 void paravirt_set_sched_clock(u64 (*func)(void))
-- 
2.43.0


* [PATCH v4 10/30] riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (8 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 09/30] x86/paravirt: Mark pv_steal_clock " Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 18:29   ` Andrew Jones
  2025-01-14 17:51 ` [PATCH v4 11/30] loongarch/paravirt: " Valentin Schneider
                   ` (19 subsequent siblings)
  29 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

The static call is only ever updated in:

  __init pv_time_init()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/riscv/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/kernel/paravirt.c b/arch/riscv/kernel/paravirt.c
index fa6b0339a65de..dfe8808016fd8 100644
--- a/arch/riscv/kernel/paravirt.c
+++ b/arch/riscv/kernel/paravirt.c
@@ -30,7 +30,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 static bool steal_acc = true;
 static int __init parse_no_stealacc(char *arg)
-- 
2.43.0


* [PATCH v4 11/30] loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (9 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 10/30] riscv/paravirt: " Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 12/30] arm64/paravirt: " Valentin Schneider
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

The static call is only ever updated in

  __init pv_time_init()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/loongarch/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c
index e5a39bbad0780..b011578d3e931 100644
--- a/arch/loongarch/kernel/paravirt.c
+++ b/arch/loongarch/kernel/paravirt.c
@@ -20,7 +20,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 static bool steal_acc = true;
 
-- 
2.43.0


* [PATCH v4 12/30] arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (10 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 11/30] loongarch/paravirt: " Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 13/30] arm/paravirt: " Valentin Schneider
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

The static call is only ever updated in

  __init pv_time_init()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/arm64/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c
index aa718d6a9274a..ad28fa23c9228 100644
--- a/arch/arm64/kernel/paravirt.c
+++ b/arch/arm64/kernel/paravirt.c
@@ -32,7 +32,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 struct pv_time_stolen_time_region {
 	struct pvclock_vcpu_stolen_time __rcu *kaddr;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 13/30] arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (11 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 12/30] arm64/paravirt: " Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 14/30] perf/x86/amd: Mark perf_lopwr_cb " Valentin Schneider
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

The static call is only ever updated in

  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/arm/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kernel/paravirt.c b/arch/arm/kernel/paravirt.c
index 7dd9806369fb0..632d8d5e06db3 100644
--- a/arch/arm/kernel/paravirt.c
+++ b/arch/arm/kernel/paravirt.c
@@ -20,4 +20,4 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 14/30] perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (12 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 13/30] arm/paravirt: " Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 15/30] sched/clock: Mark sched_clock_running key " Valentin Schneider
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Josh Poimboeuf, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

Later commits will cause objtool to warn about static calls being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

perf_lopwr_cb is used in .noinstr code, but is only ever updated in __init
amd_brs_lopwr_init(), and can thus be marked as __ro_after_init.
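
For context, the sort of .noinstr call site involved looks roughly like the
following (a sketch with a hypothetical caller; the real ones live in the x86
low-power/idle paths):

  static noinstr void example_lopwr_enter(void)	/* hypothetical caller */
  {
	/* The call target must stay stable: no runtime patching expected */
	static_call_cond(perf_lopwr_cb)(true);
  }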

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/events/amd/brs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
index 780acd3dff22a..e2ff03af15d82 100644
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -422,7 +422,7 @@ void noinstr perf_amd_brs_lopwr_cb(bool lopwr_in)
 	}
 }
 
-DEFINE_STATIC_CALL_NULL(perf_lopwr_cb, perf_amd_brs_lopwr_cb);
+DEFINE_STATIC_CALL_NULL_RO(perf_lopwr_cb, perf_amd_brs_lopwr_cb);
 EXPORT_STATIC_CALL_TRAMP_GPL(perf_lopwr_cb);
 
 void __init amd_brs_lopwr_init(void)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 15/30] sched/clock: Mark sched_clock_running key as __ro_after_init
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (13 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 14/30] perf/x86/amd: Mark perf_lopwr_cb " Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 16/30] x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr Valentin Schneider
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

sched_clock_running is only ever enabled in the __init functions
sched_clock_init() and sched_clock_init_late(), and is never disabled. Mark
it __ro_after_init.
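
The read side is unaffected; as a sketch (hypothetical caller within
kernel/sched/clock.c), consumers simply test the key, which after this change
can only ever be flipped during init:

  static u64 example_clock_read(void)	/* hypothetical */
  {
	if (!static_branch_likely(&sched_clock_running))
		return 0ull;
	return sched_clock();
  }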

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/clock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index a09655b481402..200e5568b9894 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -66,7 +66,7 @@ notrace unsigned long long __weak sched_clock(void)
 }
 EXPORT_SYMBOL_GPL(sched_clock);
 
-static DEFINE_STATIC_KEY_FALSE(sched_clock_running);
+static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 16/30] x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (14 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 15/30] sched/clock: Mark sched_clock_running key " Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 17/30] sched/clock, x86: Mark __sched_clock_stable " Valentin Schneider
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

mds_idle_clear is used in .noinstr code, and can be modified at
runtime (SMT hotplug). Suppressing the text_poke_sync() IPI has little
benefit for this key, as hotplug implies eventually going through
takedown_cpu() -> stop_machine_cpuslocked(), which is going to cause
interference on all online CPUs anyway.

Mark it to let objtool know not to warn about it.
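
For reference, the runtime updates in question are of this shape (a simplified
sketch of the SMT hotplug update path, not a verbatim copy):

  static void example_update_mds_idle(void)	/* simplified sketch */
  {
	if (sched_smt_active())
		static_branch_enable(&mds_idle_clear);	/* text patching IPI */
	else
		static_branch_disable(&mds_idle_clear);	/* text patching IPI */
  }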

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/cpu/bugs.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 47a01d4028f60..acad84dcfc3cd 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -113,8 +113,13 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
 /* Control unconditional IBPB in switch_mm() */
 DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
 
-/* Control MDS CPU buffer clear before idling (halt, mwait) */
-DEFINE_STATIC_KEY_FALSE(mds_idle_clear);
+/*
+ * Control MDS CPU buffer clear before idling (halt, mwait)
+ *
+ * Allowed in .noinstr as this key is updated during hotplug which comes with
+ * more interference than just the text patching IPI.
+ */
+DEFINE_STATIC_KEY_FALSE_NOINSTR(mds_idle_clear);
 EXPORT_SYMBOL_GPL(mds_idle_clear);
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 17/30] sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (15 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 16/30] x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 18/30] x86/kvm/vmx: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys " Valentin Schneider
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

__sched_clock_stable is used in .noinstr code, and can be modified at
runtime (e.g. time_cpufreq_notifier()). Suppressing the text_poke_sync()
IPI has little benefit for this key, as NOHZ_FULL is incompatible with an
unstable TSC anyway.

Mark it to let objtool know not to warn about it.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/clock.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 200e5568b9894..e59986bc14a43 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -75,8 +75,11 @@ static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
  *
  * Similarly we start with __sched_clock_stable_early, thereby assuming we
  * will become stable, such that there's only a single 1 -> 0 transition.
+ *
+ * Allowed in .noinstr as an unstable TSC is incompatible with NOHZ_FULL,
+ * thus the text patching IPI would be the least of our concerns.
  */
-static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(__sched_clock_stable);
 static int __sched_clock_stable_early = 1;
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 18/30] x86/kvm/vmx: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as allowed in .noinstr
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (16 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 17/30] sched/clock, x86: Mark __sched_clock_stable " Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 21:19   ` Sean Christopherson
  2025-01-14 17:51 ` [PATCH v4 19/30] stackleak: Mark stack_erasing_bypass key " Valentin Schneider
                   ` (11 subsequent siblings)
  29 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Josh Poimboeuf, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

These keys are used in .noinstr code, and can be modified at runtime
(a kvm_intel module parameter write). However, it is not expected that they
will be flipped during latency-sensitive operations, and thus shouldn't be a
source of interference wrt the text patching IPI.

Mark them to let objtool know not to warn about them.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 893366e537322..a028c38f44e02 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -225,8 +225,15 @@ module_param(pt_mode, int, S_IRUGO);
 
 struct x86_pmu_lbr __ro_after_init vmx_lbr_caps;
 
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
+/*
+ * Both of these static keys end up being used in .noinstr sections, however
+ * they are only modified:
+ * - at init
+ * - from a kvm_intel module parameter write
+ * thus during latency-sensitive operations they should remain stable.
+ */
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_should_flush);
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_flush_cond);
 static DEFINE_MUTEX(vmx_l1d_flush_mutex);
 
 /* Storage for pre module init parameter parsing */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 19/30] stackleak: Mark stack_erasing_bypass key as allowed in .noinstr
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (17 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 18/30] x86/kvm/vmx: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys " Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 20/30] objtool: Add noinstr validation for static branches/calls Valentin Schneider
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Josh Poimboeuf, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

stack_erasing_bypass is used in .noinstr code, and can be modified at runtime
(/proc/sys/kernel/stack_erasing write). However, it is not expected that it
will be flipped during latency-sensitive operations, and thus shouldn't be
a source of interference wrt the text patching IPI.

Mark it to let objtool know not to warn about it.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/stackleak.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/stackleak.c b/kernel/stackleak.c
index 39fd620a7db6f..a4f07fbc13f61 100644
--- a/kernel/stackleak.c
+++ b/kernel/stackleak.c
@@ -18,7 +18,11 @@
 #include <linux/sysctl.h>
 #include <linux/init.h>
 
-static DEFINE_STATIC_KEY_FALSE(stack_erasing_bypass);
+/*
+ * This static key can only be modified via its sysctl interface. It is
+ * expected it will remain stable during latency-sensitive operations.
+ */
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(stack_erasing_bypass);
 
 #ifdef CONFIG_SYSCTL
 static int stack_erasing_sysctl(const struct ctl_table *table, int write,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 20/30] objtool: Add noinstr validation for static branches/calls
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (18 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 19/30] stackleak: Mark stack_erasing_bypass key " Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 21/30] context_tracking: Explicitly use CT_STATE_KERNEL where it is missing Valentin Schneider
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Josh Poimboeuf, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

From: Josh Poimboeuf <jpoimboe@kernel.org>

Warn about static branches/calls in noinstr regions, unless the
corresponding key is RO-after-init or has been manually whitelisted with
DEFINE_STATIC_KEY_*_NOINSTR().
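
Whitelisting then looks like the following (sketch with a hypothetical key;
the rationale comment above the definition is the important part):

  /*
   * Modified at runtime, but only via a rare sysctl write; eating the text
   * patching IPI is an acceptable trade-off for this key.
   */
  static DEFINE_STATIC_KEY_FALSE_NOINSTR(example_bypass_key);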

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
[Added NULL check for insn_call_dest() return value]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/jump_label.h              | 17 +++--
 include/linux/objtool.h                 |  7 ++
 include/linux/static_call.h             |  3 +
 tools/objtool/Documentation/objtool.txt | 34 +++++++++
 tools/objtool/check.c                   | 92 ++++++++++++++++++++++---
 tools/objtool/include/objtool/check.h   |  1 +
 tools/objtool/include/objtool/elf.h     |  1 +
 tools/objtool/include/objtool/special.h |  1 +
 tools/objtool/special.c                 | 18 ++++-
 9 files changed, 156 insertions(+), 18 deletions(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 88bb6e32fdcbc..dc8a82a62c109 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -75,6 +75,7 @@
 
 #include <linux/types.h>
 #include <linux/compiler.h>
+#include <linux/objtool.h>
 
 extern bool static_key_initialized;
 
@@ -373,8 +374,9 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_TRUE(name)	\
 	struct static_key_true name = STATIC_KEY_TRUE_INIT
 
-#define DEFINE_STATIC_KEY_TRUE_RO(name)	\
-	struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT
+#define DEFINE_STATIC_KEY_TRUE_RO(name)						\
+	struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT;	\
+	ANNOTATE_NOINSTR_ALLOWED(name)
 
 #define DECLARE_STATIC_KEY_TRUE(name)	\
 	extern struct static_key_true name
@@ -382,8 +384,9 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_FALSE(name)	\
 	struct static_key_false name = STATIC_KEY_FALSE_INIT
 
-#define DEFINE_STATIC_KEY_FALSE_RO(name)	\
-	struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT
+#define DEFINE_STATIC_KEY_FALSE_RO(name)					\
+	struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT;	\
+	ANNOTATE_NOINSTR_ALLOWED(name)
 
 /*
  * The _NOINSTR variants are used to tell objtool the static key is allowed to
@@ -397,10 +400,12 @@ struct static_key_false {
  * definition with the rationale.
  */
 #define DEFINE_STATIC_KEY_TRUE_NOINSTR(name)					\
-	DEFINE_STATIC_KEY_TRUE(name)
+	DEFINE_STATIC_KEY_TRUE(name);						\
+	ANNOTATE_NOINSTR_ALLOWED(name)
 
 #define DEFINE_STATIC_KEY_FALSE_NOINSTR(name)					\
-	DEFINE_STATIC_KEY_FALSE(name)
+	DEFINE_STATIC_KEY_FALSE(name);						\
+	ANNOTATE_NOINSTR_ALLOWED(name)
 
 #define DECLARE_STATIC_KEY_FALSE(name)	\
 	extern struct static_key_false name
diff --git a/include/linux/objtool.h b/include/linux/objtool.h
index b3b8d3dab52d5..1a7389f273063 100644
--- a/include/linux/objtool.h
+++ b/include/linux/objtool.h
@@ -34,6 +34,12 @@
 	static void __used __section(".discard.func_stack_frame_non_standard") \
 		*__func_stack_frame_non_standard_##func = func
 
+#define __ANNOTATE_NOINSTR_ALLOWED(key) \
+	static void __used __section(".discard.noinstr_allowed") \
+		*__annotate_noinstr_allowed_##key = &key
+
+#define ANNOTATE_NOINSTR_ALLOWED(key) __ANNOTATE_NOINSTR_ALLOWED(key)
+
 /*
  * STACK_FRAME_NON_STANDARD_FP() is a frame-pointer-specific function ignore
  * for the case where a function is intentionally missing frame pointer setup,
@@ -157,6 +163,7 @@
 #define STACK_FRAME_NON_STANDARD_FP(func)
 #define ANNOTATE_NOENDBR
 #define ASM_REACHABLE
+#define ANNOTATE_NOINSTR_ALLOWED(key)
 #else
 #define ANNOTATE_INTRA_FUNCTION_CALL
 .macro UNWIND_HINT type:req sp_reg=0 sp_offset=0 signal=0
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index ea6ca57e2a829..0d4b16d348501 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -133,6 +133,7 @@
 
 #include <linux/types.h>
 #include <linux/cpu.h>
+#include <linux/objtool.h>
 #include <linux/static_call_types.h>
 
 #ifdef CONFIG_HAVE_STATIC_CALL
@@ -198,6 +199,7 @@ extern long __static_call_return0(void);
 		.func = _func,						\
 		.type = 1,						\
 	};								\
+	ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name));		\
 	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
 
 #define DEFINE_STATIC_CALL_NULL(name, _func)				\
@@ -214,6 +216,7 @@ extern long __static_call_return0(void);
 		.func = NULL,						\
 		.type = 1,						\
 	};								\
+	ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name));		\
 	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
 
 #define DEFINE_STATIC_CALL_RET0(name, _func)				\
diff --git a/tools/objtool/Documentation/objtool.txt b/tools/objtool/Documentation/objtool.txt
index 7c3ee959b63c7..922d3b41541d0 100644
--- a/tools/objtool/Documentation/objtool.txt
+++ b/tools/objtool/Documentation/objtool.txt
@@ -447,6 +447,40 @@ the objtool maintainers.
    names and does not use module_init() / module_exit() macros to create
    them.
 
+13. file.o: warning: func()+0x2a: key: non-RO static key usage in noinstr code
+    file.o: warning: func()+0x2a: key: non-RO static call usage in noinstr code
+
+  This means that noinstr function func() uses a static key or
+  static call named 'key' which can be modified at runtime.  This is
+  discouraged because it prevents code patching IPIs from being
+  deferred.
+
+  You have the following options:
+
+  1) Check whether the static key/call in question is only modified
+     during init.  If so, define it as read-only-after-init with
+     DEFINE_STATIC_KEY_*_RO() or DEFINE_STATIC_CALL_RO().
+
+  2) Avoid the runtime patching.  For static keys this can be done by
+     using static_key_enabled() or by getting rid of the static key
+     altogether if performance is not a concern.
+
+     For static calls, something like the following could be done:
+
+       target = static_call_query(foo);
+       if (target == func1)
+	       func1();
+	else if (target == func2)
+		func2();
+	...
+
+  3) Silence the warning by defining the static key/call with
+     DEFINE_STATIC_*_NOINSTR().  This decision should not
+     be taken lightly as it may result in code patching IPIs getting
+     sent to isolated NOHZ_FULL CPUs running in pure userspace.  A
+     comment should be added above the definition explaining the
+     rationale for the decision.
+
 
 If the error doesn't seem to make sense, it could be a bug in objtool.
 Feel free to ask the objtool maintainer for help.
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 6aa9259fc9940..24219538c1587 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1068,6 +1068,45 @@ static int create_direct_call_sections(struct objtool_file *file)
 	return 0;
 }
 
+static int read_noinstr_allowed(struct objtool_file *file)
+{
+	struct section *rsec;
+	struct symbol *sym;
+	struct reloc *reloc;
+
+	rsec = find_section_by_name(file->elf, ".rela.discard.noinstr_allowed");
+	if (!rsec)
+		return 0;
+
+	for_each_reloc(rsec, reloc) {
+		switch (reloc->sym->type) {
+		case STT_OBJECT:
+		case STT_FUNC:
+			sym = reloc->sym;
+			break;
+
+		case STT_SECTION:
+			sym = find_symbol_by_offset(reloc->sym->sec,
+						    reloc_addend(reloc));
+			if (!sym) {
+				WARN_FUNC("can't find static key/call symbol",
+					  reloc->sym->sec, reloc_addend(reloc));
+				return -1;
+			}
+			break;
+
+		default:
+			WARN("unexpected relocation symbol type in %s: %d",
+			     rsec->name, reloc->sym->type);
+			return -1;
+		}
+
+		sym->noinstr_allowed = 1;
+	}
+
+	return 0;
+}
+
 /*
  * Warnings shouldn't be reported for ignored functions.
  */
@@ -1955,6 +1994,8 @@ static int handle_jump_alt(struct objtool_file *file,
 		return -1;
 	}
 
+	orig_insn->key = special_alt->key;
+
 	if (opts.hack_jump_label && special_alt->key_addend & 2) {
 		struct reloc *reloc = insn_reloc(file, orig_insn);
 
@@ -2731,6 +2772,10 @@ static int decode_sections(struct objtool_file *file)
 	if (ret)
 		return ret;
 
+	ret = read_noinstr_allowed(file);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 
@@ -3494,9 +3539,9 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
 	return file->pv_ops[idx].clean;
 }
 
-static inline bool noinstr_call_dest(struct objtool_file *file,
-				     struct instruction *insn,
-				     struct symbol *func)
+static inline bool noinstr_call_allowed(struct objtool_file *file,
+					struct instruction *insn,
+					struct symbol *func)
 {
 	/*
 	 * We can't deal with indirect function calls at present;
@@ -3516,10 +3561,10 @@ static inline bool noinstr_call_dest(struct objtool_file *file,
 		return true;
 
 	/*
-	 * If the symbol is a static_call trampoline, we can't tell.
+	 * Only DEFINE_STATIC_CALL_*_RO allowed.
 	 */
 	if (func->static_call_tramp)
-		return true;
+		return func->noinstr_allowed;
 
 	/*
 	 * The __ubsan_handle_*() calls are like WARN(), they only happen when
@@ -3532,14 +3577,29 @@ static inline bool noinstr_call_dest(struct objtool_file *file,
 	return false;
 }
 
+static char *static_call_name(struct symbol *func)
+{
+	return func->name + strlen("__SCT__");
+}
+
 static int validate_call(struct objtool_file *file,
 			 struct instruction *insn,
 			 struct insn_state *state)
 {
-	if (state->noinstr && state->instr <= 0 &&
-	    !noinstr_call_dest(file, insn, insn_call_dest(insn))) {
-		WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn));
-		return 1;
+	if (state->noinstr && state->instr <= 0) {
+		struct symbol *dest = insn_call_dest(insn);
+
+		if (dest && dest->static_call_tramp) {
+			if (!dest->noinstr_allowed) {
+				WARN_INSN(insn, "%s: non-RO static call usage in noinstr",
+					  static_call_name(dest));
+			}
+
+		} else if (dest && !noinstr_call_allowed(file, insn, dest)) {
+			WARN_INSN(insn, "call to %s() leaves .noinstr.text section",
+				  call_dest_name(file, insn));
+			return 1;
+		}
 	}
 
 	if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) {
@@ -3604,6 +3664,17 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct
 	return 0;
 }
 
+static int validate_static_key(struct instruction *insn, struct insn_state *state)
+{
+	if (state->noinstr && state->instr <= 0 && !insn->key->noinstr_allowed) {
+		WARN_INSN(insn, "%s: non-RO static key usage in noinstr",
+			  insn->key->name);
+		return 1;
+	}
+
+	return 0;
+}
+
 static struct instruction *next_insn_to_validate(struct objtool_file *file,
 						 struct instruction *insn)
 {
@@ -3765,6 +3836,9 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
 		if (handle_insn_ops(insn, next_insn, &state))
 			return 1;
 
+		if (insn->key)
+			validate_static_key(insn, &state);
+
 		switch (insn->type) {
 
 		case INSN_RETURN:
diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h
index daa46f1f0965a..c0da7246eac7b 100644
--- a/tools/objtool/include/objtool/check.h
+++ b/tools/objtool/include/objtool/check.h
@@ -77,6 +77,7 @@ struct instruction {
 	struct symbol *sym;
 	struct stack_op *stack_ops;
 	struct cfi_state *cfi;
+	struct symbol *key;
 };
 
 static inline struct symbol *insn_func(struct instruction *insn)
diff --git a/tools/objtool/include/objtool/elf.h b/tools/objtool/include/objtool/elf.h
index d7e815c2fd156..0cb79931262bb 100644
--- a/tools/objtool/include/objtool/elf.h
+++ b/tools/objtool/include/objtool/elf.h
@@ -69,6 +69,7 @@ struct symbol {
 	u8 embedded_insn     : 1;
 	u8 local_label       : 1;
 	u8 frame_pointer     : 1;
+	u8 noinstr_allowed   : 1;
 	struct list_head pv_target;
 	struct reloc *relocs;
 };
diff --git a/tools/objtool/include/objtool/special.h b/tools/objtool/include/objtool/special.h
index 86d4af9c5aa9d..ce4759358ec48 100644
--- a/tools/objtool/include/objtool/special.h
+++ b/tools/objtool/include/objtool/special.h
@@ -20,6 +20,7 @@ struct special_alt {
 	bool skip_alt;
 	bool jump_or_nop;
 	u8 key_addend;
+	struct symbol *key;
 
 	struct section *orig_sec;
 	unsigned long orig_off;
diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index 097a69db82a0e..982d5cb55e1bb 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -119,14 +119,26 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
 
 	if (entry->key) {
 		struct reloc *key_reloc;
+		struct symbol *key;
+		s64 key_addend;
 
 		key_reloc = find_reloc_by_dest(elf, sec, offset + entry->key);
 		if (!key_reloc) {
-			WARN_FUNC("can't find key reloc",
-				  sec, offset + entry->key);
+			WARN_FUNC("can't find key reloc", sec, offset + entry->key);
 			return -1;
 		}
-		alt->key_addend = reloc_addend(key_reloc);
+
+		key = key_reloc->sym;
+		key_addend = reloc_addend(key_reloc);
+
+		if (key->type == STT_SECTION)
+			key = find_symbol_by_offset(key->sec, key_addend & ~3);
+
+		/* embedded keys not supported */
+		if (key) {
+			alt->key = key;
+			alt->key_addend = key_addend;
+		}
 	}
 
 	return 0;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 21/30] context_tracking: Explicitly use CT_STATE_KERNEL where it is missing
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (19 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 20/30] objtool: Add noinstr validation for static branches/calls Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 22/30] context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry Valentin Schneider
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

CT_STATE_KERNEL being zero means it can be (and is) omitted in a handful of
places. A later patch will change CT_STATE_KERNEL into a non-zero value;
prepare for that by using it where it should be:

o In the initial CT state
o At kernel entry / exit

No change in functionality intended.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/context_tracking.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 938c48952d265..a61498a8425e2 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -31,7 +31,7 @@ DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
 	.nesting = 1,
 	.nmi_nesting = CT_NESTING_IRQ_NONIDLE,
 #endif
-	.state = ATOMIC_INIT(CT_RCU_WATCHING),
+	.state = ATOMIC_INIT(CT_RCU_WATCHING | CT_STATE_KERNEL),
 };
 EXPORT_SYMBOL_GPL(context_tracking);
 
@@ -147,7 +147,7 @@ static void noinstr ct_kernel_exit(bool user, int offset)
 	instrumentation_end();
 	WRITE_ONCE(ct->nesting, 0); /* Avoid irq-access tearing. */
 	// RCU is watching here ...
-	ct_kernel_exit_state(offset);
+	ct_kernel_exit_state(offset - CT_STATE_KERNEL);
 	// ... but is no longer watching here.
 	rcu_task_exit();
 }
@@ -175,7 +175,7 @@ static void noinstr ct_kernel_enter(bool user, int offset)
 	}
 	rcu_task_enter();
 	// RCU is not watching here ...
-	ct_kernel_enter_state(offset);
+	ct_kernel_enter_state(offset + CT_STATE_KERNEL);
 	// ... but is watching here.
 	instrumentation_begin();
 
@@ -537,7 +537,7 @@ void noinstr __ct_user_enter(enum ctx_state state)
 				 * RCU only requires CT_RCU_WATCHING increments to be fully
 				 * ordered.
 				 */
-				raw_atomic_add(state, &ct->state);
+				raw_atomic_add(state - CT_STATE_KERNEL, &ct->state);
 			}
 		}
 	}
@@ -647,7 +647,7 @@ void noinstr __ct_user_exit(enum ctx_state state)
 				 * RCU only requires CT_RCU_WATCHING increments to be fully
 				 * ordered.
 				 */
-				raw_atomic_sub(state, &ct->state);
+				raw_atomic_sub(state - CT_STATE_KERNEL, &ct->state);
 			}
 		}
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 22/30] context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (20 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 21/30] context_tracking: Explicitly use CT_STATE_KERNEL where it is missing Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-22  0:22   ` Frederic Weisbecker
  2025-01-14 17:51 ` [PATCH v4 23/30] context_tracking: Turn CT_STATE_* into bits Valentin Schneider
                   ` (7 subsequent siblings)
  29 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

ct_nmi_{enter, exit}() only touches the RCU watching counter and doesn't
modify the actual CT state part context_tracking.state. This means that
upon receiving an IRQ when idle, the CT_STATE_IDLE->CT_STATE_KERNEL
transition only happens in ct_idle_exit().

One can note that ct_nmi_enter() can only ever be entered with the CT state
as either CT_STATE_KERNEL or CT_STATE_IDLE, as an IRQ/NMI happening in the
CT_STATE_USER or CT_STATE_GUEST states will be routed down to ct_user_exit().

Add/remove CT_STATE_IDLE from the context tracking state as needed in
ct_nmi_{enter, exit}().

Note that this leaves the following window where the CPU is executing code
in kernelspace, but the context tracking state is CT_STATE_IDLE:

  ~> IRQ
  ct_nmi_enter()
    state = state + CT_STATE_KERNEL - CT_STATE_IDLE

  [...]

  ct_nmi_exit()
    state = state - CT_STATE_KERNEL + CT_STATE_IDLE

  [...] /!\ CT_STATE_IDLE here while we're really in kernelspace! /!\

  ct_cpuidle_exit()
    state = state + CT_STATE_KERNEL - CT_STATE_IDLE

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/context_tracking.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index a61498a8425e2..15f10ddec8cbe 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -236,7 +236,9 @@ void noinstr ct_nmi_exit(void)
 	instrumentation_end();
 
 	// RCU is watching here ...
-	ct_kernel_exit_state(CT_RCU_WATCHING);
+	ct_kernel_exit_state(CT_RCU_WATCHING -
+			     CT_STATE_KERNEL +
+			     CT_STATE_IDLE);
 	// ... but is no longer watching here.
 
 	if (!in_nmi())
@@ -259,6 +261,7 @@ void noinstr ct_nmi_enter(void)
 {
 	long incby = 2;
 	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
+	int curr_state;
 
 	/* Complain about underflow. */
 	WARN_ON_ONCE(ct_nmi_nesting() < 0);
@@ -271,13 +274,26 @@ void noinstr ct_nmi_enter(void)
 	 * to be in the outermost NMI handler that interrupted an RCU-idle
 	 * period (observation due to Andy Lutomirski).
 	 */
-	if (!rcu_is_watching_curr_cpu()) {
+	curr_state = raw_atomic_read(this_cpu_ptr(&context_tracking.state));
+	if (!(curr_state & CT_RCU_WATCHING)) {
 
 		if (!in_nmi())
 			rcu_task_enter();
 
+		/*
+		 * RCU isn't watching, so we're one of
+		 * CT_STATE_IDLE
+		 * CT_STATE_USER
+		 * CT_STATE_GUEST
+		 * guest/user entry is handled by ct_user_enter(), so this has
+		 * to be idle entry.
+		 */
+		WARN_ON_ONCE((curr_state & CT_STATE_MASK) != CT_STATE_IDLE);
+
 		// RCU is not watching here ...
-		ct_kernel_enter_state(CT_RCU_WATCHING);
+		ct_kernel_enter_state(CT_RCU_WATCHING +
+				      CT_STATE_KERNEL -
+				      CT_STATE_IDLE);
 		// ... but is watching here.
 
 		instrumentation_begin();
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 23/30] context_tracking: Turn CT_STATE_* into bits
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (21 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 22/30] context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 24/30] context-tracking: Introduce work deferral infrastructure Valentin Schneider
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

A later patch will need to easily exclude CT_STATE_KERNEL from a genuine
ct->state read, which requires that value being non-zero and exclusive with
the other CT_STATE_* values.

This increases the size of the CT_STATE region of the ct->state variable by
two bits.
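
With one bit per state, checking for (or excluding) a given state becomes a
plain mask operation, e.g. (illustrative sketch, not part of this patch):

  /* True only when the context-state bits contain CT_STATE_KERNEL */
  bool in_kernel = raw_atomic_read(this_cpu_ptr(&context_tracking.state)) &
		   CT_STATE_KERNEL;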

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/context_tracking_state.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 0b81248aa03e2..eb2149b20baef 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -11,11 +11,11 @@
 
 enum ctx_state {
 	CT_STATE_DISABLED	= -1,	/* returned by ct_state() if unknown */
-	CT_STATE_KERNEL		= 0,
-	CT_STATE_IDLE		= 1,
-	CT_STATE_USER		= 2,
-	CT_STATE_GUEST		= 3,
-	CT_STATE_MAX		= 4,
+	CT_STATE_KERNEL		= 1,
+	CT_STATE_IDLE		= 2,
+	CT_STATE_USER		= 4,
+	CT_STATE_GUEST		= 8,
+	CT_STATE_MAX		= 9,
 };
 
 struct context_tracking {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 24/30] context-tracking: Introduce work deferral infrastructure
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (22 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 23/30] context_tracking: Turn CT_STATE_* into bits Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Nicolas Saenz Julienne, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Geert Uytterhoeven, Yosry Ahmed, Kirill A. Shutemov,
	Masami Hiramatsu (Google), Jinghao Jia, Luis Chamberlain,
	Randy Dunlap, Tiezhu Yang

smp_call_function() & friends have the unfortunate habit of sending IPIs to
isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online
CPUs.

Some callsites can be bent into doing the right thing, such as done by commit:

  cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")

Unfortunately, not all SMP callbacks can be omitted in this
fashion. However, some of them only affect execution in kernelspace, which
means they don't have to be executed *immediately* if the target CPU is in
userspace: stashing the callback and executing it upon the next kernel entry
would suffice. x86 kernel instruction patching or kernel TLB invalidation
are prime examples of it.

Reduce the RCU dynticks counter width to free up some bits to be used as a
deferred callback bitmask. Add some build-time checks to validate that
setup.

Presence of CT_STATE_KERNEL in the ct_state prevents queuing deferred work.

Later commits introduce the bit:callback mappings.
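
The intended shape of the deferral, as wired up by this and later patches, is
roughly the following (sketch; helper names on both sides are illustrative):

  /* Sender: try to defer; fall back to the IPI if the CPU is in the kernel */
  if (!ct_set_cpu_work(cpu, CT_WORK_n))
	smp_call_function_single(cpu, work_fn, NULL, 1);

  /*
   * Target, upon kernel entry: 'work' holds the bits fetched-and-cleared
   * from ct->state by ct_state_inc_clear_work().
   */
  for_each_set_bit(bit, &work, CT_WORK_MAX_OFFSET)
	arch_context_tracking_work(BIT(bit));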

Link: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/Kconfig                                 |  9 +++
 arch/x86/Kconfig                             |  1 +
 arch/x86/include/asm/context_tracking_work.h | 16 +++++
 include/linux/context_tracking.h             | 21 +++++++
 include/linux/context_tracking_state.h       | 30 ++++++---
 include/linux/context_tracking_work.h        | 26 ++++++++
 kernel/context_tracking.c                    | 66 +++++++++++++++++++-
 kernel/time/Kconfig                          |  5 ++
 8 files changed, 162 insertions(+), 12 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 6682b2a53e342..b637f20f0fc68 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -952,6 +952,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK
 	  - No use of instrumentation, unless instrumentation_begin() got
 	    called.
 
+config HAVE_CONTEXT_TRACKING_WORK
+	bool
+	help
+	  Architecture supports deferring work while not in kernel context.
+	  This is especially useful on setups with isolated CPUs that might
+	  want to avoid being interrupted to perform housekeeping tasks (for
+	  ex. TLB invalidation or icache invalidation). The housekeeping
+	  operations are performed upon re-entering the kernel.
+
 config HAVE_TIF_NOHZ
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9d7bd0ae48c42..ca5dd4cfc354b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -216,6 +216,7 @@ config X86
 	select HAVE_CMPXCHG_LOCAL
 	select HAVE_CONTEXT_TRACKING_USER		if X86_64
 	select HAVE_CONTEXT_TRACKING_USER_OFFSTACK	if HAVE_CONTEXT_TRACKING_USER
+	select HAVE_CONTEXT_TRACKING_WORK		if X86_64
 	select HAVE_C_RECORDMCOUNT
 	select HAVE_OBJTOOL_MCOUNT		if HAVE_OBJTOOL
 	select HAVE_OBJTOOL_NOP_MCOUNT		if HAVE_OBJTOOL_MCOUNT
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
new file mode 100644
index 0000000000000..5f3b2d0977235
--- /dev/null
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
+#define _ASM_X86_CONTEXT_TRACKING_WORK_H
+
+static __always_inline void arch_context_tracking_work(enum ct_work work)
+{
+	switch (work) {
+	case CT_WORK_n:
+		// Do work...
+		break;
+	case CT_WORK_MAX:
+		WARN_ON_ONCE(true);
+	}
+}
+
+#endif
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index af9fe87a09225..0b0faa040e9b5 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -5,6 +5,7 @@
 #include <linux/sched.h>
 #include <linux/vtime.h>
 #include <linux/context_tracking_state.h>
+#include <linux/context_tracking_work.h>
 #include <linux/instrumentation.h>
 
 #include <asm/ptrace.h>
@@ -137,6 +138,26 @@ static __always_inline unsigned long ct_state_inc(int incby)
 	return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state));
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static __always_inline unsigned long ct_state_inc_clear_work(int incby)
+{
+	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
+	unsigned long new, old, state;
+
+	state = arch_atomic_read(&ct->state);
+	do {
+		old = state;
+		new = old & ~CT_WORK_MASK;
+		new += incby;
+		state = arch_atomic_cmpxchg(&ct->state, old, new);
+	} while (old != state);
+
+	return new;
+}
+#else
+#define ct_state_inc_clear_work(x) ct_state_inc(x)
+#endif
+
 static __always_inline bool warn_rcu_enter(void)
 {
 	bool ret = false;
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index eb2149b20baef..b6ddfccba9714 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -5,6 +5,7 @@
 #include <linux/percpu.h>
 #include <linux/static_key.h>
 #include <linux/context_tracking_irq.h>
+#include <linux/context_tracking_work.h>
 
 /* Offset to allow distinguishing irq vs. task-based idle entry/exit. */
 #define CT_NESTING_IRQ_NONIDLE	((LONG_MAX / 2) + 1)
@@ -39,16 +40,19 @@ struct context_tracking {
 };
 
 /*
- * We cram two different things within the same atomic variable:
+ * We cram up to three different things within the same atomic variable:
  *
- *                     CT_RCU_WATCHING_START  CT_STATE_START
- *                                |                |
- *                                v                v
- *     MSB [ RCU watching counter ][ context_state ] LSB
- *         ^                       ^
- *         |                       |
- * CT_RCU_WATCHING_END        CT_STATE_END
+ *                     CT_RCU_WATCHING_START                  CT_STATE_START
+ *                                |         CT_WORK_START          |
+ *                                |               |                |
+ *                                v               v                v
+ *     MSB [ RCU watching counter ][ context work ][ context_state ] LSB
+ *         ^                       ^               ^
+ *         |                       |               |
+ *         |                  CT_WORK_END          |
+ * CT_RCU_WATCHING_END                        CT_STATE_END
  *
+ * The [ context work ] region spans 0 bits if CONFIG_CONTEXT_TRACKING_WORK=n
  * Bits are used from the LSB upwards, so unused bits (if any) will always be in
  * upper bits of the variable.
  */
@@ -59,18 +63,24 @@ struct context_tracking {
 #define CT_STATE_START 0
 #define CT_STATE_END   (CT_STATE_START + CT_STATE_WIDTH - 1)
 
-#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH)
+#define CT_WORK_WIDTH (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? CT_WORK_MAX_OFFSET : 0)
+#define	CT_WORK_START (CT_STATE_END + 1)
+#define CT_WORK_END   (CT_WORK_START + CT_WORK_WIDTH - 1)
+
+#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_WORK_WIDTH - CT_STATE_WIDTH)
 #define CT_RCU_WATCHING_WIDTH     (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH)
-#define CT_RCU_WATCHING_START     (CT_STATE_END + 1)
+#define CT_RCU_WATCHING_START     (CT_WORK_END + 1)
 #define CT_RCU_WATCHING_END       (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1)
 #define CT_RCU_WATCHING           BIT(CT_RCU_WATCHING_START)
 
 #define CT_STATE_MASK        GENMASK(CT_STATE_END,        CT_STATE_START)
+#define CT_WORK_MASK         GENMASK(CT_WORK_END,         CT_WORK_START)
 #define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
 
 #define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
 
 static_assert(CT_STATE_WIDTH        +
+	      CT_WORK_WIDTH         +
 	      CT_RCU_WATCHING_WIDTH +
 	      CT_UNUSED_WIDTH       ==
 	      CT_SIZE);
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
new file mode 100644
index 0000000000000..c68245f8d77c5
--- /dev/null
+++ b/include/linux/context_tracking_work.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CONTEXT_TRACKING_WORK_H
+#define _LINUX_CONTEXT_TRACKING_WORK_H
+
+#include <linux/bitops.h>
+
+enum {
+	CT_WORK_n_OFFSET,
+	CT_WORK_MAX_OFFSET
+};
+
+enum ct_work {
+	CT_WORK_n        = BIT(CT_WORK_n_OFFSET),
+	CT_WORK_MAX      = BIT(CT_WORK_MAX_OFFSET)
+};
+
+#include <asm/context_tracking_work.h>
+
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+extern bool ct_set_cpu_work(unsigned int cpu, enum ct_work work);
+#else
+static inline bool
+ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; }
+#endif
+
+#endif
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 15f10ddec8cbe..f7019c12269f9 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -72,6 +72,67 @@ static __always_inline void rcu_task_trace_heavyweight_exit(void)
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static noinstr void ct_work_flush(unsigned long seq)
+{
+	int bit;
+
+	seq = (seq & CT_WORK_MASK) >> CT_WORK_START;
+
+	/*
+	 * arch_context_tracking_work() must be noinstr, non-blocking,
+	 * and NMI safe.
+	 */
+	for_each_set_bit(bit, &seq, CT_WORK_MAX)
+		arch_context_tracking_work(BIT(bit));
+}
+
+/**
+ * ct_set_cpu_work - set work to be run at next kernel context entry
+ *
+ * If @cpu is not currently executing in kernelspace, it will execute the
+ * callback mapped to @work (see arch_context_tracking_work()) at its next
+ * transition to CT_STATE_KERNEL.
+ *
+ * If it is already in CT_STATE_KERNEL, this will be a no-op.
+ */
+bool ct_set_cpu_work(unsigned int cpu, enum ct_work work)
+{
+	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
+	unsigned int old;
+	bool ret = false;
+
+	if (!ct->active)
+		return false;
+
+	preempt_disable();
+
+	old = atomic_read(&ct->state);
+
+	/*
+	 * We only want to set the work bit if the target CPU is not in
+	 * kernelspace, so we clear the KERNEL bit here and let the cmpxchg do
+	 * the check for us - the state could change between the atomic_read() and
+	 * the cmpxchg().
+	 */
+	old &= ~CT_STATE_KERNEL;
+	/*
+	 * Try setting the work until either
+	 * - the target CPU has entered kernelspace
+	 * - the work has been set
+	 */
+	do {
+		ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START));
+	} while (!ret && ((old & CT_STATE_MASK) != CT_STATE_KERNEL));
+
+	preempt_enable();
+	return ret;
+}
+#else
+static __always_inline void ct_work_flush(unsigned long work) { }
+static __always_inline void ct_work_clear(struct context_tracking *ct) { }
+#endif
+
 /*
  * Record entry into an extended quiescent state.  This is only to be
  * called when not already in an extended quiescent state, that is,
@@ -88,7 +149,7 @@ static noinstr void ct_kernel_exit_state(int offset)
 	 * next idle sojourn.
 	 */
 	rcu_task_trace_heavyweight_enter();  // Before CT state update!
-	seq = ct_state_inc(offset);
+	seq = ct_state_inc_clear_work(offset);
 	// RCU is no longer watching.  Better be in extended quiescent state!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & CT_RCU_WATCHING));
 }
@@ -100,7 +161,7 @@ static noinstr void ct_kernel_exit_state(int offset)
  */
 static noinstr void ct_kernel_enter_state(int offset)
 {
-	int seq;
+	unsigned long seq;
 
 	/*
 	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
@@ -108,6 +169,7 @@ static noinstr void ct_kernel_enter_state(int offset)
 	 * critical section.
 	 */
 	seq = ct_state_inc(offset);
+	ct_work_flush(seq);
 	// RCU is now watching.  Better not be in an extended quiescent state!
 	rcu_task_trace_heavyweight_exit();  // After CT state update!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & CT_RCU_WATCHING));
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index b0b97a60aaa6f..7e8106a0d981f 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE
 	  Say N otherwise, this option brings an overhead that you
 	  don't want in production.
 
+config CONTEXT_TRACKING_WORK
+	bool
+	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+	default y
+
 config NO_HZ
 	bool "Old Idle dynticks config"
 	help
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (23 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 24/30] context-tracking: Introduce work deferral infrastructure Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 21:13   ` Sean Christopherson
                     ` (2 more replies)
  2025-01-14 17:51 ` [PATCH v4 26/30] x86,tlb: Make __flush_tlb_global() noinstr-compliant Valentin Schneider
                   ` (4 subsequent siblings)
  29 siblings, 3 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Peter Zijlstra, Nicolas Saenz Julienne, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Josh Poimboeuf, Pawan Gupta, Sean Christopherson, Paolo Bonzini,
	Andy Lutomirski, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Geert Uytterhoeven, Yosry Ahmed, Kirill A. Shutemov,
	Masami Hiramatsu (Google), Jinghao Jia, Luis Chamberlain,
	Randy Dunlap, Tiezhu Yang

text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
them vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.

As the synchronization IPIs are sent using a blocking call, returning from
text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is deferred.
In other words, to safely defer this synchronization, any kernel
instruction leading to the execution of the deferred instruction
sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.

This means we must pay attention to mutable instructions in the early entry
code:
- alternatives
- static keys
- static calls
- all sorts of probes (kprobes/ftrace/bpf/???)

The early entry code leading to ct_work_flush() is noinstr, which gets rid
of the probes.

Alternatives are safe, because it's boot-time patching (before SMP is
even brought up) which is before any IPI deferral can happen.

This leaves us with static keys and static calls.

Any static key used in early entry code should only ever be enabled at
boot time and never flipped afterwards, IOW __ro_after_init (pretty much
like alternatives). Exceptions are explicitly marked as allowed in
.noinstr and will always generate an immediate IPI when flipped.

The same applies to static calls - they should only be updated at boot
time, or be manually marked as an exception.

Objtool is now able to point at static keys/calls that don't respect this,
and all static keys/calls used in early entry code have now been verified
as behaving appropriately.

Leverage the new context_tracking infrastructure to defer sync_core() IPIs
to a target CPU's next kernel entry.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  6 ++--
 arch/x86/include/asm/text-patching.h         |  1 +
 arch/x86/kernel/alternative.c                | 38 ++++++++++++++++----
 arch/x86/kernel/kprobes/core.c               |  4 +--
 arch/x86/kernel/kprobes/opt.c                |  4 +--
 arch/x86/kernel/module.c                     |  2 +-
 include/asm-generic/sections.h               | 15 ++++++++
 include/linux/context_tracking_work.h        |  4 +--
 8 files changed, 59 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 5f3b2d0977235..485b32881fde5 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -2,11 +2,13 @@
 #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
+#include <asm/sync_core.h>
+
 static __always_inline void arch_context_tracking_work(enum ct_work work)
 {
 	switch (work) {
-	case CT_WORK_n:
-		// Do work...
+	case CT_WORK_SYNC:
+		sync_core();
 		break;
 	case CT_WORK_MAX:
 		WARN_ON_ONCE(true);
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index ab9e143ec9fea..9dfa46f721c1d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -33,6 +33,7 @@ extern void apply_relocation(u8 *buf, const u8 * const instr, size_t instrlen, u
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void text_poke_sync(void);
+extern void text_poke_sync_deferrable(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
 #define text_poke_copy text_poke_copy
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 243843e44e89d..633deea8a89cb 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
 #include <linux/mmu_context.h>
 #include <linux/bsearch.h>
 #include <linux/sync_core.h>
+#include <linux/context_tracking.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -2109,9 +2110,24 @@ static void do_sync_core(void *info)
 	sync_core();
 }
 
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CT_WORK_SYNC);
+}
+
+static void __text_poke_sync(smp_cond_func_t cond_func)
+{
+	on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
 void text_poke_sync(void)
 {
-	on_each_cpu(do_sync_core, NULL, 1);
+	__text_poke_sync(NULL);
+}
+
+void text_poke_sync_deferrable(void)
+{
+	__text_poke_sync(do_sync_core_defer_cond);
 }
 
 /*
@@ -2282,6 +2298,7 @@ static int tp_vec_nr;
  */
 static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 {
+	smp_cond_func_t cond = do_sync_core_defer_cond;
 	unsigned char int3 = INT3_INSN_OPCODE;
 	unsigned int i;
 	int do_sync;
@@ -2317,11 +2334,20 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 * First step: add a int3 trap to the address that will be patched.
 	 */
 	for (i = 0; i < nr_entries; i++) {
-		tp[i].old = *(u8 *)text_poke_addr(&tp[i]);
-		text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
+		void *addr = text_poke_addr(&tp[i]);
+
+		/*
+		 * There's no safe way to defer IPIs for patching text in
+		 * .noinstr, record whether there is at least one such poke.
+		 */
+		if (is_kernel_noinstr_text((unsigned long)addr))
+			cond = NULL;
+
+		tp[i].old = *((u8 *)addr);
+		text_poke(addr, &int3, INT3_INSN_SIZE);
 	}
 
-	text_poke_sync();
+	__text_poke_sync(cond);
 
 	/*
 	 * Second step: update all but the first byte of the patched range.
@@ -2383,7 +2409,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 		 * not necessary and we'd be safe even without it. But
 		 * better safe than sorry (plus there's not only Intel).
 		 */
-		text_poke_sync();
+		__text_poke_sync(cond);
 	}
 
 	/*
@@ -2404,7 +2430,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	}
 
 	if (do_sync)
-		text_poke_sync();
+		__text_poke_sync(cond);
 
 	/*
 	 * Remove and wait for refs to be zero.
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 72e6a45e7ec24..c2fd2578fd5fc 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -817,7 +817,7 @@ void arch_arm_kprobe(struct kprobe *p)
 	u8 int3 = INT3_INSN_OPCODE;
 
 	text_poke(p->addr, &int3, 1);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 	perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
 }
 
@@ -827,7 +827,7 @@ void arch_disarm_kprobe(struct kprobe *p)
 
 	perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
 	text_poke(p->addr, &p->opcode, 1);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 }
 
 void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 36d6809c6c9e1..b2ce4d9c3ba56 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -513,11 +513,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
 	       JMP32_INSN_SIZE - INT3_INSN_SIZE);
 
 	text_poke(addr, new, INT3_INSN_SIZE);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 	text_poke(addr + INT3_INSN_SIZE,
 		  new + INT3_INSN_SIZE,
 		  JMP32_INSN_SIZE - INT3_INSN_SIZE);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 
 	perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
 }
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 8984abd91c001..acc810150b76c 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -194,7 +194,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
 				   write, apply);
 
 	if (!early) {
-		text_poke_sync();
+		text_poke_sync_deferrable();
 		mutex_unlock(&text_mutex);
 	}
 
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index c768de6f19a9a..1b3ba3084fd2f 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -199,6 +199,21 @@ static inline bool is_kernel_inittext(unsigned long addr)
 	       addr < (unsigned long)_einittext;
 }
 
+
+/**
+ * is_kernel_noinstr_text - checks if the pointer address is located in the
+ *                    .noinstr section
+ *
+ * @addr: address to check
+ *
+ * Returns: true if the address is located in .noinstr, false otherwise.
+ */
+static inline bool is_kernel_noinstr_text(unsigned long addr)
+{
+	return addr >= (unsigned long)__noinstr_text_start &&
+	       addr < (unsigned long)__noinstr_text_end;
+}
+
 /**
  * __is_kernel_text - checks if the pointer address is located in the
  *                    .text section
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index c68245f8d77c5..931ade1dcbcc2 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -5,12 +5,12 @@
 #include <linux/bitops.h>
 
 enum {
-	CT_WORK_n_OFFSET,
+	CT_WORK_SYNC_OFFSET,
 	CT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
-	CT_WORK_n        = BIT(CT_WORK_n_OFFSET),
+	CT_WORK_SYNC     = BIT(CT_WORK_SYNC_OFFSET),
 	CT_WORK_MAX      = BIT(CT_WORK_MAX_OFFSET)
 };
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 26/30] x86,tlb: Make __flush_tlb_global() noinstr-compliant
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (24 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 21:45   ` Dave Hansen
  2025-01-14 17:51 ` [PATCH v4 27/30] x86/tlb: Make __flush_tlb_local() noinstr-compliant Valentin Schneider
                   ` (3 subsequent siblings)
  29 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Peter Zijlstra, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

From: Peter Zijlstra <peterz@infradead.org>

Later patches will require issuing a __flush_tlb_all() from noinstr code.
This requires making both __flush_tlb_local() and __flush_tlb_global()
noinstr-compliant.

For __flush_tlb_global(), both native_flush_tlb_global() and xen_flush_tlb()
need to be made noinstr.

Forgo using __native_flush_tlb_global() / native_write_cr4() and have the
ASM directly inlined in the native function. For the Xen stuff,
__always_inline a handful of helpers.
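
As a rough sketch of the rule being applied here (function names made up):
noinstr code must only call other noinstr code or helpers that are
guaranteed to be inlined, otherwise objtool's vmlinux.o validation
complains about the call leaving the .noinstr.text section:

  /* Fully inlined: no call instruction is emitted in the noinstr caller */
  static __always_inline void flush_helper(void)
  {
          /* ...actual flush primitive... */
  }

  /* May run from early entry, before instrumentation is allowed */
  noinstr void flush_from_early_entry(void)
  {
          flush_helper();         /* OK */
          /* some_traceable_func();  <- would trip objtool's noinstr checks */
  }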

Not-signed-off-by: Peter Zijlstra <peterz@infradead.org>
[Changelog faff]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/invpcid.h       | 13 ++++++-------
 arch/x86/include/asm/paravirt.h      |  2 +-
 arch/x86/include/asm/xen/hypercall.h | 11 +++++++++--
 arch/x86/mm/tlb.c                    | 15 +++++++++++----
 arch/x86/xen/mmu_pv.c                | 10 +++++-----
 arch/x86/xen/xen-ops.h               | 12 ++++++++----
 6 files changed, 40 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/invpcid.h b/arch/x86/include/asm/invpcid.h
index 734482afbf81d..ff26136fcd9c6 100644
--- a/arch/x86/include/asm/invpcid.h
+++ b/arch/x86/include/asm/invpcid.h
@@ -2,7 +2,7 @@
 #ifndef _ASM_X86_INVPCID
 #define _ASM_X86_INVPCID
 
-static inline void __invpcid(unsigned long pcid, unsigned long addr,
+static __always_inline void __invpcid(unsigned long pcid, unsigned long addr,
 			     unsigned long type)
 {
 	struct { u64 d[2]; } desc = { { pcid, addr } };
@@ -13,7 +13,7 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr,
 	 * mappings, we don't want the compiler to reorder any subsequent
 	 * memory accesses before the TLB flush.
 	 */
-	asm volatile("invpcid %[desc], %[type]"
+	asm_inline volatile("invpcid %[desc], %[type]"
 		     :: [desc] "m" (desc), [type] "r" (type) : "memory");
 }
 
@@ -23,26 +23,25 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr,
 #define INVPCID_TYPE_ALL_NON_GLOBAL	3
 
 /* Flush all mappings for a given pcid and addr, not including globals. */
-static inline void invpcid_flush_one(unsigned long pcid,
-				     unsigned long addr)
+static __always_inline void invpcid_flush_one(unsigned long pcid, unsigned long addr)
 {
 	__invpcid(pcid, addr, INVPCID_TYPE_INDIV_ADDR);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
-static inline void invpcid_flush_single_context(unsigned long pcid)
+static __always_inline void invpcid_flush_single_context(unsigned long pcid)
 {
 	__invpcid(pcid, 0, INVPCID_TYPE_SINGLE_CTXT);
 }
 
 /* Flush all mappings, including globals, for all PCIDs. */
-static inline void invpcid_flush_all(void)
+static __always_inline void invpcid_flush_all(void)
 {
 	__invpcid(0, 0, INVPCID_TYPE_ALL_INCL_GLOBAL);
 }
 
 /* Flush all mappings for all PCIDs except globals. */
-static inline void invpcid_flush_all_nonglobals(void)
+static __always_inline void invpcid_flush_all_nonglobals(void)
 {
 	__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d4eb9e1d61b8e..b3daee3d46677 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -75,7 +75,7 @@ static inline void __flush_tlb_local(void)
 	PVOP_VCALL0(mmu.flush_tlb_user);
 }
 
-static inline void __flush_tlb_global(void)
+static __always_inline void __flush_tlb_global(void)
 {
 	PVOP_VCALL0(mmu.flush_tlb_kernel);
 }
diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index 97771b9d33af3..291e9f8006f62 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -365,8 +365,8 @@ MULTI_mmu_update(struct multicall_entry *mcl, struct mmu_update *req,
 	trace_xen_mc_entry(mcl, 4);
 }
 
-static inline void
-MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
+static __always_inline void
+__MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
 		int *success_count, domid_t domid)
 {
 	mcl->op = __HYPERVISOR_mmuext_op;
@@ -374,6 +374,13 @@ MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
 	mcl->args[1] = count;
 	mcl->args[2] = (unsigned long)success_count;
 	mcl->args[3] = domid;
+}
+
+static inline void
+MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
+		int *success_count, domid_t domid)
+{
+	__MULTI_mmuext_op(mcl, op, count, success_count, domid);
 
 	trace_xen_mc_entry(mcl, 4);
 }
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a2becb85bea79..2d2ab3e221f0c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1169,9 +1169,10 @@ void flush_tlb_one_user(unsigned long addr)
 /*
  * Flush everything
  */
-STATIC_NOPV void native_flush_tlb_global(void)
+STATIC_NOPV noinstr void native_flush_tlb_global(void)
 {
 	unsigned long flags;
+	unsigned long cr4;
 
 	if (static_cpu_has(X86_FEATURE_INVPCID)) {
 		/*
@@ -1190,9 +1191,15 @@ STATIC_NOPV void native_flush_tlb_global(void)
 	 * be called from deep inside debugging code.)
 	 */
 	raw_local_irq_save(flags);
-
-	__native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4));
-
+	cr4 = this_cpu_read(cpu_tlbstate.cr4);
+	asm volatile("mov %0,%%cr4": : "r" (cr4 ^ X86_CR4_PGE) : "memory");
+	asm volatile("mov %0,%%cr4": : "r" (cr4) : "memory");
+	/*
+	 * Since this bypasses native_write_cr4() and its CR4 pinning checks,
+	 * hard fail if CR4 doesn't match the expected value to ensure that
+	 * anybody doing something dodgy still gets caught.
+	 */
+	BUG_ON(cr4 != this_cpu_read(cpu_tlbstate.cr4));
 	raw_local_irq_restore(flags);
 }
 
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 55a4996d0c04f..4eb265eb867af 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1231,22 +1231,22 @@ static noinstr void xen_write_cr2(unsigned long cr2)
 	this_cpu_read(xen_vcpu)->arch.cr2 = cr2;
 }
 
-static noinline void xen_flush_tlb(void)
+static noinline noinstr void xen_flush_tlb(void)
 {
 	struct mmuext_op *op;
 	struct multicall_space mcs;
 
-	preempt_disable();
+	preempt_disable_notrace();
 
 	mcs = xen_mc_entry(sizeof(*op));
 
 	op = mcs.args;
 	op->cmd = MMUEXT_TLB_FLUSH_LOCAL;
-	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+	__MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
 
-	xen_mc_issue(XEN_LAZY_MMU);
+	__xen_mc_issue(XEN_LAZY_MMU);
 
-	preempt_enable();
+	preempt_enable_notrace();
 }
 
 static void xen_flush_tlb_one_user(unsigned long addr)
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 63c13a2ccf556..effb1a54afbd1 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -235,15 +235,19 @@ static inline struct multicall_space xen_mc_entry(size_t args)
 void xen_mc_flush(void);
 
 /* Issue a multicall if we're not in a lazy mode */
-static inline void xen_mc_issue(unsigned mode)
+static __always_inline void __xen_mc_issue(unsigned mode)
 {
-	trace_xen_mc_issue(mode);
-
 	if ((xen_get_lazy_mode() & mode) == 0)
 		xen_mc_flush();
 
 	/* restore flags saved in xen_mc_batch */
-	local_irq_restore(this_cpu_read(xen_mc_irq_flags));
+	raw_local_irq_restore(this_cpu_read(xen_mc_irq_flags));
+}
+
+static inline void xen_mc_issue(unsigned mode)
+{
+	trace_xen_mc_issue(mode);
+	__xen_mc_issue(mode);
 }
 
 /* Set up a callback to be called when the current batch is flushed */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 27/30] x86/tlb: Make __flush_tlb_local() noinstr-compliant
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (25 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 26/30] x86,tlb: Make __flush_tlb_global() noinstr-compliant Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 21:24   ` Sean Christopherson
  2025-01-14 17:51 ` [PATCH v4 28/30] x86/tlb: Make __flush_tlb_all() noinstr Valentin Schneider
                   ` (2 subsequent siblings)
  29 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

Later patches will require issuing a __flush_tlb_all() from noinstr code.
This requires making both __flush_tlb_local() and __flush_tlb_global()
noinstr-compliant.

For __flush_tlb_local(), xen_flush_tlb() has already been made noinstr, so
it's just native_flush_tlb_local(), and simply __always_inline'ing
invalidate_user_asid() gets us there.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/paravirt.h | 2 +-
 arch/x86/mm/tlb.c               | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index b3daee3d46677..0c0dd186c03e6 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -70,7 +70,7 @@ void native_flush_tlb_one_user(unsigned long addr);
 void native_flush_tlb_multi(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
-static inline void __flush_tlb_local(void)
+static __always_inline void __flush_tlb_local(void)
 {
 	PVOP_VCALL0(mmu.flush_tlb_user);
 }
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2d2ab3e221f0c..18b40bbc2fa15 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -257,7 +257,7 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
  *
  * See SWITCH_TO_USER_CR3.
  */
-static inline void invalidate_user_asid(u16 asid)
+static __always_inline void invalidate_user_asid(u16 asid)
 {
 	/* There is no user ASID if address space separation is off */
 	if (!IS_ENABLED(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION))
@@ -1206,7 +1206,7 @@ STATIC_NOPV noinstr void native_flush_tlb_global(void)
 /*
  * Flush the entire current user mapping
  */
-STATIC_NOPV void native_flush_tlb_local(void)
+STATIC_NOPV noinstr void native_flush_tlb_local(void)
 {
 	/*
 	 * Preemption or interrupts must be disabled to protect the access
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 28/30] x86/tlb: Make __flush_tlb_all() noinstr
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (26 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 27/30] x86/tlb: Make __flush_tlb_local() noinstr-compliant Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
  2025-01-14 17:51 ` [PATCH v4 30/30] context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE Valentin Schneider
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

Later patches will require issuing a __flush_tlb_all() from noinstr code.
Both __flush_tlb_local() and __flush_tlb_global() are now
noinstr-compliant, so __flush_tlb_all() can be made noinstr itself.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/tlbflush.h | 2 +-
 arch/x86/mm/tlb.c               | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 69e79fff41b80..4d11396250999 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -17,7 +17,7 @@
 
 DECLARE_PER_CPU(u64, tlbstate_untag_mask);
 
-void __flush_tlb_all(void);
+noinstr void __flush_tlb_all(void);
 
 #define TLB_FLUSH_ALL	-1UL
 #define TLB_GENERATION_INVALID	0
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 18b40bbc2fa15..119765772ab11 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1229,7 +1229,7 @@ void flush_tlb_local(void)
 /*
  * Flush everything
  */
-void __flush_tlb_all(void)
+noinstr void __flush_tlb_all(void)
 {
 	/*
 	 * This is to catch users with enabled preemption and the PGE feature
@@ -1243,7 +1243,7 @@ void __flush_tlb_all(void)
 		/*
 		 * !PGE -> !PCID (setup_pcid()), thus every flush is total.
 		 */
-		flush_tlb_local();
+		__flush_tlb_local();
 	}
 }
 EXPORT_SYMBOL_GPL(__flush_tlb_all);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (27 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 28/30] x86/tlb: Make __flush_tlb_all() noinstr Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  2025-01-14 18:16   ` Jann Horn
  2025-01-14 17:51 ` [PATCH v4 30/30] context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE Valentin Schneider
  29 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

vunmap()'s issued from housekeeping CPUs are a relatively common source of
interference for isolated NOHZ_FULL CPUs, as they are hit by the
flush_tlb_kernel_range() IPIs.

Given that CPUs executing in userspace do not access data in the vmalloc
range, these IPIs could be deferred until their next kernel entry.

Deferral vs early entry danger zone
===================================

This requires a guarantee that nothing in the vmalloc range can be vunmap'd
and then accessed in early entry code.

Vmalloc uses are, as reported by vmallocinfo:

  $ cat /proc/vmallocinfo | awk '{ print $3 }' | sort | uniq
  __pci_enable_msix_range+0x32b/0x560
  acpi_os_map_iomem+0x22d/0x250
  bpf_prog_alloc_no_stats+0x34/0x140
  fork_idle+0x79/0x120
  gen_pool_add_owner+0x3e/0xb0          ?
  hpet_enable+0xbf/0x470
  irq_init_percpu_irqstack+0x129/0x160
  kernel_clone+0xab/0x3b0
  memremap+0x164/0x330
  n_tty_open+0x18/0xa0
  pcpu_create_chunk+0x4e/0x1b0
  pcpu_create_chunk+0x75/0x1b0
  pcpu_get_vm_areas+0x0/0x1100
  unpurged
  vp_modern_map_capability+0x140/0x270
  zisofs_init+0x16/0x30

I've categorized these as:

a) Device or percpu mappings

   For these to be unmapped, the device (or CPU) has to be removed and an
   eventual IRQ freed. Even if the IRQ isn't freed, context tracking entry
   happens before handling the IRQ itself, per irqentry_enter() usage.

   __pci_enable_msix_range()
   acpi_os_map_iomem()
   irq_init_percpu_irqstack() (not even unmapped when CPU is hot-unplugged!)
   memremap()
   n_tty_open()
   pcpu_create_chunk()
   pcpu_get_vm_areas()
   vp_modern_map_capability()

b) CONFIG_VMAP_STACK

  fork_idle() & kernel_clone()

  vmalloc'd kernel stacks are AFAICT a safe example, as a task running in
  userspace needs to enter kernelspace to execute do_exit() before its
  stack can be vfree'd.

c) Non-issues

  bpf_prog_alloc_no_stats() - early entry is noinstr, no BPF!
  hpet_enable() - hpet_clear_mapping() is only called if __init function
		  fails, no runtime worries
  zisofs_init() - used for zisofs block decompression, that's way past
		   context tracking entry

d) I'm not sure, have a look?

  gen_pool_add_owner() - AIUI this is mainly for PCI / DMA stuff, which
			 again I wouldn't expect to be accessed before
			 context tracking entry.

Changes
======

Blindly deferring any and all flushes of the kernel mappings is a risky
move, so introduce a variant of flush_tlb_kernel_range() that explicitly
allows deferral. Use it for vunmap flushes.

Note that while flush_tlb_kernel_range() may end up issuing a full
flush (including user mappings), this only happens when reaching an
invalidation range threshold where it is cheaper to do a full flush than to
individually invalidate each page in the range via INVLPG. IOW, it doesn't
*require* invalidating user mappings, and thus remains safe to defer until
a later kernel entry.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  4 +++
 arch/x86/include/asm/tlbflush.h              |  1 +
 arch/x86/mm/tlb.c                            | 23 +++++++++++--
 include/linux/context_tracking_work.h        |  2 ++
 mm/vmalloc.c                                 | 35 +++++++++++++++++---
 5 files changed, 57 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 485b32881fde5..da3d270289836 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
 #include <asm/sync_core.h>
+#include <asm/tlbflush.h>
 
 static __always_inline void arch_context_tracking_work(enum ct_work work)
 {
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(enum ct_work work)
 	case CT_WORK_SYNC:
 		sync_core();
 		break;
+	case CT_WORK_TLBI:
+		__flush_tlb_all();
+		break;
 	case CT_WORK_MAX:
 		WARN_ON_ONCE(true);
 	}
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4d11396250999..6e690acc35e53 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -248,6 +248,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 119765772ab11..47fb437acf52a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
 #include <linux/task_work.h>
 #include <linux/mmu_notifier.h>
 #include <linux/mmu_context.h>
+#include <linux/context_tracking.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1042,6 +1043,11 @@ static void do_flush_tlb_all(void *info)
 	__flush_tlb_all();
 }
 
+static bool do_kernel_flush_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CT_WORK_TLBI);
+}
+
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1058,12 +1064,13 @@ static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond_func, unsigned long start, unsigned long end)
 {
 	/* Balance as user space task's flush, a bit conservative */
 	if (end == TLB_FLUSH_ALL ||
 	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		on_each_cpu_cond(cond_func, do_flush_tlb_all, NULL, 1);
 	} else {
 		struct flush_tlb_info *info;
 
@@ -1071,13 +1078,23 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		info = get_flush_tlb_info(NULL, start, end, 0, false,
 					  TLB_GENERATION_INVALID);
 
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		on_each_cpu_cond(cond_func, do_kernel_range_flush, info, 1);
 
 		put_flush_tlb_info();
 		preempt_enable();
 	}
 }
 
+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(NULL, start, end);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(do_kernel_flush_defer_cond, start, end);
+}
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 931ade1dcbcc2..1be60c64cdeca 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -5,12 +5,14 @@
 #include <linux/bitops.h>
 
 enum {
 	CT_WORK_SYNC_OFFSET,
+	CT_WORK_TLBI_OFFSET,
 	CT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
 	CT_WORK_SYNC     = BIT(CT_WORK_SYNC_OFFSET),
+	CT_WORK_TLBI     = BIT(CT_WORK_TLBI_OFFSET),
 	CT_WORK_MAX      = BIT(CT_WORK_MAX_OFFSET)
 };
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5c88d0e90c209..e8aad4d55e955 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -467,6 +467,31 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
 	__vunmap_range_noflush(start, end);
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+/*
+ * !!! BIG FAT WARNING !!!
+ *
+ * The CPU is free to cache any part of the paging hierarchy it wants at any
+ * time. It's also free to set accessed and dirty bits at any time, even for
+ * instructions that may never execute architecturally.
+ *
+ * This means that deferring a TLB flush affecting freed page-table-pages (IOW,
+ * keeping them in a CPU's paging hierarchy cache) is akin to dancing in a
+ * minefield.
+ *
+ * This isn't a problem for deferral of TLB flushes in vmalloc, because
+ * page-table-pages used for vmap() mappings are never freed - see how
+ * __vunmap_range_noflush() walks the whole mapping but only clears the leaf PTEs.
+ * If this ever changes, TLB flush deferral will cause misery.
+ */
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+#else
+#define flush_tlb_kernel_range_deferrable(start, end) flush_tlb_kernel_range(start, end)
+#endif
+
 /**
  * vunmap_range - unmap kernel virtual addresses
  * @addr: start of the VM area to unmap
@@ -480,7 +505,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
 {
 	flush_cache_vunmap(addr, end);
 	vunmap_range_noflush(addr, end);
-	flush_tlb_kernel_range(addr, end);
+	flush_tlb_kernel_range_deferrable(addr, end);
 }
 
 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -2281,7 +2306,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 
 	nr_purge_nodes = cpumask_weight(&purge_nodes);
 	if (nr_purge_nodes > 0) {
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 
 		/* One extra worker is per a lazy_max_pages() full set minus one. */
 		nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
@@ -2384,7 +2409,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 	flush_cache_vunmap(va->va_start, va->va_end);
 	vunmap_range_noflush(va->va_start, va->va_end);
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(va->va_start, va->va_end);
+		flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);
 
 	free_vmap_area_noflush(va);
 }
@@ -2832,7 +2857,7 @@ static void vb_free(unsigned long addr, unsigned long size)
 	vunmap_range_noflush(addr, addr + size);
 
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(addr, addr + size);
+		flush_tlb_kernel_range_deferrable(addr, addr + size);
 
 	spin_lock(&vb->lock);
 
@@ -2897,7 +2922,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 	free_purged_blocks(&purge_list);
 
 	if (!__purge_vmap_area_lazy(start, end, false) && flush)
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 	mutex_unlock(&vmap_purge_lock);
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 30/30] context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE
  2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (28 preceding siblings ...)
  2025-01-14 17:51 ` [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
@ 2025-01-14 17:51 ` Valentin Schneider
  29 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-14 17:51 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these
transitions:

  ct_idle_enter()
    ct_kernel_exit()
      ct_state_inc_clear_work()

  ct_idle_exit()
    ct_kernel_enter()
      ct_work_flush()

With just CONTEXT_TRACKING_IDLE, ct_state_inc_clear_work() is just
ct_state_inc() and ct_work_flush() is a no-op. However, making them
functional as if under CONTEXT_TRACKING_WORK would allow NO_HZ_IDLE to
leverage IPI deferral to keep idle CPUs idle longer.

Having this enabled for NO_HZ_IDLE is a different argument than for having
it for NO_HZ_FULL (power savings vs latency/performance), but the backing
mechanism is identical.

Add a default-no option to enable IPI deferral with NO_HZ_IDLE.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/time/Kconfig | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 7e8106a0d981f..c7398fe5382a0 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -183,9 +183,23 @@ config CONTEXT_TRACKING_USER_FORCE
 
 config CONTEXT_TRACKING_WORK
 	bool
-	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+	depends on HAVE_CONTEXT_TRACKING_WORK && (CONTEXT_TRACKING_USER || CONTEXT_TRACKING_WORK_IDLE)
 	default y
 
+config CONTEXT_TRACKING_WORK_IDLE
+	bool
+	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_IDLE && !CONTEXT_TRACKING_USER
+	default n
+	help
+	 This option enables deferral of some IPIs when they are targeted at CPUs
+	 that are idle. This can help keep CPUs idle longer, but induces some
+	 extra overhead to idle <-> kernel transitions and to IPI sending.
+
+	 Say Y if the power improvements are worth more to you than the added
+	 overheads.
+
+	 Say N otherwise.
+
 config NO_HZ
 	bool "Old Idle dynticks config"
 	help
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-14 17:51 ` [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
@ 2025-01-14 18:16   ` Jann Horn
  2025-01-17 15:25     ` Valentin Schneider
  0 siblings, 1 reply; 86+ messages in thread
From: Jann Horn @ 2025-01-14 18:16 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> flush_tlb_kernel_range() IPIs.
>
> Given that CPUs executing in userspace do not access data in the vmalloc
> range, these IPIs could be deferred until their next kernel entry.
>
> Deferral vs early entry danger zone
> ===================================
>
> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> and then accessed in early entry code.

In other words, it needs a guarantee that no vmalloc allocations that
have been created in the vmalloc region while the CPU was idle can
then be accessed during early entry, right?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 10/30] riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
  2025-01-14 17:51 ` [PATCH v4 10/30] riscv/paravirt: " Valentin Schneider
@ 2025-01-14 18:29   ` Andrew Jones
  0 siblings, 0 replies; 86+ messages in thread
From: Andrew Jones @ 2025-01-14 18:29 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Tue, Jan 14, 2025 at 06:51:23PM +0100, Valentin Schneider wrote:
> The static call is only ever updated in:
> 
>   __init pv_time_init()
>   __init xen_time_setup_guest()
> 
> so mark it appropriately as __ro_after_init.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  arch/riscv/kernel/paravirt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/riscv/kernel/paravirt.c b/arch/riscv/kernel/paravirt.c
> index fa6b0339a65de..dfe8808016fd8 100644
> --- a/arch/riscv/kernel/paravirt.c
> +++ b/arch/riscv/kernel/paravirt.c
> @@ -30,7 +30,7 @@ static u64 native_steal_clock(int cpu)
>  	return 0;
>  }
>  
> -DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
> +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
>  
>  static bool steal_acc = true;
>  static int __init parse_no_stealacc(char *arg)
> -- 
> 2.43.0
>

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs
  2025-01-14 17:51 ` [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
@ 2025-01-14 21:13   ` Sean Christopherson
  2025-01-14 21:48     ` Sean Christopherson
  2025-01-17  9:47     ` Valentin Schneider
  2025-01-14 21:26   ` Sean Christopherson
  2025-01-24 10:48   ` K Prateek Nayak
  2 siblings, 2 replies; 86+ messages in thread
From: Sean Christopherson @ 2025-01-14 21:13 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Peter Zijlstra, Nicolas Saenz Julienne,
	Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Paolo Bonzini, Andy Lutomirski, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Geert Uytterhoeven, Yosry Ahmed, Kirill A. Shutemov,
	Masami Hiramatsu (Google), Jinghao Jia, Luis Chamberlain,
	Randy Dunlap, Tiezhu Yang

On Tue, Jan 14, 2025, Valentin Schneider wrote:
> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
> them vs the newly patched instruction. CPUs that are executing in userspace
> do not need this synchronization to happen immediately, and this is
> actually harmful interference for NOHZ_FULL CPUs.

...

> This leaves us with static keys and static calls.

...

> @@ -2317,11 +2334,20 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>  	 * First step: add a int3 trap to the address that will be patched.
>  	 */
>  	for (i = 0; i < nr_entries; i++) {
> -		tp[i].old = *(u8 *)text_poke_addr(&tp[i]);
> -		text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
> +		void *addr = text_poke_addr(&tp[i]);
> +
> +		/*
> +		 * There's no safe way to defer IPIs for patching text in
> +		 * .noinstr, record whether there is at least one such poke.
> +		 */
> +		if (is_kernel_noinstr_text((unsigned long)addr))
> +			cond = NULL;

Maybe pre-check "cond", especially if multiple ranges need to be checked?  I.e.

		if (cond && is_kernel_noinstr_text(...))
> +
> +		tp[i].old = *((u8 *)addr);
> +		text_poke(addr, &int3, INT3_INSN_SIZE);
>  	}
>  
> -	text_poke_sync();
> +	__text_poke_sync(cond);
>  
>  	/*
>  	 * Second step: update all but the first byte of the patched range.

...

> +/**
> + * is_kernel_noinstr_text - checks if the pointer address is located in the
> + *                    .noinstr section
> + *
> + * @addr: address to check
> + *
> + * Returns: true if the address is located in .noinstr, false otherwise.
> + */
> +static inline bool is_kernel_noinstr_text(unsigned long addr)
> +{
> +	return addr >= (unsigned long)__noinstr_text_start &&
> +	       addr < (unsigned long)__noinstr_text_end;
> +}

This doesn't do the right thing for modules, which matters because KVM can be
built as a module on x86, and because context tracking understands transitions
to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not
being in the kernel, and thus will have IPIs deferred.  If KVM uses a static key
or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the
patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically
use the wrong static path.

I don't expect this to ever cause problems in practice, because patching code in
KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are
actively running guest code, would be all kinds of crazy.  But I do think we
should plug the hole.

If this issue is unique to KVM, i.e. is not a generic problem for all modules (I
assume module code generally isn't allowed in the entry path, even via NMI?), one
idea would be to let KVM register its noinstr section for text poking.
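
For illustration, such a registration could look roughly like the sketch
below. None of these helpers exist today (the names
text_poke_register_noinstr_range() and is_extra_noinstr_text() are made up
here), and locking plus unregistration on module unload are elided:

struct noinstr_text_range {
	struct list_head	list;
	unsigned long		start;
	unsigned long		end;
};

static LIST_HEAD(extra_noinstr_ranges);

/* e.g. called by kvm-intel.ko at init time for its VM-Enter/VM-Exit code. */
void text_poke_register_noinstr_range(unsigned long start, unsigned long end)
{
	struct noinstr_text_range *r = kmalloc(sizeof(*r), GFP_KERNEL);

	if (!r)
		return;

	r->start = start;
	r->end   = end;
	list_add(&r->list, &extra_noinstr_ranges);
}

/* Checked alongside is_kernel_noinstr_text() when deciding to defer the IPI. */
static bool is_extra_noinstr_text(unsigned long addr)
{
	struct noinstr_text_range *r;

	list_for_each_entry(r, &extra_noinstr_ranges, list) {
		if (addr >= r->start && addr < r->end)
			return true;
	}

	return false;
}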

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 18/30] x86/kvm/vmx: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys as allowed in .noinstr
  2025-01-14 17:51 ` [PATCH v4 18/30] x86/kvm/vmx: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys " Valentin Schneider
@ 2025-01-14 21:19   ` Sean Christopherson
  2025-01-17  9:50     ` Valentin Schneider
  0 siblings, 1 reply; 86+ messages in thread
From: Sean Christopherson @ 2025-01-14 21:19 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Josh Poimboeuf, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, Boris Ostrovsky, Pawan Gupta, Paolo Bonzini,
	Andy Lutomirski, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

Please use "KVM: VMX:" for the scope.

On Tue, Jan 14, 2025, Valentin Schneider wrote:
> Later commits will cause objtool to warn about static keys being used in
> .noinstr sections in order to safely defer instruction patching IPIs
> targeted at NOHZ_FULL CPUs.
> 
> These keys are used in .noinstr code, and can be modified at runtime
> (/proc/kernel/vmx* write). However it is not expected that they will be
> flipped during latency-sensitive operations, and thus shouldn't be a source
> of interference wrt the text patching IPI.

This misses KVM's static key that's buried behind CONFIG_HYPERV=m|y.

vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x241: __kvm_is_using_evmcs: non-RO static key usage in noinstr
vmlinux.o: warning: objtool: vmx_update_host_rsp+0x13: __kvm_is_using_evmcs: non-RO static key usage in noinstr

Side topic, it's super annoying that "objtool --noinstr" only runs on vmlinux.o.
I realize objtool doesn't have the visibility to validate cross-object calls,
but couldn't objtool validate calls and static key/branch usage so long as the
target or key/branch is defined in the same object?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 27/30] x86/tlb: Make __flush_tlb_local() noinstr-compliant
  2025-01-14 17:51 ` [PATCH v4 27/30] x86/tlb: Make __flush_tlb_local() noinstr-compliant Valentin Schneider
@ 2025-01-14 21:24   ` Sean Christopherson
  0 siblings, 0 replies; 86+ messages in thread
From: Sean Christopherson @ 2025-01-14 21:24 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Paolo Bonzini, Andy Lutomirski, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Tue, Jan 14, 2025, Valentin Schneider wrote:
> Later patches will require issuing a __flush_tlb_all() from noinstr code.
> This requires making both __flush_tlb_local() and __flush_tlb_global()
> noinstr-compliant.
> 
> For __flush_tlb_local(), xen_flush_tlb() has already been made noinstr, so
> it's just native_flush_tlb_local(), and simply __always_inline'ing
> invalidate_user_asid() gets us there.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---

...

> @@ -1206,7 +1206,7 @@ STATIC_NOPV noinstr void native_flush_tlb_global(void)
>  /*
>   * Flush the entire current user mapping
>   */
> -STATIC_NOPV void native_flush_tlb_local(void)
> +STATIC_NOPV noinstr void native_flush_tlb_local(void)

native_write_cr3() and __native_read_cr3() need to be __always_inline.

vmlinux.o: warning: objtool: native_flush_tlb_local+0x8: call to __native_read_cr3() leaves .noinstr.text section
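
Something along these lines, assuming the usual
arch/x86/include/asm/special_insns.h shape of these helpers, with only the
inlining annotation changing (sketch, not a tested patch):

static __always_inline unsigned long __native_read_cr3(void)
{
	unsigned long val;

	/* __FORCE_ORDER keeps the asm from being reordered or elided. */
	asm volatile("mov %%cr3,%0\n\t" : "=r" (val) : __FORCE_ORDER);
	return val;
}

static __always_inline void native_write_cr3(unsigned long val)
{
	asm volatile("mov %0,%%cr3" : : "r" (val) : "memory");
}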

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs
  2025-01-14 17:51 ` [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
  2025-01-14 21:13   ` Sean Christopherson
@ 2025-01-14 21:26   ` Sean Christopherson
  2025-01-24 10:48   ` K Prateek Nayak
  2 siblings, 0 replies; 86+ messages in thread
From: Sean Christopherson @ 2025-01-14 21:26 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Peter Zijlstra, Nicolas Saenz Julienne,
	Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Paolo Bonzini, Andy Lutomirski, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Geert Uytterhoeven, Yosry Ahmed, Kirill A. Shutemov,
	Masami Hiramatsu (Google), Jinghao Jia, Luis Chamberlain,
	Randy Dunlap, Tiezhu Yang

On Tue, Jan 14, 2025, Valentin Schneider wrote:
>  static __always_inline void arch_context_tracking_work(enum ct_work work)
>  {
>  	switch (work) {
> -	case CT_WORK_n:
> -		// Do work...
> +	case CT_WORK_SYNC:
> +		sync_core();

Not your bug, but serialize() needs to be __always_inline.  Not sure what exactly
caused it to be uninlined, but this is the obvious new usage.

vmlinux.o: warning: objtool: __static_call_update_early+0x4e: call to serialize() leaves .noinstr.text section
vmlinux.o: warning: objtool: ct_work_flush+0x69: call to serialize() leaves .noinstr.text section
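
I.e. roughly the below, assuming the existing asm/special_insns.h body and
only changing the annotation (sketch):

static __always_inline void serialize(void)
{
	/* SERIALIZE, emitted as raw opcode bytes so older assemblers cope. */
	asm volatile(".byte 0xf, 0x1, 0xe8" ::: "memory");
}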

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 26/30] x86,tlb: Make __flush_tlb_global() noinstr-compliant
  2025-01-14 17:51 ` [PATCH v4 26/30] x86,tlb: Make __flush_tlb_global() noinstr-compliant Valentin Schneider
@ 2025-01-14 21:45   ` Dave Hansen
  2025-01-17 13:44     ` Valentin Schneider
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Hansen @ 2025-01-14 21:45 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel, x86, virtualization,
	linux-arm-kernel, loongarch, linux-riscv, linux-perf-users,
	xen-devel, kvm, linux-arch, rcu, linux-hardening, linux-mm,
	linux-kselftest, bpf, bcm-kernel-feedback-list
  Cc: Peter Zijlstra, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

[-- Attachment #1: Type: text/plain, Size: 1145 bytes --]

On 1/14/25 09:51, Valentin Schneider wrote:
> +	cr4 = this_cpu_read(cpu_tlbstate.cr4);
> +	asm volatile("mov %0,%%cr4": : "r" (cr4 ^ X86_CR4_PGE) : "memory");
> +	asm volatile("mov %0,%%cr4": : "r" (cr4) : "memory");
> +	/*
> +	 * In lieu of not having the pinning crap, hard fail if CR4 doesn't
> +	 * match the expected value. This ensures that anybody doing dodgy gets
> +	 * the fallthrough check.
> +	 */
> +	BUG_ON(cr4 != this_cpu_read(cpu_tlbstate.cr4));

Let's say someone managed to write to cpu_tlbstate.cr4 where they
cleared one of the pinned bits.

Before this patch, CR4 pinning would WARN_ONCE() about it pretty quickly
and also reset the cleared bits.

After this patch, the first native_flush_tlb_global() can clear pinned
bits, at least until native_write_cr4() gets called the next time. That
seems like it'll undermine CR4 pinning at least somewhat.

What keeps native_write_cr4() from being noinstr-compliant now? Is it
just the WARN_ONCE()?

If so, I'd kinda rather have a native_write_cr4_nowarn() that's
noinstr-compliant but retains all the other CR4 pinning behavior. Would
something like the attached patch be _worse_?

[-- Attachment #2: cr4.patch --]
[-- Type: text/x-patch, Size: 1484 bytes --]

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3e9037690814..2044d516f06f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -423,24 +423,40 @@ void native_write_cr0(unsigned long val)
 }
 EXPORT_SYMBOL(native_write_cr0);
 
-void __no_profile native_write_cr4(unsigned long val)
+void __no_profile __native_write_cr4(unsigned long val, unsigned long *bits_changed)
 {
-	unsigned long bits_changed = 0;
-
 set_register:
 	asm volatile("mov %0,%%cr4": "+r" (val) : : "memory");
 
 	if (static_branch_likely(&cr_pinning)) {
 		if (unlikely((val & cr4_pinned_mask) != cr4_pinned_bits)) {
-			bits_changed = (val & cr4_pinned_mask) ^ cr4_pinned_bits;
+			*bits_changed = (val & cr4_pinned_mask) ^ cr4_pinned_bits;
 			val = (val & ~cr4_pinned_mask) | cr4_pinned_bits;
 			goto set_register;
 		}
-		/* Warn after we've corrected the changed bits. */
-		WARN_ONCE(bits_changed, "pinned CR4 bits changed: 0x%lx!?\n",
-			  bits_changed);
 	}
 }
+
+void __no_profile native_write_cr4(unsigned long val)
+{
+	unsigned long bits_changed = 0;
+
+	__native_write_cr4(val, &bits_changed);
+
+	if (!bits_changed)
+		return;
+
+	WARN_ONCE(bits_changed, "pinned CR4 bits changed: 0x%lx!?\n",
+		  bits_changed);
+}
+
+void __no_profile native_write_cr4_nowarn(unsigned long val)
+{
+	unsigned long bits_changed = 0;
+
+	__native_write_cr4(val, &bits_changed);
+}
+
 #if IS_MODULE(CONFIG_LKDTM)
 EXPORT_SYMBOL_GPL(native_write_cr4);
 #endif

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs
  2025-01-14 21:13   ` Sean Christopherson
@ 2025-01-14 21:48     ` Sean Christopherson
  2025-01-17  9:54       ` Valentin Schneider
  2025-01-17  9:47     ` Valentin Schneider
  1 sibling, 1 reply; 86+ messages in thread
From: Sean Christopherson @ 2025-01-14 21:48 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Peter Zijlstra, Nicolas Saenz Julienne,
	Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Paolo Bonzini, Andy Lutomirski, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Geert Uytterhoeven, Yosry Ahmed, Kirill A. Shutemov,
	Masami Hiramatsu (Google), Jinghao Jia, Luis Chamberlain,
	Randy Dunlap, Tiezhu Yang

On Tue, Jan 14, 2025, Sean Christopherson wrote:
> On Tue, Jan 14, 2025, Valentin Schneider wrote:
> > +/**
> > + * is_kernel_noinstr_text - checks if the pointer address is located in the
> > + *                    .noinstr section
> > + *
> > + * @addr: address to check
> > + *
> > + * Returns: true if the address is located in .noinstr, false otherwise.
> > + */
> > +static inline bool is_kernel_noinstr_text(unsigned long addr)
> > +{
> > +	return addr >= (unsigned long)__noinstr_text_start &&
> > +	       addr < (unsigned long)__noinstr_text_end;
> > +}
> 
> This doesn't do the right thing for modules, which matters because KVM can be
> built as a module on x86, and because context tracking understands transitions
> to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not
> being in the kernel, and thus will have IPIs deferred.  If KVM uses a static key
> or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the
> patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically
> use the wrong static path.
> 
> I don't expect this to ever cause problems in practice, because patching code in
> KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are
> actively running guest code, would be all kinds of crazy.  But I do think we
> should plug the hole.
> 
> If this issue is unique to KVM, i.e. is not a generic problem for all modules (I
> assume module code generally isn't allowed in the entry path, even via NMI?), one
> idea would be to let KVM register its noinstr section for text poking.

Another idea would be to track which keys/branches are tagged noinstr, i.e. generate
the information at compile time instead of doing lookups at runtime.  The biggest
downside I can think of is that it would require plumbing in the information to
text_poke_bp_batch().
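
A rough illustration of the compile-time variant; the macro and section name
below are invented for the example and are not existing infrastructure:

/*
 * Record at build time that this key is used in noinstr code, so the
 * patching path knows it must not defer the sync IPI for it.
 */
#define DEFINE_STATIC_KEY_FALSE_NOINSTR(name)				\
	DEFINE_STATIC_KEY_FALSE(name);					\
	static struct static_key_false * __used				\
	__section(".data.noinstr_keys") __noinstr_key_##name = &name

text_poke_bp_batch() (or its callers) could then match the key being patched
against that table instead of resolving which text range the poke lands in at
runtime.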

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs
  2025-01-14 21:13   ` Sean Christopherson
  2025-01-14 21:48     ` Sean Christopherson
@ 2025-01-17  9:47     ` Valentin Schneider
  2025-01-17 17:15       ` Sean Christopherson
  1 sibling, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-17  9:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Peter Zijlstra, Nicolas Saenz Julienne,
	Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Paolo Bonzini, Andy Lutomirski, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Geert Uytterhoeven, Yosry Ahmed, Kirill A. Shutemov,
	Masami Hiramatsu (Google), Jinghao Jia, Luis Chamberlain,
	Randy Dunlap, Tiezhu Yang

On 14/01/25 13:13, Sean Christopherson wrote:
> On Tue, Jan 14, 2025, Valentin Schneider wrote:
>> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
>> them vs the newly patched instruction. CPUs that are executing in userspace
>> do not need this synchronization to happen immediately, and this is
>> actually harmful interference for NOHZ_FULL CPUs.
>
> ...
>
>> This leaves us with static keys and static calls.
>
> ...
>
>> @@ -2317,11 +2334,20 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>>       * First step: add a int3 trap to the address that will be patched.
>>       */
>>      for (i = 0; i < nr_entries; i++) {
>> -		tp[i].old = *(u8 *)text_poke_addr(&tp[i]);
>> -		text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
>> +		void *addr = text_poke_addr(&tp[i]);
>> +
>> +		/*
>> +		 * There's no safe way to defer IPIs for patching text in
>> +		 * .noinstr, record whether there is at least one such poke.
>> +		 */
>> +		if (is_kernel_noinstr_text((unsigned long)addr))
>> +			cond = NULL;
>
> Maybe pre-check "cond", especially if multiple ranges need to be checked?  I.e.
>
>               if (cond && is_kernel_noinstr_text(...))
>> +
>> +		tp[i].old = *((u8 *)addr);
>> +		text_poke(addr, &int3, INT3_INSN_SIZE);
>>      }
>>
>> -	text_poke_sync();
>> +	__text_poke_sync(cond);
>>
>>      /*
>>       * Second step: update all but the first byte of the patched range.
>
> ...
>
>> +/**
>> + * is_kernel_noinstr_text - checks if the pointer address is located in the
>> + *                    .noinstr section
>> + *
>> + * @addr: address to check
>> + *
>> + * Returns: true if the address is located in .noinstr, false otherwise.
>> + */
>> +static inline bool is_kernel_noinstr_text(unsigned long addr)
>> +{
>> +	return addr >= (unsigned long)__noinstr_text_start &&
>> +	       addr < (unsigned long)__noinstr_text_end;
>> +}
>
> This doesn't do the right thing for modules, which matters because KVM can be
> built as a module on x86, and because context tracking understands transitions
> to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not
> being in the kernel, and thus will have IPIs deferred.  If KVM uses a static key
> or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the
> patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically
> use the wrong static path.
>

AFAICT guest_state_{enter,exit}_irqoff() are only used in noinstr functions
and thus such a static key usage should at the very least be caught and
warned about by objtool - when this isn't built as a module.

I never really thought about noinstr sections for modules; I can get
objtool to warn about a static key that isn't flagged as noinstr-allowed being
used in e.g. vmx_vcpu_enter_exit() just by feeding it vmx.o:

arch/x86/kvm/vmx/vmx.o: warning: objtool: vmx_vcpu_enter_exit.isra.0+0x0: dummykey: non-RO static key usage in noinstr

...but that requires removing a lot of code first, because objtool stops its
noinstr checks early when it hits functions it doesn't have full information
on, e.g.

arch/x86/kvm/vmx/vmx.o: warning: objtool: vmx_vcpu_enter_exit+0x21c: call to __ct_user_enter() leaves .noinstr.text section

__ct_user_enter() *is* noinstr, but you don't get that from just the header prototype.

> I don't expect this to ever cause problems in practice, because patching code in
> KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are
> actively running guest code, would be all kinds of crazy.  But I do think we
> should plug the hole.
>
> If this issue is unique to KVM, i.e. is not a generic problem for all modules (I
> assume module code generally isn't allowed in the entry path, even via NMI?), one
> idea would be to let KVM register its noinstr section for text poking.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 18/30] x86/kvm/vmx: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys as allowed in .noinstr
  2025-01-14 21:19   ` Sean Christopherson
@ 2025-01-17  9:50     ` Valentin Schneider
  0 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-17  9:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Josh Poimboeuf, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, Boris Ostrovsky, Pawan Gupta, Paolo Bonzini,
	Andy Lutomirski, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 14/01/25 13:19, Sean Christopherson wrote:
> Please use "KVM: VMX:" for the scope.
>
> On Tue, Jan 14, 2025, Valentin Schneider wrote:
>> Later commits will cause objtool to warn about static keys being used in
>> .noinstr sections in order to safely defer instruction patching IPIs
>> targeted at NOHZ_FULL CPUs.
>>
>> These keys are used in .noinstr code, and can be modified at runtime
>> (/proc/kernel/vmx* write). However it is not expected that they will be
>> flipped during latency-sensitive operations, and thus shouldn't be a source
>> of interference wrt the text patching IPI.
>
> This misses KVM's static key that's buried behind CONFIG_HYPERV=m|y.
>
> vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x241: __kvm_is_using_evmcs: non-RO static key usage in noinstr
> vmlinux.o: warning: objtool: vmx_update_host_rsp+0x13: __kvm_is_using_evmcs: non-RO static key usage in noinstr
>

Thanks, I'll add these to v5.

> Side topic, it's super annoying that "objtool --noinstr" only runs on vmlinux.o.
> I realize objtool doesn't have the visilibity to validate cross-object calls,
> but couldn't objtool validates calls and static key/branch usage so long as the
> target or key/branch is defined in the same object?

Per my testing you can manually run it on individual objects, but it can
and will easily get hung up on the first noinstr violation it finds and not
search further within one given function.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs
  2025-01-14 21:48     ` Sean Christopherson
@ 2025-01-17  9:54       ` Valentin Schneider
  0 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-17  9:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Peter Zijlstra, Nicolas Saenz Julienne,
	Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Paolo Bonzini, Andy Lutomirski, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Geert Uytterhoeven, Yosry Ahmed, Kirill A. Shutemov,
	Masami Hiramatsu (Google), Jinghao Jia, Luis Chamberlain,
	Randy Dunlap, Tiezhu Yang

On 14/01/25 13:48, Sean Christopherson wrote:
> On Tue, Jan 14, 2025, Sean Christopherson wrote:
>> On Tue, Jan 14, 2025, Valentin Schneider wrote:
>> > +/**
>> > + * is_kernel_noinstr_text - checks if the pointer address is located in the
>> > + *                    .noinstr section
>> > + *
>> > + * @addr: address to check
>> > + *
>> > + * Returns: true if the address is located in .noinstr, false otherwise.
>> > + */
>> > +static inline bool is_kernel_noinstr_text(unsigned long addr)
>> > +{
>> > +	return addr >= (unsigned long)__noinstr_text_start &&
>> > +	       addr < (unsigned long)__noinstr_text_end;
>> > +}
>>
>> This doesn't do the right thing for modules, which matters because KVM can be
>> built as a module on x86, and because context tracking understands transitions
>> to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not
>> being in the kernel, and thus will have IPIs deferred.  If KVM uses a static key
>> or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the
>> patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically
>> use the wrong static path.
>>
>> I don't expect this to ever cause problems in practice, because patching code in
>> KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are
>> actively running guest code, would be all kinds of crazy.  But I do think we
>> should plug the hole.
>>
>> If this issue is unique to KVM, i.e. is not a generic problem for all modules (I
>> assume module code generally isn't allowed in the entry path, even via NMI?), one
>> idea would be to let KVM register its noinstr section for text poking.
>
> Another idea would be to track which keys/branches are tagged noinstr, i.e. generate
> the information at compile time instead of doing lookups at runtime.  The biggest
> downside I can think of is that it would require plumbing in the information to
> text_poke_bp_batch().

IIUC that's what I went for in v3:

https://lore.kernel.org/lkml/20241119153502.41361-11-vschneid@redhat.com

but, modules notwithstanding, simply checking if the patched instruction is
in .noinstr was a lot neater.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 26/30] x86,tlb: Make __flush_tlb_global() noinstr-compliant
  2025-01-14 21:45   ` Dave Hansen
@ 2025-01-17 13:44     ` Valentin Schneider
  0 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-17 13:44 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86, virtualization, linux-arm-kernel,
	loongarch, linux-riscv, linux-perf-users, xen-devel, kvm,
	linux-arch, rcu, linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list
  Cc: Peter Zijlstra, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Boris Ostrovsky, Josh Poimboeuf,
	Pawan Gupta, Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 14/01/25 13:45, Dave Hansen wrote:
> On 1/14/25 09:51, Valentin Schneider wrote:
>> +	cr4 = this_cpu_read(cpu_tlbstate.cr4);
>> +	asm volatile("mov %0,%%cr4": : "r" (cr4 ^ X86_CR4_PGE) : "memory");
>> +	asm volatile("mov %0,%%cr4": : "r" (cr4) : "memory");
>> +	/*
>> +	 * In lieu of not having the pinning crap, hard fail if CR4 doesn't
>> +	 * match the expected value. This ensures that anybody doing dodgy gets
>> +	 * the fallthrough check.
>> +	 */
>> +	BUG_ON(cr4 != this_cpu_read(cpu_tlbstate.cr4));
>
> Let's say someone managed to write to cpu_tlbstate.cr4 where they
> cleared one of the pinned bits.
>
> Before this patch, CR4 pinning would WARN_ONCE() about it pretty quickly
> and also reset the cleared bits.
>
> After this patch, the first native_flush_tlb_global() can clear pinned
> bits, at least until native_write_cr4() gets called the next time. That
> seems like it'll undermine CR4 pinning at least somewhat.
>

The BUG_ON() should still catch any pinned bit mishandling, however...

> What keeps native_write_cr4() from being noinstr-compliant now? Is it
> just the WARN_ONCE()?
>

I don't think that's even an issue since __WARN_printf() wraps the print in
instrumentation_{begin,end}(). In v3 I made native_write_cr4() noinstr and
added a non-noinstr wrapper to be used in existing callsites.

AFAICT if acceptable we could make the whole thing noinstr and stick with
that; Peter, is there something I missed that made you write the handmade
noinstr CR4 RMW?


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-14 18:16   ` Jann Horn
@ 2025-01-17 15:25     ` Valentin Schneider
  2025-01-17 15:52       ` Jann Horn
  2025-01-17 16:11       ` Uladzislau Rezki
  0 siblings, 2 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-17 15:25 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 14/01/25 19:16, Jann Horn wrote:
> On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> flush_tlb_kernel_range() IPIs.
>>
>> Given that CPUs executing in userspace do not access data in the vmalloc
>> range, these IPIs could be deferred until their next kernel entry.
>>
>> Deferral vs early entry danger zone
>> ===================================
>>
>> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> and then accessed in early entry code.
>
> In other words, it needs a guarantee that no vmalloc allocations that
> have been created in the vmalloc region while the CPU was idle can
> then be accessed during early entry, right?

I'm not sure if that would be a problem (not an mm expert, please do
correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
deferred anyway.

So after vmapping something, I wouldn't expect isolated CPUs to have
invalid TLB entries for the newly vmapped page.

However, upon vunmap'ing something, the TLB flush is deferred, and thus
stale TLB entries can and will remain on isolated CPUs, up until they
execute the deferred flush themselves (IOW for the entire duration of the
"danger zone").

Does that make sense?


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-17 15:25     ` Valentin Schneider
@ 2025-01-17 15:52       ` Jann Horn
  2025-01-17 16:53         ` Valentin Schneider
                           ` (2 more replies)
  2025-01-17 16:11       ` Uladzislau Rezki
  1 sibling, 3 replies; 86+ messages in thread
From: Jann Horn @ 2025-01-17 15:52 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
> On 14/01/25 19:16, Jann Horn wrote:
> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> flush_tlb_kernel_range() IPIs.
> >>
> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> range, these IPIs could be deferred until their next kernel entry.
> >>
> >> Deferral vs early entry danger zone
> >> ===================================
> >>
> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> and then accessed in early entry code.
> >
> > In other words, it needs a guarantee that no vmalloc allocations that
> > have been created in the vmalloc region while the CPU was idle can
> > then be accessed during early entry, right?
>
> I'm not sure if that would be a problem (not an mm expert, please do
> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> deferred anyway.

flush_cache_vmap() is about stuff like flushing data caches on
architectures with virtually indexed caches; that doesn't do TLB
maintenance. When you look for its definition on x86 or arm64, you'll
see that they use the generic implementation which is simply an empty
inline function.

> So after vmapping something, I wouldn't expect isolated CPUs to have
> invalid TLB entries for the newly vmapped page.
>
> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> stale TLB entries can and will remain on isolated CPUs, up until they
> execute the deferred flush themselves (IOW for the entire duration of the
> "danger zone").
>
> Does that make sense?

The design idea wrt TLB flushes in the vmap code is that you don't do
TLB flushes when you unmap stuff or when you map stuff, because doing
TLB flushes across the entire system on every vmap/vunmap would be a
bit costly; instead you just do batched TLB flushes in between, in
__purge_vmap_area_lazy().

In other words, the basic idea is that you can keep calling vmap() and
vunmap() a bunch of times without ever doing TLB flushes until you run
out of virtual memory in the vmap region; then you do one big TLB
flush, and afterwards you can reuse the free virtual address space for
new allocations again.

So if you "defer" that batched TLB flush for CPUs that are not
currently running in the kernel, I think the consequence is that those
CPUs may end up with incoherent TLB state after a reallocation of the
virtual address space.

Actually, I think this would mean that your optimization is disallowed
at least on arm64 - I'm not sure about the exact wording, but arm64
has a "break before make" rule that forbids conflicting writable
address translations or something like that.

(I said "until you run out of virtual memory in the vmap region", but
that's not actually true - see the comment above lazy_max_pages() for
an explanation of the actual heuristic. You might be able to tune that
a bit if you'd be significantly happier with less frequent
interruptions, or something along those lines.)
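
A very rough sketch of that flow - this is not the actual mm/vmalloc.c code,
and purge_start/purge_end/free_purged_vmap_areas() are stand-ins:

/* vmap_lazy_nr pages have been vunmap'd but their TLB entries may remain. */
if (atomic_long_read(&vmap_lazy_nr) > lazy_max_pages()) {
	/* One big flush covering every lazily-freed range... */
	flush_tlb_kernel_range(purge_start, purge_end);
	/* ...after which that virtual address space can be handed out again. */
	free_purged_vmap_areas();
}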

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-17 15:25     ` Valentin Schneider
  2025-01-17 15:52       ` Jann Horn
@ 2025-01-17 16:11       ` Uladzislau Rezki
  2025-01-17 17:00         ` Valentin Schneider
  1 sibling, 1 reply; 86+ messages in thread
From: Uladzislau Rezki @ 2025-01-17 16:11 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Jann Horn, linux-kernel, x86, virtualization, linux-arm-kernel,
	loongarch, linux-riscv, linux-perf-users, xen-devel, kvm,
	linux-arch, rcu, linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
> On 14/01/25 19:16, Jann Horn wrote:
> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> flush_tlb_kernel_range() IPIs.
> >>
> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> range, these IPIs could be deferred until their next kernel entry.
> >>
> >> Deferral vs early entry danger zone
> >> ===================================
> >>
> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> and then accessed in early entry code.
> >
> > In other words, it needs a guarantee that no vmalloc allocations that
> > have been created in the vmalloc region while the CPU was idle can
> > then be accessed during early entry, right?
> 
> I'm not sure if that would be a problem (not an mm expert, please do
> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> deferred anyway.
> 
> So after vmapping something, I wouldn't expect isolated CPUs to have
> invalid TLB entries for the newly vmapped page.
> 
> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> stale TLB entries can and will remain on isolated CPUs, up until they
> execute the deferred flush themselves (IOW for the entire duration of the
> "danger zone").
> 
> Does that make sense?
> 
Probably I am missing something and need to have a look at your patches,
but how do you guarantee that no one maps the same area that you have
deferred the TLB flushing for?

As noted by Jann, we already defer TLB flushing by batching freed areas
until a certain threshold, and just after we cross it we do a flush.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-17 15:52       ` Jann Horn
@ 2025-01-17 16:53         ` Valentin Schneider
  2025-02-19 15:05           ` Joel Fernandes
  2025-01-27 15:51         ` Will Deacon
  2025-02-10 18:36         ` Valentin Schneider
  2 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-17 16:53 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 17/01/25 16:52, Jann Horn wrote:
> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> On 14/01/25 19:16, Jann Horn wrote:
>> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> flush_tlb_kernel_range() IPIs.
>> >>
>> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> range, these IPIs could be deferred until their next kernel entry.
>> >>
>> >> Deferral vs early entry danger zone
>> >> ===================================
>> >>
>> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> and then accessed in early entry code.
>> >
>> > In other words, it needs a guarantee that no vmalloc allocations that
>> > have been created in the vmalloc region while the CPU was idle can
>> > then be accessed during early entry, right?
>>
>> I'm not sure if that would be a problem (not an mm expert, please do
>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> deferred anyway.
>
> flush_cache_vmap() is about stuff like flushing data caches on
> architectures with virtually indexed caches; that doesn't do TLB
> maintenance. When you look for its definition on x86 or arm64, you'll
> see that they use the generic implementation which is simply an empty
> inline function.
>
>> So after vmapping something, I wouldn't expect isolated CPUs to have
>> invalid TLB entries for the newly vmapped page.
>>
>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> stale TLB entries can and will remain on isolated CPUs, up until they
>> execute the deferred flush themselves (IOW for the entire duration of the
>> "danger zone").
>>
>> Does that make sense?
>
> The design idea wrt TLB flushes in the vmap code is that you don't do
> TLB flushes when you unmap stuff or when you map stuff, because doing
> TLB flushes across the entire system on every vmap/vunmap would be a
> bit costly; instead you just do batched TLB flushes in between, in
> __purge_vmap_area_lazy().
>
> In other words, the basic idea is that you can keep calling vmap() and
> vunmap() a bunch of times without ever doing TLB flushes until you run
> out of virtual memory in the vmap region; then you do one big TLB
> flush, and afterwards you can reuse the free virtual address space for
> new allocations again.
>
> So if you "defer" that batched TLB flush for CPUs that are not
> currently running in the kernel, I think the consequence is that those
> CPUs may end up with incoherent TLB state after a reallocation of the
> virtual address space.
>

Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
CPU accesses it during early entry;

> Actually, I think this would mean that your optimization is disallowed
> at least on arm64 - I'm not sure about the exact wording, but arm64
> has a "break before make" rule that forbids conflicting writable
> address translations or something like that.
>

On the bright side of things, arm64 is not as bad as x86 when it comes to
IPI'ing isolated CPUs :-) I'll add that to my notes, thanks!

> (I said "until you run out of virtual memory in the vmap region", but
> that's not actually true - see the comment above lazy_max_pages() for
> an explanation of the actual heuristic. You might be able to tune that
> a bit if you'd be significantly happier with less frequent
> interruptions, or something along those lines.)


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-17 16:11       ` Uladzislau Rezki
@ 2025-01-17 17:00         ` Valentin Schneider
  2025-01-20 11:15           ` Uladzislau Rezki
  0 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-17 17:00 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Jann Horn, linux-kernel, x86, virtualization, linux-arm-kernel,
	loongarch, linux-riscv, linux-perf-users, xen-devel, kvm,
	linux-arch, rcu, linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 17/01/25 17:11, Uladzislau Rezki wrote:
> On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
>> On 14/01/25 19:16, Jann Horn wrote:
>> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> flush_tlb_kernel_range() IPIs.
>> >>
>> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> range, these IPIs could be deferred until their next kernel entry.
>> >>
>> >> Deferral vs early entry danger zone
>> >> ===================================
>> >>
>> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> and then accessed in early entry code.
>> >
>> > In other words, it needs a guarantee that no vmalloc allocations that
>> > have been created in the vmalloc region while the CPU was idle can
>> > then be accessed during early entry, right?
>>
>> I'm not sure if that would be a problem (not an mm expert, please do
>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> deferred anyway.
>>
>> So after vmapping something, I wouldn't expect isolated CPUs to have
>> invalid TLB entries for the newly vmapped page.
>>
>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> stale TLB entries can and will remain on isolated CPUs, up until they
>> execute the deferred flush themselves (IOW for the entire duration of the
>> "danger zone").
>>
>> Does that make sense?
>>
> Probably I am missing something and need to have a look at your patches,
> but how do you guarantee that no one maps the same area that you defer
> the TLB flushing for?
>

That's the cool part: I don't :')

For deferring instruction patching IPIs, I (well Josh really) managed to
get instrumentation to back me up and catch any problematic area.

I looked into getting something similar for vmalloc region access in
.noinstr code, but I didn't get anywhere. I even tried using emulated
watchpoints on QEMU to watch the whole vmalloc range, but that went about
as well as you could expect.

That left me with staring at code. AFAICT the only vmap'd thing that is
accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
itself cannot be freed until the task exits - thus can't be subject to
invalidation when a task is entering kernelspace.

If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).

> As noted by Jann, we already defer a TLB flushing by backing freed areas
> until certain threshold and just after we cross it we do a flush.
>
> --
> Uladzislau Rezki


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs
  2025-01-17  9:47     ` Valentin Schneider
@ 2025-01-17 17:15       ` Sean Christopherson
  2025-01-20 13:53         ` Valentin Schneider
  0 siblings, 1 reply; 86+ messages in thread
From: Sean Christopherson @ 2025-01-17 17:15 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Peter Zijlstra, Nicolas Saenz Julienne,
	Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Paolo Bonzini, Andy Lutomirski, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Geert Uytterhoeven, Yosry Ahmed, Kirill A. Shutemov,
	Masami Hiramatsu (Google), Jinghao Jia, Luis Chamberlain,
	Randy Dunlap, Tiezhu Yang

On Fri, Jan 17, 2025, Valentin Schneider wrote:
> On 14/01/25 13:13, Sean Christopherson wrote:
> > On Tue, Jan 14, 2025, Valentin Schneider wrote:
> >> +/**
> >> + * is_kernel_noinstr_text - checks if the pointer address is located in the
> >> + *                    .noinstr section
> >> + *
> >> + * @addr: address to check
> >> + *
> >> + * Returns: true if the address is located in .noinstr, false otherwise.
> >> + */
> >> +static inline bool is_kernel_noinstr_text(unsigned long addr)
> >> +{
> >> +	return addr >= (unsigned long)__noinstr_text_start &&
> >> +	       addr < (unsigned long)__noinstr_text_end;
> >> +}
> >
> > This doesn't do the right thing for modules, which matters because KVM can be
> > built as a module on x86, and because context tracking understands transitions
> > to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not
> > being in the kernel, and thus will have IPIs deferred.  If KVM uses a static key
> > or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the
> > patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically
> > use the wrong static path.
> 
> AFAICT guest_state_{enter,exit}_irqoff() are only used in noinstr functions
> and thus such a static key usage should at the very least be caught and
> warned about by objtool - when this isn't built as a module.

That doesn't magically do the right thing though.  If KVM is built as a module,
is_kernel_noinstr_text() will get false negatives even for static keys/branches
that are annotated as NOINSTR.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-17 17:00         ` Valentin Schneider
@ 2025-01-20 11:15           ` Uladzislau Rezki
  2025-01-20 16:09             ` Valentin Schneider
  0 siblings, 1 reply; 86+ messages in thread
From: Uladzislau Rezki @ 2025-01-20 11:15 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Uladzislau Rezki, Jann Horn, linux-kernel, x86, virtualization,
	linux-arm-kernel, loongarch, linux-riscv, linux-perf-users,
	xen-devel, kvm, linux-arch, rcu, linux-hardening, linux-mm,
	linux-kselftest, bpf, bcm-kernel-feedback-list, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
> On 17/01/25 17:11, Uladzislau Rezki wrote:
> > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
> >> On 14/01/25 19:16, Jann Horn wrote:
> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> >> flush_tlb_kernel_range() IPIs.
> >> >>
> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> >> range, these IPIs could be deferred until their next kernel entry.
> >> >>
> >> >> Deferral vs early entry danger zone
> >> >> ===================================
> >> >>
> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> >> and then accessed in early entry code.
> >> >
> >> > In other words, it needs a guarantee that no vmalloc allocations that
> >> > have been created in the vmalloc region while the CPU was idle can
> >> > then be accessed during early entry, right?
> >>
> >> I'm not sure if that would be a problem (not an mm expert, please do
> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> >> deferred anyway.
> >>
> >> So after vmapping something, I wouldn't expect isolated CPUs to have
> >> invalid TLB entries for the newly vmapped page.
> >>
> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> >> stale TLB entries can and will remain on isolated CPUs, up until they
> >> execute the deferred flush themselves (IOW for the entire duration of the
> >> "danger zone").
> >>
> >> Does that make sense?
> >>
> > Probably I am missing something and need to have a look at your patches,
> > but how do you guarantee that no one maps the same area that you defer
> > the TLB flushing for?
> >
> 
> That's the cool part: I don't :')
> 
Indeed, sounds unsafe :) Then we just should not free such areas.

> For deferring instruction patching IPIs, I (well Josh really) managed to
> get instrumentation to back me up and catch any problematic area.
> 
> I looked into getting something similar for vmalloc region access in
> .noinstr code, but I didn't get anywhere. I even tried using emulated
> watchpoints on QEMU to watch the whole vmalloc range, but that went about
> as well as you could expect.
> 
> That left me with staring at code. AFAICT the only vmap'd thing that is
> accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
> itself cannot be freed until the task exits - thus can't be subject to
> invalidation when a task is entering kernelspace.
> 
> If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
> 
As noted before, we defer flushing for vmalloc. We have a lazy threshold
which can be exposed (if you need it) over sysfs for tuning. So, we can add it.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs
  2025-01-17 17:15       ` Sean Christopherson
@ 2025-01-20 13:53         ` Valentin Schneider
  0 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-01-20 13:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Peter Zijlstra, Nicolas Saenz Julienne,
	Juergen Gross, Ajay Kaher, Alexey Makhalov, Russell King,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Paolo Bonzini, Andy Lutomirski, Arnd Bergmann,
	Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Geert Uytterhoeven, Yosry Ahmed, Kirill A. Shutemov,
	Masami Hiramatsu (Google), Jinghao Jia, Luis Chamberlain,
	Randy Dunlap, Tiezhu Yang

On 17/01/25 09:15, Sean Christopherson wrote:
> On Fri, Jan 17, 2025, Valentin Schneider wrote:
>> On 14/01/25 13:13, Sean Christopherson wrote:
>> > On Tue, Jan 14, 2025, Valentin Schneider wrote:
>> >> +/**
>> >> + * is_kernel_noinstr_text - checks if the pointer address is located in the
>> >> + *                    .noinstr section
>> >> + *
>> >> + * @addr: address to check
>> >> + *
>> >> + * Returns: true if the address is located in .noinstr, false otherwise.
>> >> + */
>> >> +static inline bool is_kernel_noinstr_text(unsigned long addr)
>> >> +{
>> >> +	return addr >= (unsigned long)__noinstr_text_start &&
>> >> +	       addr < (unsigned long)__noinstr_text_end;
>> >> +}
>> >
>> > This doesn't do the right thing for modules, which matters because KVM can be
>> > built as a module on x86, and because context tracking understands transitions
>> > to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not
>> > being in the kernel, and thus will have IPIs deferred.  If KVM uses a static key
>> > or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the
>> > patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically
>> > use the wrong static path.
>>>
>> AFAICT guest_state_{enter,exit}_irqoff() are only used in noinstr functions
>> and thus such a static key usage should at the very least be caught and
>> warned about by objtool - when this isn't built as a module.
>
> That doesn't magically do the right thing though.  If KVM is built as a module,
> is_kernel_noinstr_text() will get false negatives even for static keys/branches
> that are annotated as NOINSTR.

Quite so.

I've been looking at mod_mem_type & friends; I'm thinking adding a
MOD_NOINSTR_TEXT type might be overkill, considering modules really
shouldn't be involved with early entry, KVM being the one exception.

Your suggestion to have a KVM-module-specific noinstr section sounds good
to me, I'll have a look at that.
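
To make sure we're talking about the same thing, here's the rough shape
of the lookup I'm picturing, as a compilable userspace sketch - the
kvm_noinstr_* range, its registration, and is_module_noinstr_text() are
all hypothetical, nothing like them exists yet:

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the core kernel's __noinstr_text_{start,end} symbols. */
static unsigned long noinstr_text_start = 0x1000;
static unsigned long noinstr_text_end   = 0x2000;

/* Hypothetical range that kvm.ko would register for its noinstr-like code. */
static unsigned long kvm_noinstr_start  = 0x9000;
static unsigned long kvm_noinstr_end    = 0x9800;

static bool is_kernel_noinstr_text(unsigned long addr)
{
	return addr >= noinstr_text_start && addr < noinstr_text_end;
}

static bool is_module_noinstr_text(unsigned long addr)
{
	/* In reality this would walk whatever ranges modules registered. */
	return addr >= kvm_noinstr_start && addr < kvm_noinstr_end;
}

static bool is_noinstr_text(unsigned long addr)
{
	return is_kernel_noinstr_text(addr) || is_module_noinstr_text(addr);
}

int main(void)
{
	printf("%d %d %d\n",
	       is_noinstr_text(0x1800UL),	/* core kernel noinstr */
	       is_noinstr_text(0x9400UL),	/* KVM "module" noinstr */
	       is_noinstr_text(0x5000UL));	/* neither */
	return 0;
}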


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-20 11:15           ` Uladzislau Rezki
@ 2025-01-20 16:09             ` Valentin Schneider
  2025-01-21 17:00               ` Uladzislau Rezki
  0 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-20 16:09 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Uladzislau Rezki, Jann Horn, linux-kernel, x86, virtualization,
	linux-arm-kernel, loongarch, linux-riscv, linux-perf-users,
	xen-devel, kvm, linux-arch, rcu, linux-hardening, linux-mm,
	linux-kselftest, bpf, bcm-kernel-feedback-list, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 20/01/25 12:15, Uladzislau Rezki wrote:
> On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
>> On 17/01/25 17:11, Uladzislau Rezki wrote:
>> > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
>> >> On 14/01/25 19:16, Jann Horn wrote:
>> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> >> flush_tlb_kernel_range() IPIs.
>> >> >>
>> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> >> range, these IPIs could be deferred until their next kernel entry.
>> >> >>
>> >> >> Deferral vs early entry danger zone
>> >> >> ===================================
>> >> >>
>> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> >> and then accessed in early entry code.
>> >> >
>> >> > In other words, it needs a guarantee that no vmalloc allocations that
>> >> > have been created in the vmalloc region while the CPU was idle can
>> >> > then be accessed during early entry, right?
>> >>
>> >> I'm not sure if that would be a problem (not an mm expert, please do
>> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> >> deferred anyway.
>> >>
>> >> So after vmapping something, I wouldn't expect isolated CPUs to have
>> >> invalid TLB entries for the newly vmapped page.
>> >>
>> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> >> stale TLB entries can and will remain on isolated CPUs, up until they
>> >> execute the deferred flush themselves (IOW for the entire duration of the
>> >> "danger zone").
>> >>
>> >> Does that make sense?
>> >>
>> > Probably I am missing something and need to have a look at your patches,
>> > but how do you guarantee that no one maps the same area that you defer
>> > the TLB flushing for?
>> >
>>
>> That's the cool part: I don't :')
>>
> Indeed, sounds unsafe :) Then we just do not need to free areas.
>
>> For deferring instruction patching IPIs, I (well Josh really) managed to
>> get instrumentation to back me up and catch any problematic area.
>>
>> I looked into getting something similar for vmalloc region access in
>> .noinstr code, but I didn't get anywhere. I even tried using emulated
>> watchpoints on QEMU to watch the whole vmalloc range, but that went about
>> as well as you could expect.
>>
>> That left me with staring at code. AFAICT the only vmap'd thing that is
>> accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
>> itself cannot be freed until the task exits - thus can't be subject to
>> invalidation when a task is entering kernelspace.
>>
>> If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
>>
> As noted before, we defer flushing for vmalloc. We have a lazy threshold
> which can be exposed (if you need it) over sysfs for tuning. So, we can add it.
>

In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
single userspace application that will never enter the kernel, unless
forced to by some interference (e.g. IPI sent from a housekeeping CPU).

Increasing the lazy threshold would unfortunately only delay the
interference - housekeeping CPUs are free to run whatever, and so they will
eventually cause the lazy threshold to be hit and IPI all the CPUs,
including the isolated/NOHZ_FULL ones.

I was thinking maybe we could subdivide the vmap space into two regions
with their own thresholds, but a task may allocate/vmap stuff while on a HK
CPU and be moved to an isolated CPU afterwards, and also I still don't have
any strong guarantee about what accesses an isolated CPU can do in its
early entry code :(

> --
> Uladzislau Rezki


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 03/30] rcu: Add a small-width RCU watching counter debug option
  2025-01-14 17:51 ` [PATCH v4 03/30] rcu: Add a small-width RCU watching counter debug option Valentin Schneider
@ 2025-01-21 13:56   ` Frederic Weisbecker
  0 siblings, 0 replies; 86+ messages in thread
From: Frederic Weisbecker @ 2025-01-21 13:56 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Paul E . McKenney, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Tue, Jan 14, 2025 at 06:51:16PM +0100, Valentin Schneider wrote:
> A later commit will reduce the size of the RCU watching counter to free up
> some bits for another purpose. Paul suggested adding a config option to
> test the extreme case where the counter is reduced to its minimum usable
> width for rcutorture to poke at, so do that.
> 
> Make it only configurable under RCU_EXPERT. While at it, add a comment to
> explain the layout of context_tracking->state.
> 
> Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 04/30] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  2025-01-14 17:51 ` [PATCH v4 04/30] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE Valentin Schneider
@ 2025-01-21 14:00   ` Frederic Weisbecker
  0 siblings, 0 replies; 86+ messages in thread
From: Frederic Weisbecker @ 2025-01-21 14:00 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Paul E . McKenney, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Tue, Jan 14, 2025 at 06:51:17PM +0100, Valentin Schneider wrote:
> We now have an RCU_EXPERT config for testing a small-sized RCU dynticks
> counter: CONFIG_RCU_DYNTICKS_TORTURE.
> 
> Modify scenario TREE04 to use this config in order to exercise a
> ridiculously small counter (2 bits).
> 
> Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
> index dc4985064b3ad..67caf4276bb01 100644
> --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
> @@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
>  CONFIG_RCU_EXPERT=y
>  CONFIG_RCU_EQS_DEBUG=y
>  CONFIG_RCU_LAZY=y
> +CONFIG_RCU_DYNTICKS_TORTURE=y

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>


> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-20 16:09             ` Valentin Schneider
@ 2025-01-21 17:00               ` Uladzislau Rezki
  2025-01-24 15:22                 ` Valentin Schneider
  0 siblings, 1 reply; 86+ messages in thread
From: Uladzislau Rezki @ 2025-01-21 17:00 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Uladzislau Rezki, Jann Horn, linux-kernel, x86, virtualization,
	linux-arm-kernel, loongarch, linux-riscv, linux-perf-users,
	xen-devel, kvm, linux-arch, rcu, linux-hardening, linux-mm,
	linux-kselftest, bpf, bcm-kernel-feedback-list, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

> >
> > As noted before, we defer flushing for vmalloc. We have a lazy threshold
> > which can be exposed (if you need it) over sysfs for tuning. So, we can add it.
> >
> 
> In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
> single userspace application that will never enter the kernel, unless
> forced to by some interference (e.g. IPI sent from a housekeeping CPU).
> 
> Increasing the lazy threshold would unfortunately only delay the
> interference - housekeeping CPUs are free to run whatever, and so they will
> eventually cause the lazy threshold to be hit and IPI all the CPUs,
> including the isolated/NOHZ_FULL ones.
> 
Do you have any testing results for your workload? I mean, how much can
we potentially allocate? Again, maybe it is just enough to keep them
backed up and offload them once per hour.

Apart from that, how badly does IPIing CPUs affect your workloads?

> I was thinking maybe we could subdivide the vmap space into two regions
> with their own thresholds, but a task may allocate/vmap stuff while on a HK
> CPU and be moved to an isolated CPU afterwards, and also I still don't have
> any strong guarantee about what accesses an isolated CPU can do in its
> early entry code :(
> 
I agree it is not worth playing with the vmap space in terms of splitting it.

Sorry for the late answer, and thank you!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 22/30] context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry
  2025-01-14 17:51 ` [PATCH v4 22/30] context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry Valentin Schneider
@ 2025-01-22  0:22   ` Frederic Weisbecker
  2025-01-22  1:04     ` Sean Christopherson
  2025-01-27 11:17     ` Valentin Schneider
  0 siblings, 2 replies; 86+ messages in thread
From: Frederic Weisbecker @ 2025-01-22  0:22 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Paul E. McKenney, Jason Baron, Steven Rostedt,
	Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Tue, Jan 14, 2025 at 06:51:35PM +0100, Valentin Schneider wrote:
> ct_nmi_{enter, exit}() only touches the RCU watching counter and doesn't
> modify the actual CT state part of context_tracking.state. This means that
> upon receiving an IRQ when idle, the CT_STATE_IDLE->CT_STATE_KERNEL
> transition only happens in ct_idle_exit().
> 
> One can note that ct_nmi_enter() can only ever be entered with the CT state
> as either CT_STATE_KERNEL or CT_STATE_IDLE, as an IRQ/NMI happening in the
> CT_STATE_USER or CT_STATE_GUEST states will be routed down to ct_user_exit().

Are you sure? An NMI can fire between guest_state_enter_irqoff() and
__svm_vcpu_run(). And NMIs interrupting userspace don't call
enter_from_user_mode(). In fact they don't call irqentry_enter_from_user_mode()
like regular IRQs but irqentry_nmi_enter() instead. Well that's for archs
implementing common entry code, I can't speak for the others.

Unifying the behaviour between user and idle such that the IRQs/NMIs exit the
CT_STATE can be interesting but I fear this may not come for free. You would
need to save the old state on IRQ/NMI entry and restore it on exit.

Do we really need it?

Thanks.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 22/30] context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry
  2025-01-22  0:22   ` Frederic Weisbecker
@ 2025-01-22  1:04     ` Sean Christopherson
  2025-01-27 11:17     ` Valentin Schneider
  1 sibling, 0 replies; 86+ messages in thread
From: Sean Christopherson @ 2025-01-22  1:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Valentin Schneider, linux-kernel, x86, virtualization,
	linux-arm-kernel, loongarch, linux-riscv, linux-perf-users,
	xen-devel, kvm, linux-arch, rcu, linux-hardening, linux-mm,
	linux-kselftest, bpf, bcm-kernel-feedback-list, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Paolo Bonzini, Andy Lutomirski, Arnd Bergmann, Paul E. McKenney,
	Jason Baron, Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Juri Lelli,
	Clark Williams, Yair Podemsky, Tomas Glozar, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Kees Cook,
	Andrew Morton, Christoph Hellwig, Shuah Khan, Sami Tolvanen,
	Miguel Ojeda, Alice Ryhl, Mike Rapoport (Microsoft),
	Samuel Holland, Rong Xu, Nicolas Saenz Julienne,
	Geert Uytterhoeven, Yosry Ahmed, Kirill A. Shutemov,
	Masami Hiramatsu (Google), Jinghao Jia, Luis Chamberlain,
	Randy Dunlap, Tiezhu Yang

On Wed, Jan 22, 2025, Frederic Weisbecker wrote:
> On Tue, Jan 14, 2025 at 06:51:35PM +0100, Valentin Schneider wrote:
> > ct_nmi_{enter, exit}() only touches the RCU watching counter and doesn't
> > modify the actual CT state part of context_tracking.state. This means that
> > upon receiving an IRQ when idle, the CT_STATE_IDLE->CT_STATE_KERNEL
> > transition only happens in ct_idle_exit().
> > 
> > One can note that ct_nmi_enter() can only ever be entered with the CT state
> > as either CT_STATE_KERNEL or CT_STATE_IDLE, as an IRQ/NMI happening in the
> > CT_STATE_USER or CT_STATE_GUEST states will be routed down to ct_user_exit().
> 
> Are you sure? An NMI can fire between guest_state_enter_irqoff() and
> __svm_vcpu_run().

Heh, technically, they can't.  On SVM, KVM clears GIF prior to svm_vcpu_enter_exit(),
and restores GIF=1 only after it returns.  I.e. NMIs are fully blocked _on SVM_.

VMX unfortunately doesn't provide GIF, and so NMIs can arrive at any time.  It's
infeasible for software to prevent them, so we're stuck with that.  [In theory,
KVM could deliberately generate an NMI and not do IRET so that NMIs are blocked,
but that would be beyond crazy].

> And NMIs interrupting userspace don't call enter_from_user_mode(). In fact
> they don't call irqentry_enter_from_user_mode() like regular IRQs but
> irqentry_nmi_enter() instead. Well that's for archs implementing common entry
> code, I can't speak for the others.
> 
> Unifying the behaviour between user and idle such that the IRQs/NMIs exit the
> CT_STATE can be interesting but I fear this may not come for free. You would
> need to save the old state on IRQ/NMI entry and restore it on exit.
> 
> Do we really need it?
> 
> Thanks.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs
  2025-01-14 17:51 ` [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
  2025-01-14 21:13   ` Sean Christopherson
  2025-01-14 21:26   ` Sean Christopherson
@ 2025-01-24 10:48   ` K Prateek Nayak
  2 siblings, 0 replies; 86+ messages in thread
From: K Prateek Nayak @ 2025-01-24 10:48 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel, x86, virtualization,
	linux-arm-kernel, loongarch, linux-riscv, linux-perf-users,
	xen-devel, kvm, linux-arch, rcu, linux-hardening, linux-mm,
	linux-kselftest, bpf, bcm-kernel-feedback-list
  Cc: Peter Zijlstra, Nicolas Saenz Julienne, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Josh Poimboeuf, Pawan Gupta, Sean Christopherson, Paolo Bonzini,
	Andy Lutomirski, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Geert Uytterhoeven, Yosry Ahmed, Kirill A. Shutemov,
	Masami Hiramatsu (Google), Jinghao Jia, Luis Chamberlain,
	Randy Dunlap, Tiezhu Yang

Hello Valentin,

On 1/14/2025 11:21 PM, Valentin Schneider wrote:
> [..snip..]
> diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
> index c68245f8d77c5..931ade1dcbcc2 100644
> --- a/include/linux/context_tracking_work.h
> +++ b/include/linux/context_tracking_work.h
> @@ -5,12 +5,12 @@
>   #include <linux/bitops.h>
>   
>   enum {
> -	CT_WORK_n_OFFSET,
> +	CT_WORK_SYNC,

nit.

Shouldn't this be "CT_WORK_SYNC_OFFSET"?

I believe it gets fixed in Patch 29/30 when "CT_WORK_TLBI_OFFSET" is
added, but perhaps that can be moved here to preserve bisection (see the
sketch after the quoted hunk).

>   	CT_WORK_MAX_OFFSET
>   };
>   
>   enum ct_work {
> -	CT_WORK_n        = BIT(CT_WORK_n_OFFSET),
> +	CT_WORK_SYNC     = BIT(CT_WORK_SYNC_OFFSET),
>   	CT_WORK_MAX      = BIT(CT_WORK_MAX_OFFSET)
>   };
>   
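IOW, the intermediate state of this patch would read something like the
below (sketch only; same offset/bit pairing as the rest of the series,
BIT() from <linux/bits.h>):

enum {
	CT_WORK_SYNC_OFFSET,
	CT_WORK_MAX_OFFSET
};

enum ct_work {
	CT_WORK_SYNC	= BIT(CT_WORK_SYNC_OFFSET),
	CT_WORK_MAX	= BIT(CT_WORK_MAX_OFFSET)
};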

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-21 17:00               ` Uladzislau Rezki
@ 2025-01-24 15:22                 ` Valentin Schneider
  2025-01-27 10:36                   ` Uladzislau Rezki
  0 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-24 15:22 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Uladzislau Rezki, Jann Horn, linux-kernel, x86, virtualization,
	linux-arm-kernel, loongarch, linux-riscv, linux-perf-users,
	xen-devel, kvm, linux-arch, rcu, linux-hardening, linux-mm,
	linux-kselftest, bpf, bcm-kernel-feedback-list, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 21/01/25 18:00, Uladzislau Rezki wrote:
>> >
>> > As noted before, we defer flushing for vmalloc. We have a lazy threshold
>> > which can be exposed (if you need it) over sysfs for tuning. So, we can add it.
>> >
>>
>> In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
>> single userspace application that will never enter the kernel, unless
>> forced to by some interference (e.g. IPI sent from a housekeeping CPU).
>>
>> Increasing the lazy threshold would unfortunately only delay the
>> interference - housekeeping CPUs are free to run whatever, and so they will
>> eventually cause the lazy threshold to be hit and IPI all the CPUs,
>> including the isolated/NOHZ_FULL ones.
>>
> Do you have any testing results for your workload? I mean, how much can
> we potentially allocate? Again, maybe it is just enough to keep them
> backed up and offload them once per hour.
>

Potentially as much as you want... In our OpenShift environments, you can
get any sort of container executing on the housekeeping CPUs, and they're
free to do pretty much whatever they want. Per the CPU isolation setup,
however, they're not allowed/meant to disturb isolated CPUs.

> Apart from that, how badly does IPIing CPUs affect your workloads?
>

If I'm being pedantic, a single IPI to an isolated CPU breaks the
isolation. If we can't quiesce IPIs to isolated CPUs, then we can't
guarantee that whatever is running on the isolated CPUs is actually
isolated / shielded from third party interference.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-24 15:22                 ` Valentin Schneider
@ 2025-01-27 10:36                   ` Uladzislau Rezki
  0 siblings, 0 replies; 86+ messages in thread
From: Uladzislau Rezki @ 2025-01-27 10:36 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Uladzislau Rezki, Jann Horn, linux-kernel, x86, virtualization,
	linux-arm-kernel, loongarch, linux-riscv, linux-perf-users,
	xen-devel, kvm, linux-arch, rcu, linux-hardening, linux-mm,
	linux-kselftest, bpf, bcm-kernel-feedback-list, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Fri, Jan 24, 2025 at 04:22:19PM +0100, Valentin Schneider wrote:
> On 21/01/25 18:00, Uladzislau Rezki wrote:
> >> >
> >> > As noted before, we defer flushing for vmalloc. We have a lazy threshold
> >> > which can be exposed (if you need it) over sysfs for tuning. So, we can add it.
> >> >
> >>
> >> In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
> >> single userspace application that will never enter the kernel, unless
> >> forced to by some interference (e.g. IPI sent from a housekeeping CPU).
> >>
> >> Increasing the lazy threshold would unfortunately only delay the
> >> interference - housekeeping CPUs are free to run whatever, and so they will
> >> eventually cause the lazy threshold to be hit and IPI all the CPUs,
> >> including the isolated/NOHZ_FULL ones.
> >>
> > Do you have any testing results for your workload? I mean, how much can
> > we potentially allocate? Again, maybe it is just enough to keep them
> > backed up and offload them once per hour.
> >
> 
> Potentially as much as you want... In our OpenShift environments, you can
> get any sort of container executing on the housekeeping CPUs, and they're
> free to do pretty much whatever they want. Per the CPU isolation setup,
> however, they're not allowed/meant to disturb isolated CPUs.
> 
> > Apart from that, how badly does IPIing CPUs affect your workloads?
> >
> 
> If I'm being pedantic, a single IPI to an isolated CPU breaks the
> isolation. If we can't quiesce IPIs to isolated CPUs, then we can't
> guarantee that whatever is running on the isolated CPUs is actually
> isolated / shielded from third party interference.
> 
I see. I thought you were fixing some issue. I do not see a
straightforward way to remove such "distortion". Probably we could block
the range which we defer for flushing, but that can also be problematic
because of other constraints.

Thanks!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 22/30] context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry
  2025-01-22  0:22   ` Frederic Weisbecker
  2025-01-22  1:04     ` Sean Christopherson
@ 2025-01-27 11:17     ` Valentin Schneider
  2025-02-07 17:06       ` Valentin Schneider
  1 sibling, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-01-27 11:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Paul E. McKenney, Jason Baron, Steven Rostedt,
	Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 22/01/25 01:22, Frederic Weisbecker wrote:
> On Tue, Jan 14, 2025 at 06:51:35PM +0100, Valentin Schneider wrote:
>> ct_nmi_{enter, exit}() only touches the RCU watching counter and doesn't
>> modify the actual CT state part of context_tracking.state. This means that
>> upon receiving an IRQ when idle, the CT_STATE_IDLE->CT_STATE_KERNEL
>> transition only happens in ct_idle_exit().
>>
>> One can note that ct_nmi_enter() can only ever be entered with the CT state
>> as either CT_STATE_KERNEL or CT_STATE_IDLE, as an IRQ/NMI happening in the
>> CT_STATE_USER or CT_STATE_GUEST states will be routed down to ct_user_exit().
>
> Are you sure? An NMI can fire between guest_state_enter_irqoff() and
> __svm_vcpu_run().

Urgh, you're quite right.

> And NMIs interrupting userspace don't call
> enter_from_user_mode(). In fact they don't call irqentry_enter_from_user_mode()
> like regular IRQs but irqentry_nmi_enter() instead. Well that's for archs
> implementing common entry code, I can't speak for the others.
>

That I didn't realize, so thank you for pointing it out. Having another
look now, I mistook DEFINE_IDTENTRY_RAW(exc_int3) for the general case
when it really isn't :(

> Unifying the behaviour between user and idle such that the IRQs/NMIs exit the
> CT_STATE can be interesting but I fear this may not come for free. You would
> need to save the old state on IRQ/NMI entry and restore it on exit.
>

That's what I tried to avoid, but it sounds like there's no nice way around it.

> Do we really need it?
>

Well, my problem with not doing IDLE->KERNEL transitions on IRQ/NMI is that
this leads the IPI deferral logic to observe a technically-out-of-sync state
for remote CPUs. Consider:

  CPUx            CPUy
                    state := CT_STATE_IDLE
                    ...
                    ~>IRQ
                    ...
                    ct_nmi_enter()
                    [in the kernel proper by now]

  text_poke_bp_batch()
    ct_set_cpu_work(CPUy, CT_WORK_SYNC)
      READ CPUy ct->state
      `-> CT_IDLE_STATE
      `-> defer IPI


I thought this meant I would need to throw out the "defer IPIs if CPU is
idle" part, but AIUI this also affects CT_STATE_USER and CT_STATE_GUEST,
which is a bummer :(


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-17 15:52       ` Jann Horn
  2025-01-17 16:53         ` Valentin Schneider
@ 2025-01-27 15:51         ` Will Deacon
  2025-02-10 18:36         ` Valentin Schneider
  2 siblings, 0 replies; 86+ messages in thread
From: Will Deacon @ 2025-01-27 15:51 UTC (permalink / raw)
  To: Jann Horn
  Cc: Valentin Schneider, linux-kernel, x86, virtualization,
	linux-arm-kernel, loongarch, linux-riscv, linux-perf-users,
	xen-devel, kvm, linux-arch, rcu, linux-hardening, linux-mm,
	linux-kselftest, bpf, bcm-kernel-feedback-list, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Russell King, Catalin Marinas,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Fri, Jan 17, 2025 at 04:52:19PM +0100, Jann Horn wrote:
> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
> > On 14/01/25 19:16, Jann Horn wrote:
> > > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> > >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> > >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> > >> flush_tlb_kernel_range() IPIs.
> > >>
> > >> Given that CPUs executing in userspace do not access data in the vmalloc
> > >> range, these IPIs could be deferred until their next kernel entry.
> > >>
> > >> Deferral vs early entry danger zone
> > >> ===================================
> > >>
> > >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> > >> and then accessed in early entry code.
> > >
> > > In other words, it needs a guarantee that no vmalloc allocations that
> > > have been created in the vmalloc region while the CPU was idle can
> > > then be accessed during early entry, right?
> >
> > I'm not sure if that would be a problem (not an mm expert, please do
> > correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> > deferred anyway.
> 
> flush_cache_vmap() is about stuff like flushing data caches on
> architectures with virtually indexed caches; that doesn't do TLB
> maintenance. When you look for its definition on x86 or arm64, you'll
> see that they use the generic implementation which is simply an empty
> inline function.
> 
> > So after vmapping something, I wouldn't expect isolated CPUs to have
> > invalid TLB entries for the newly vmapped page.
> >
> > However, upon vunmap'ing something, the TLB flush is deferred, and thus
> > stale TLB entries can and will remain on isolated CPUs, up until they
> > execute the deferred flush themselves (IOW for the entire duration of the
> > "danger zone").
> >
> > Does that make sense?
> 
> The design idea wrt TLB flushes in the vmap code is that you don't do
> TLB flushes when you unmap stuff or when you map stuff, because doing
> TLB flushes across the entire system on every vmap/vunmap would be a
> bit costly; instead you just do batched TLB flushes in between, in
> __purge_vmap_area_lazy().
> 
> In other words, the basic idea is that you can keep calling vmap() and
> vunmap() a bunch of times without ever doing TLB flushes until you run
> out of virtual memory in the vmap region; then you do one big TLB
> flush, and afterwards you can reuse the free virtual address space for
> new allocations again.
> 
> So if you "defer" that batched TLB flush for CPUs that are not
> currently running in the kernel, I think the consequence is that those
> CPUs may end up with incoherent TLB state after a reallocation of the
> virtual address space.
> 
> Actually, I think this would mean that your optimization is disallowed
> at least on arm64 - I'm not sure about the exact wording, but arm64
> has a "break before make" rule that forbids conflicting writable
> address translations or something like that.

Yes, that would definitely be a problem. There's also the more obvious
issue that the CnP ("Common not Private") feature of some Arm CPUs means
that TLB entries can be shared between cores, so the whole idea of using
a CPU's exception level to predicate invalidation is flawed on such a
system.

Will

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 22/30] context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry
  2025-01-27 11:17     ` Valentin Schneider
@ 2025-02-07 17:06       ` Valentin Schneider
  2025-02-07 18:37         ` Frederic Weisbecker
  0 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-02-07 17:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Paul E. McKenney, Jason Baron, Steven Rostedt,
	Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 27/01/25 12:17, Valentin Schneider wrote:
> On 22/01/25 01:22, Frederic Weisbecker wrote:
>> And NMIs interrupting userspace don't call
>> enter_from_user_mode(). In fact they don't call irqentry_enter_from_user_mode()
>> like regular IRQs but irqentry_nmi_enter() instead. Well that's for archs
>> implementing common entry code, I can't speak for the others.
>>
>
> That I didn't realize, so thank you for pointing it out. Having another
> look now, I mistook DEFINE_IDTENTRY_RAW(exc_int3) for the general case
> when it really isn't :(
>
>> Unifying the behaviour between user and idle such that the IRQs/NMIs exit the
>> CT_STATE can be interesting but I fear this may not come for free. You would
>> need to save the old state on IRQ/NMI entry and restore it on exit.
>>
>
> That's what I tried to avoid, but it sounds like there's no nice way around it.
>
>> Do we really need it?
>>
>
> Well, my problem with not doing IDLE->KERNEL transitions on IRQ/NMI is that
> this leads the IPI deferral logic to observe a technically-out-of-sync state
> for remote CPUs. Consider:
>
>   CPUx            CPUy
>                     state := CT_STATE_IDLE
>                     ...
>                     ~>IRQ
>                     ...
>                     ct_nmi_enter()
>                     [in the kernel proper by now]
>
>   text_poke_bp_batch()
>     ct_set_cpu_work(CPUy, CT_WORK_SYNC)
>       READ CPUy ct->state
>       `-> CT_STATE_IDLE
>       `-> defer IPI
>
>
> I thought this meant I would need to throw out the "defer IPIs if CPU is
> idle" part, but AIUI this also affects CT_STATE_USER and CT_STATE_GUEST,
> which is a bummer :(

Soooo I've been thinking...

Isn't

  (context_tracking.state & CT_RCU_WATCHING)

pretty much a proxy for knowing whether a CPU is executing in kernelspace,
including NMIs?

NMI interrupts userspace/VM/idle -> ct_nmi_enter()   -> it becomes true
IRQ interrupts idle              -> ct_irq_enter()   -> it becomes true
IRQ interrupts userspace         -> __ct_user_exit() -> it becomes true
IRQ interrupts VM                -> __ct_user_exit() -> it becomes true

IOW, if I gate setting deferred work by checking for this instead of
explicitly CT_STATE_KERNEL, "it should work" and prevent the
aforementioned issue? Or should I be out drinking instead? :-)
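
Concretely, something along these lines in the work-setting path - very much
a sketch, with the surrounding structure and the exact encoding of the work
bits being whatever this series ends up with:

  static bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
  {
          struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
          int old = atomic_read(&ct->state);

          /*
           * RCU watching means the CPU is in the kernel proper, including an
           * IRQ/NMI that interrupted user/guest/idle: don't defer, make the
           * caller send the IPI.
           */
          if (old & CT_RCU_WATCHING)
                  return false;

          /* Otherwise queue the work; it runs on the next kernel entry. */
          return atomic_try_cmpxchg(&ct->state, &old, old | work);
  }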


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 22/30] context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry
  2025-02-07 17:06       ` Valentin Schneider
@ 2025-02-07 18:37         ` Frederic Weisbecker
  2025-02-10 17:36           ` Valentin Schneider
  0 siblings, 1 reply; 86+ messages in thread
From: Frederic Weisbecker @ 2025-02-07 18:37 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Paul E. McKenney, Jason Baron, Steven Rostedt,
	Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Fri, Feb 07, 2025 at 06:06:45PM +0100, Valentin Schneider wrote:
> On 27/01/25 12:17, Valentin Schneider wrote:
> > On 22/01/25 01:22, Frederic Weisbecker wrote:
> >> And NMIs interrupting userspace don't call
> >> enter_from_user_mode(). In fact they don't call irqentry_enter_from_user_mode()
> >> like regular IRQs but irqentry_nmi_enter() instead. Well that's for archs
> >> implementing common entry code, I can't speak for the others.
> >>
> >
> > That I didn't realize, so thank you for pointing it out. Having another
> > look now, I mistook DEFINE_IDTENTRY_RAW(exc_int3) for the general case
> > when it really isn't :(
> >
> >> Unifying the behaviour between user and idle such that the IRQs/NMIs exit the
> >> CT_STATE can be interesting but I fear this may not come for free. You would
> >> need to save the old state on IRQ/NMI entry and restore it on exit.
> >>
> >
> > That's what I tried to avoid, but it sounds like there's no nice way around it.
> >
> >> Do we really need it?
> >>
> >
> > Well, my problem with not doing IDLE->KERNEL transitions on IRQ/NMI is that
> > this leads the IPI deferral logic to observe a technically-out-of-sync state
> > for remote CPUs. Consider:
> >
> >   CPUx            CPUy
> >                     state := CT_STATE_IDLE
> >                     ...
> >                     ~>IRQ
> >                     ...
> >                     ct_nmi_enter()
> >                     [in the kernel proper by now]
> >
> >   text_poke_bp_batch()
> >     ct_set_cpu_work(CPUy, CT_WORK_SYNC)
> >       READ CPUy ct->state
> >       `-> CT_STATE_IDLE
> >       `-> defer IPI
> >
> >
> > I thought this meant I would need to throw out the "defer IPIs if CPU is
> > idle" part, but AIUI this also affects CT_STATE_USER and CT_STATE_GUEST,
> > which is a bummer :(
> 
> Soooo I've been thinking...
> 
> Isn't
> 
>   (context_tracking.state & CT_RCU_WATCHING)
> 
> pretty much a proxy for knowing whether a CPU is executing in kernelspace,
> including NMIs?

You got it!

> 
> NMI interrupts userspace/VM/idle -> ct_nmi_enter()   -> it becomes true
> IRQ interrupts idle              -> ct_irq_enter()   -> it becomes true
> IRQ interrupts userspace         -> __ct_user_exit() -> it becomes true
> IRQ interrupts VM                -> __ct_user_exit() -> it becomes true
> 
> IOW, if I gate setting deferred work by checking for this instead of
> explicitly CT_STATE_KERNEL, "it should work" and prevent the
> aforementioned issue? Or should I be out drinking instead? :-)

Exactly it should work! Now that doesn't mean you can't go out
for a drink :-)

Thanks.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 22/30] context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry
  2025-02-07 18:37         ` Frederic Weisbecker
@ 2025-02-10 17:36           ` Valentin Schneider
  0 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-02-10 17:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Paul E. McKenney, Jason Baron, Steven Rostedt,
	Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 07/02/25 19:37, Frederic Weisbecker wrote:
> Le Fri, Feb 07, 2025 at 06:06:45PM +0100, Valentin Schneider a écrit :
>>
>> Soooo I've been thinking...
>>
>> Isn't
>>
>>   (context_tracking.state & CT_RCU_WATCHING)
>>
>> pretty much a proxy for knowing whether a CPU is executing in kernelspace,
>> including NMIs?
>
> You got it!
>

Yay!

>>
>> NMI interrupts userspace/VM/idle -> ct_nmi_enter()   -> it becomes true
>> IRQ interrupts idle              -> ct_irq_enter()   -> it becomes true
>> IRQ interrupts userspace         -> __ct_user_exit() -> it becomes true
>> IRQ interrupts VM                -> __ct_user_exit() -> it becomes true
>>
>> IOW, if I gate setting deferred work by checking for this instead of
>> explicitly CT_STATE_KERNEL, "it should work" and prevent the
>> aforementioned issue? Or should I be out drinking instead? :-)
>
> Exactly it should work! Now that doesn't mean you can't go out
> for a drink :-)
>

Well, drinks were had very shortly after sending this email :D

> Thanks.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-17 15:52       ` Jann Horn
  2025-01-17 16:53         ` Valentin Schneider
  2025-01-27 15:51         ` Will Deacon
@ 2025-02-10 18:36         ` Valentin Schneider
  2025-02-10 22:08           ` Jann Horn
  2 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-02-10 18:36 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 17/01/25 16:52, Jann Horn wrote:
> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> On 14/01/25 19:16, Jann Horn wrote:
>> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> flush_tlb_kernel_range() IPIs.
>> >>
>> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> range, these IPIs could be deferred until their next kernel entry.
>> >>
>> >> Deferral vs early entry danger zone
>> >> ===================================
>> >>
>> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> and then accessed in early entry code.
>> >
>> > In other words, it needs a guarantee that no vmalloc allocations that
>> > have been created in the vmalloc region while the CPU was idle can
>> > then be accessed during early entry, right?
>>
>> I'm not sure if that would be a problem (not an mm expert, please do
>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> deferred anyway.
>
> flush_cache_vmap() is about stuff like flushing data caches on
> architectures with virtually indexed caches; that doesn't do TLB
> maintenance. When you look for its definition on x86 or arm64, you'll
> see that they use the generic implementation which is simply an empty
> inline function.
>
>> So after vmapping something, I wouldn't expect isolated CPUs to have
>> invalid TLB entries for the newly vmapped page.
>>
>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> stale TLB entries can and will remain on isolated CPUs, up until they
>> execute the deferred flush themselves (IOW for the entire duration of the
>> "danger zone").
>>
>> Does that make sense?
>
> The design idea wrt TLB flushes in the vmap code is that you don't do
> TLB flushes when you unmap stuff or when you map stuff, because doing
> TLB flushes across the entire system on every vmap/vunmap would be a
> bit costly; instead you just do batched TLB flushes in between, in
> __purge_vmap_area_lazy().
>
> In other words, the basic idea is that you can keep calling vmap() and
> vunmap() a bunch of times without ever doing TLB flushes until you run
> out of virtual memory in the vmap region; then you do one big TLB
> flush, and afterwards you can reuse the free virtual address space for
> new allocations again.
>
> So if you "defer" that batched TLB flush for CPUs that are not
> currently running in the kernel, I think the consequence is that those
> CPUs may end up with incoherent TLB state after a reallocation of the
> virtual address space.
>
> Actually, I think this would mean that your optimization is disallowed
> at least on arm64 - I'm not sure about the exact wording, but arm64
> has a "break before make" rule that forbids conflicting writable
> address translations or something like that.
>
> (I said "until you run out of virtual memory in the vmap region", but
> that's not actually true - see the comment above lazy_max_pages() for
> an explanation of the actual heuristic. You might be able to tune that
> a bit if you'd be significantly happier with less frequent
> interruptions, or something along those lines.)

I've been thinking some more (this is your cue to grab a brown paper bag)...

Experimentation (unmap the whole VMALLOC range upon return to userspace,
see what explodes upon entry into the kernel) suggests that the early entry
"danger zone" should only access the vmaped stack, which itself isn't an
issue.

That is obviously just a test on one system configuration, and the problem
I'm facing is trying to put in place /some/ form of instrumentation that would
at the very least cause a warning for any future patch that would introduce
a vmap'd access in early entry code. That, or a complete mitigation that
prevents those accesses altogether.

What if isolated CPUs unconditionally did a TLBi as late as possible in
the stack right before returning to userspace? This would mean that upon
re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel
range translation - with the exception of whatever lies between the
last-minute flush and the actual userspace entry, which should be feasible
to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI
could be entirely silenced if it targets an isolated CPU.
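
To illustrate, roughly what I have in mind - a purely hypothetical sketch;
the hook point and function naming are made up, though AFAICT
cpu_is_isolated() and __flush_tlb_all() are existing helpers (the latter
being x86's full local TLB flush, global/kernel entries included):

  static __always_inline void isolated_exit_to_user_flush(void)
  {
          if (!cpu_is_isolated(raw_smp_processor_id()))
                  return;

          /*
           * Drop every kernel translation this CPU holds; anything re-cached
           * between here and the actual return to userspace would have to be
           * vetted.
           */
          __flush_tlb_all();
  }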


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-10 18:36         ` Valentin Schneider
@ 2025-02-10 22:08           ` Jann Horn
  2025-02-11 13:33             ` Valentin Schneider
  0 siblings, 1 reply; 86+ messages in thread
From: Jann Horn @ 2025-02-10 22:08 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Mon, Feb 10, 2025 at 7:36 PM Valentin Schneider <vschneid@redhat.com> wrote:
> What if isolated CPUs unconditionally did a TLBi as late as possible in
> the stack right before returning to userspace? This would mean that upon
> re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel
> range translation - with the exception of whatever lies between the
> last-minute flush and the actual userspace entry, which should be feasible
> to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI
> could be entirely silenced if it targets an isolated CPU.

Two issues with that:

1. I think the "Common not Private" feature Will Deacon referred to is
incompatible with this idea:
<https://developer.arm.com/documentation/101811/0104/Address-spaces/Common-not-Private>
says "When the CnP bit is set, the software promises to use the ASIDs
and VMIDs in the same way on all processors, which allows the TLB
entries that are created by one processor to be used by another"

2. It's wrong to assume that TLB entries are only populated for
addresses you access - thanks to speculative execution, you have to
assume that the CPU might be populating random TLB entries all over
the place.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-10 22:08           ` Jann Horn
@ 2025-02-11 13:33             ` Valentin Schneider
  2025-02-11 14:03               ` Mark Rutland
  2025-02-11 14:22               ` Dave Hansen
  0 siblings, 2 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-02-11 13:33 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 10/02/25 23:08, Jann Horn wrote:
> On Mon, Feb 10, 2025 at 7:36 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> What if isolated CPUs unconditionally did a TLBi as late as possible in
>> the stack right before returning to userspace? This would mean that upon
>> re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel
>> range translation - with the exception of whatever lies between the
>> last-minute flush and the actual userspace entry, which should be feasible
>> to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI
>> could be entirely silenced if it targets an isolated CPU.
>
> Two issues with that:
>

Firstly, thank you for entertaining the idea :-)

> 1. I think the "Common not Private" feature Will Deacon referred to is
> incompatible with this idea:
> <https://developer.arm.com/documentation/101811/0104/Address-spaces/Common-not-Private>
> says "When the CnP bit is set, the software promises to use the ASIDs
> and VMIDs in the same way on all processors, which allows the TLB
> entries that are created by one processor to be used by another"
>


Sorry for being obtuse - I can understand inconsistent TLB states (old vs
new translations being present in separate TLBs) due to not sending the
flush IPI causing an issue with that, but not "flushing early". Even if TLB
entries can be shared/accessed between CPUs, a CPU should be allowed not to
have a shared entry in its TLB - what am I missing?

> 2. It's wrong to assume that TLB entries are only populated for
> addresses you access - thanks to speculative execution, you have to
> assume that the CPU might be populating random TLB entries all over
> the place.

Gotta love speculation. Now it is supposed to be limited to genuinely
accessible data & code, right? Say theoretically we have a full TLBi as
literally the last thing before doing the return-to-userspace, speculation
should be limited to executing maybe bits of the return-from-userspace
code?

Furthermore, I would hope that once a CPU is executing in userspace, it's
not going to populate the TLB with kernel address translations - AIUI the
whole vulnerability mitigation debacle was about preventing this sort of
thing.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-11 13:33             ` Valentin Schneider
@ 2025-02-11 14:03               ` Mark Rutland
  2025-02-11 16:09                 ` Valentin Schneider
  2025-02-11 14:22               ` Dave Hansen
  1 sibling, 1 reply; 86+ messages in thread
From: Mark Rutland @ 2025-02-11 14:03 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Jann Horn, linux-kernel, x86, virtualization, linux-arm-kernel,
	loongarch, linux-riscv, linux-perf-users, xen-devel, kvm,
	linux-arch, rcu, linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Josh Poimboeuf, Pawan Gupta, Sean Christopherson, Paolo Bonzini,
	Andy Lutomirski, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Tue, Feb 11, 2025 at 02:33:51PM +0100, Valentin Schneider wrote:
> On 10/02/25 23:08, Jann Horn wrote:
> > On Mon, Feb 10, 2025 at 7:36 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> What if isolated CPUs unconditionally did a TLBi as late as possible in
> >> the stack right before returning to userspace? This would mean that upon
> >> re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel
> >> range translation - with the exception of whatever lies between the
> >> last-minute flush and the actual userspace entry, which should be feasible
> >> to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI
> >> could be entirely silenced if it targets an isolated CPU.
> >
> > Two issues with that:
> 
> Firstly, thank you for entertaining the idea :-)
> 
> > 1. I think the "Common not Private" feature Will Deacon referred to is
> > incompatible with this idea:
> > <https://developer.arm.com/documentation/101811/0104/Address-spaces/Common-not-Private>
> > says "When the CnP bit is set, the software promises to use the ASIDs
> > and VMIDs in the same way on all processors, which allows the TLB
> > entries that are created by one processor to be used by another"
> 
> Sorry for being obtuse - I can understand inconsistent TLB states (old vs
> new translations being present in separate TLBs) due to not sending the
> flush IPI causing an issue with that, but not "flushing early". Even if TLB
> entries can be shared/accessed between CPUs, a CPU should be allowed not to
> have a shared entry in its TLB - what am I missing?
> 
> > 2. It's wrong to assume that TLB entries are only populated for
> > addresses you access - thanks to speculative execution, you have to
> > assume that the CPU might be populating random TLB entries all over
> > the place.
> 
> Gotta love speculation. Now it is supposed to be limited to genuinely
> accessible data & code, right? Say theoretically we have a full TLBi as
> literally the last thing before doing the return-to-userspace, speculation
> should be limited to executing maybe bits of the return-from-userspace
> code?

I think it's easier to ignore speculation entirely, and just assume that
the MMU can arbitrarily fill TLB entries from any page table entries
which are valid/accessible in the active page tables. Hardware
prefetchers can do that regardless of the specific path of speculative
execution.

Thus TLB fills are not limited to VAs which would be used on that
return-to-userspace path.

> Furthermore, I would hope that once a CPU is executing in userspace, it's
> not going to populate the TLB with kernel address translations - AIUI the
> whole vulnerability mitigation debacle was about preventing this sort of
> thing.

The CPU can definitely do that; the vulnerability mitigations are all
about what userspace can observe rather than what the CPU can do in the
background. Additionally, there are features like SPE and TRBE that use
kernel addresses while the CPU is executing userspace instructions.

The latest ARM Architecture Reference Manual (ARM DDI 0487 L.a) is fairly clear
about that in section D8.16 "Translation Lookaside Buffers", where it says
(among other things):

  When address translation is enabled, if a translation table entry
  meets all of the following requirements, then that translation table
  entry is permitted to be cached in a TLB or intermediate TLB caching
  structure at any time:
  • The translation table entry itself does not generate a Translation
    fault, an Address size fault, or an Access flag fault.
  • The translation table entry is not from a translation regime
    configured by an Exception level that is lower than the current
    Exception level.

Here "permitted to be cached in a TLB" also implies that the HW is
allowed to fetch the translation table entry (which is the ARM term for a
page table entry).

The PDF can be found at:

  https://developer.arm.com/documentation/ddi0487/la/?lang=en

Mark.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-11 13:33             ` Valentin Schneider
  2025-02-11 14:03               ` Mark Rutland
@ 2025-02-11 14:22               ` Dave Hansen
  2025-02-11 16:10                 ` Valentin Schneider
  2025-02-18 22:40                 ` Valentin Schneider
  1 sibling, 2 replies; 86+ messages in thread
From: Dave Hansen @ 2025-02-11 14:22 UTC (permalink / raw)
  To: Valentin Schneider, Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 2/11/25 05:33, Valentin Schneider wrote:
>> 2. It's wrong to assume that TLB entries are only populated for
>> addresses you access - thanks to speculative execution, you have to
>> assume that the CPU might be populating random TLB entries all over
>> the place.
> Gotta love speculation. Now it is supposed to be limited to genuinely
> accessible data & code, right? Say theoretically we have a full TLBi as
> literally the last thing before doing the return-to-userspace, speculation
> should be limited to executing maybe bits of the return-from-userspace
> code?

In practice, it's mostly limited like that.

Architecturally, there are no promises from the CPU. It is within its
rights to cache anything from the page tables at any time. If it's in
the CR3 tree, it's fair game.

> Furthermore, I would hope that once a CPU is executing in userspace, it's
> not going to populate the TLB with kernel address translations - AIUI the
> whole vulnerability mitigation debacle was about preventing this sort of
> thing.

Nope, unfortunately. There are two big exceptions to this. First, "implicit
supervisor-mode accesses". There are structures for which the CPU gets a
virtual address and accesses it even while userspace is running. The LDT
and GDT are the most obvious examples, but there are some less
ubiquitous ones like the buffers for PEBS events.

Second, remember that user versus supervisor is determined *BY* the page
tables. Before Linear Address Space Separation (LASS), all virtual
memory accesses walk the page tables, even userspace accesses to kernel
addresses.  The User/Supervisor bit is *in* the page tables, of course.

A userspace access to a kernel address results in a page walk and the
CPU is completely free to cache all or part of that page walk. A
Meltdown-style _speculative_ userspace access to kernel memory won't
generate a fault either. It won't leak data like it used to, of course,
but it can still walk the page tables. That's one reason LASS is needed.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-11 14:03               ` Mark Rutland
@ 2025-02-11 16:09                 ` Valentin Schneider
  0 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-02-11 16:09 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Jann Horn, linux-kernel, x86, virtualization, linux-arm-kernel,
	loongarch, linux-riscv, linux-perf-users, xen-devel, kvm,
	linux-arch, rcu, linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, Boris Ostrovsky,
	Josh Poimboeuf, Pawan Gupta, Sean Christopherson, Paolo Bonzini,
	Andy Lutomirski, Arnd Bergmann, Frederic Weisbecker,
	Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
	Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 11/02/25 14:03, Mark Rutland wrote:
> On Tue, Feb 11, 2025 at 02:33:51PM +0100, Valentin Schneider wrote:
>> On 10/02/25 23:08, Jann Horn wrote:
>> > 2. It's wrong to assume that TLB entries are only populated for
>> > addresses you access - thanks to speculative execution, you have to
>> > assume that the CPU might be populating random TLB entries all over
>> > the place.
>>
>> Gotta love speculation. Now it is supposed to be limited to genuinely
>> accessible data & code, right? Say theoretically we have a full TLBi as
>> literally the last thing before doing the return-to-userspace, speculation
>> should be limited to executing maybe bits of the return-from-userspace
>> code?
>
> I think it's easier to ignore speculation entirely, and just assume that
> the MMU can arbitrarily fill TLB entries from any page table entries
> which are valid/accessible in the active page tables. Hardware
> prefetchers can do that regardless of the specific path of speculative
> execution.
>
> Thus TLB fills are not limited to VAs which would be used on that
> return-to-userspace path.
>
>> Furthermore, I would hope that once a CPU is executing in userspace, it's
>> not going to populate the TLB with kernel address translations - AIUI the
>> whole vulnerability mitigation debacle was about preventing this sort of
>> thing.
>
> The CPU can definitely do that; the vulnerability mitigations are all
> about what userspace can observe rather than what the CPU can do in the
> background. Additionally, there are features like SPE and TRBE that use
> kernel addresses while the CPU is executing userspace instructions.
>
> The latest ARM Architecture Reference Manual (ARM DDI 0487 L.a) is fairly clear
> about that in section D8.16 "Translation Lookaside Buffers", where it says
> (among other things):
>
>   When address translation is enabled, if a translation table entry
>   meets all of the following requirements, then that translation table
>   entry is permitted to be cached in a TLB or intermediate TLB caching
>   structure at any time:
>   • The translation table entry itself does not generate a Translation
>     fault, an Address size fault, or an Access flag fault.
>   • The translation table entry is not from a translation regime
>     configured by an Exception level that is lower than the current
>     Exception level.
>
> Here "permitted to be cached in a TLB" also implies that the HW is
> allowed to fetch the translation table entry (which is the ARM term for a
> page table entry).
>

That's actually fairly clear all things considered, thanks for the
education and for fishing out the relevant DDI section!

> The PDF can be found at:
>
>   https://developer.arm.com/documentation/ddi0487/la/?lang=en
>
> Mark.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-11 14:22               ` Dave Hansen
@ 2025-02-11 16:10                 ` Valentin Schneider
  2025-02-18 22:40                 ` Valentin Schneider
  1 sibling, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-02-11 16:10 UTC (permalink / raw)
  To: Dave Hansen, Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 11/02/25 06:22, Dave Hansen wrote:
> On 2/11/25 05:33, Valentin Schneider wrote:
>>> 2. It's wrong to assume that TLB entries are only populated for
>>> addresses you access - thanks to speculative execution, you have to
>>> assume that the CPU might be populating random TLB entries all over
>>> the place.
>> Gotta love speculation. Now it is supposed to be limited to genuinely
>> accessible data & code, right? Say theoretically we have a full TLBi as
>> literally the last thing before doing the return-to-userspace, speculation
>> should be limited to executing maybe bits of the return-from-userspace
>> code?
>
> In practice, it's mostly limited like that.
>
> Architecturally, there are no promises from the CPU. It is within its
> rights to cache anything from the page tables at any time. If it's in
> the CR3 tree, it's fair game.
>
>> Furthermore, I would hope that once a CPU is executing in userspace, it's
>> not going to populate the TLB with kernel address translations - AIUI the
>> whole vulnerability mitigation debacle was about preventing this sort of
>> thing.
>
> Nope, unfortunately. There are two big exceptions to this. First, "implicit
> supervisor-mode accesses". There are structures for which the CPU gets a
> virtual address and accesses it even while userspace is running. The LDT
> and GDT are the most obvious examples, but there are some less
> ubiquitous ones like the buffers for PEBS events.
>
> Second, remember that user versus supervisor is determined *BY* the page
> tables. Before Linear Address Space Separation (LASS), all virtual
> memory accesses walk the page tables, even userspace accesses to kernel
> addresses.  The User/Supervisor bit is *in* the page tables, of course.
>
> A userspace access to a kernel address results in a page walk and the
> CPU is completely free to cache all or part of that page walk. A
> Meltdown-style _speculative_ userspace access to kernel memory won't
> generate a fault either. It won't leak data like it used to, of course,
> but it can still walk the page tables. That's one reason LASS is needed.

Bummer, now I have at least two architectures proving me wrong :-) Thank
you as well for the education, I really appreciate it.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-11 14:22               ` Dave Hansen
  2025-02-11 16:10                 ` Valentin Schneider
@ 2025-02-18 22:40                 ` Valentin Schneider
  2025-02-19  0:39                   ` Dave Hansen
  1 sibling, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-02-18 22:40 UTC (permalink / raw)
  To: Dave Hansen, Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 11/02/25 06:22, Dave Hansen wrote:
> On 2/11/25 05:33, Valentin Schneider wrote:
>>> 2. It's wrong to assume that TLB entries are only populated for
>>> addresses you access - thanks to speculative execution, you have to
>>> assume that the CPU might be populating random TLB entries all over
>>> the place.
>> Gotta love speculation. Now it is supposed to be limited to genuinely
>> accessible data & code, right? Say theoretically we have a full TLBi as
>> literally the last thing before doing the return-to-userspace, speculation
>> should be limited to executing maybe bits of the return-from-userspace
>> code?
>
> In practice, it's mostly limited like that.
>
> Architecturally, there are no promises from the CPU. It is within its
> rights to cache anything from the page tables at any time. If it's in
> the CR3 tree, it's fair game.
>

So what if the VMEMMAP range *isn't* in the CR3 tree when a CPU is
executing in userspace?

AIUI that's the case with kPTI - the remaining kernel pages should mostly
be .entry.text and cpu_entry_area, at least for x86.

It sounds like it wouldn't do much for arm64 though, if with CnP a CPU
executing in userspace (with the user/trampoline page table installed) can
still use TLB entries created by another CPU executing in kernelspace with
the kernel page table installed.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-18 22:40                 ` Valentin Schneider
@ 2025-02-19  0:39                   ` Dave Hansen
  2025-02-19 15:13                     ` Valentin Schneider
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Hansen @ 2025-02-19  0:39 UTC (permalink / raw)
  To: Valentin Schneider, Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 2/18/25 14:40, Valentin Schneider wrote:
>> In practice, it's mostly limited like that.
>>
>> Architecturally, there are no promises from the CPU. It is within its
>> rights to cache anything from the page tables at any time. If it's in
>> the CR3 tree, it's fair game.
>>
> So what if the VMEMMAP range *isn't* in the CR3 tree when a CPU is
> executing in userspace?
> 
> AIUI that's the case with kPTI - the remaining kernel pages should mostly
> be .entry.text and cpu_entry_area, at least for x86.

Having part of VMEMMAP not in the CR3 tree should be harmless while
running userspace. VMEMMAP is a purely software structure; the hardware
doesn't do implicit supervisor accesses to it. It's also not needed in
super early entry.

Maybe I missed part of the discussion though. Is VMEMMAP your only
concern? I would have guessed that the more generic vmalloc()
functionality would be harder to pin down.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-01-17 16:53         ` Valentin Schneider
@ 2025-02-19 15:05           ` Joel Fernandes
  2025-02-19 16:18             ` Valentin Schneider
  0 siblings, 1 reply; 86+ messages in thread
From: Joel Fernandes @ 2025-02-19 15:05 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Jann Horn, linux-kernel, x86, virtualization, linux-arm-kernel,
	loongarch, linux-riscv, linux-perf-users, xen-devel, kvm,
	linux-arch, rcu, linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote:
> On 17/01/25 16:52, Jann Horn wrote:
> > On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> On 14/01/25 19:16, Jann Horn wrote:
> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> >> flush_tlb_kernel_range() IPIs.
> >> >>
> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> >> range, these IPIs could be deferred until their next kernel entry.
> >> >>
> >> >> Deferral vs early entry danger zone
> >> >> ===================================
> >> >>
> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> >> and then accessed in early entry code.
> >> >
> >> > In other words, it needs a guarantee that no vmalloc allocations that
> >> > have been created in the vmalloc region while the CPU was idle can
> >> > then be accessed during early entry, right?
> >>
> >> I'm not sure if that would be a problem (not an mm expert, please do
> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> >> deferred anyway.
> >
> > flush_cache_vmap() is about stuff like flushing data caches on
> > architectures with virtually indexed caches; that doesn't do TLB
> > maintenance. When you look for its definition on x86 or arm64, you'll
> > see that they use the generic implementation which is simply an empty
> > inline function.
> >
> >> So after vmapping something, I wouldn't expect isolated CPUs to have
> >> invalid TLB entries for the newly vmapped page.
> >>
> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> >> stale TLB entries can and will remain on isolated CPUs, up until they
> >> execute the deferred flush themselves (IOW for the entire duration of the
> >> "danger zone").
> >>
> >> Does that make sense?
> >
> > The design idea wrt TLB flushes in the vmap code is that you don't do
> > TLB flushes when you unmap stuff or when you map stuff, because doing
> > TLB flushes across the entire system on every vmap/vunmap would be a
> > bit costly; instead you just do batched TLB flushes in between, in
> > __purge_vmap_area_lazy().
> >
> > In other words, the basic idea is that you can keep calling vmap() and
> > vunmap() a bunch of times without ever doing TLB flushes until you run
> > out of virtual memory in the vmap region; then you do one big TLB
> > flush, and afterwards you can reuse the free virtual address space for
> > new allocations again.
> >
> > So if you "defer" that batched TLB flush for CPUs that are not
> > currently running in the kernel, I think the consequence is that those
> > CPUs may end up with incoherent TLB state after a reallocation of the
> > virtual address space.
> >
> 
> Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
> that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
> CPU accesses it during early entry;

So the issue is:

CPU1: unmaps vmalloc page X which was previously mapped to physical page
P1.

CPU2: does a whole bunch of vmalloc and vfree eventually crossing some lazy
threshold and sending out IPIs. It then goes ahead and does an allocation
that maps the same virtual page X to physical page P2.

CPU3 is isolated and executes some early entry code before receiving said IPIs
which are supposedly deferred by Valentin's patches.

It does not receive the IPI because it is deferred, thus access by early
entry code to page X on this CPU results in a UAF access to P1.

Is that the issue?

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-19  0:39                   ` Dave Hansen
@ 2025-02-19 15:13                     ` Valentin Schneider
  2025-02-19 20:25                       ` Dave Hansen
  0 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-02-19 15:13 UTC (permalink / raw)
  To: Dave Hansen, Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 18/02/25 16:39, Dave Hansen wrote:
> On 2/18/25 14:40, Valentin Schneider wrote:
>>> In practice, it's mostly limited like that.
>>>
>>> Architecturally, there are no promises from the CPU. It is within its
>>> rights to cache anything from the page tables at any time. If it's in
>>> the CR3 tree, it's fair game.
>>>
>> So what if the VMEMMAP range *isn't* in the CR3 tree when a CPU is
>> executing in userspace?
>>
>> AIUI that's the case with kPTI - the remaining kernel pages should mostly
>> be .entry.text and cpu_entry_area, at least for x86.
>
> Having part of VMEMMAP not in the CR3 tree should be harmless while
> running userspace. VMEMMAP is a purely software structure; the hardware
> doesn't do implicit supervisor accesses to it. It's also not needed in
> super early entry.
>
> Maybe I missed part of the discussion though. Is VMEMMAP your only
> concern? I would have guessed that the more generic vmalloc()
> functionality would be harder to pin down.

Urgh, that'll teach me to send emails that late - I did indeed mean the
vmalloc() range, not at all VMEMMAP. IIUC *neither* are present in the user
kPTI page table and AFAICT the page table swap is done before the actual vmap'd
stack (CONFIG_VMAP_STACK=y) gets used.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-19 15:05           ` Joel Fernandes
@ 2025-02-19 16:18             ` Valentin Schneider
  2025-02-19 17:08               ` Joel Fernandes
  0 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-02-19 16:18 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Jann Horn, linux-kernel, x86, virtualization, linux-arm-kernel,
	loongarch, linux-riscv, linux-perf-users, xen-devel, kvm,
	linux-arch, rcu, linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 19/02/25 10:05, Joel Fernandes wrote:
> On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote:
>> On 17/01/25 16:52, Jann Horn wrote:
>> > On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> On 14/01/25 19:16, Jann Horn wrote:
>> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> >> flush_tlb_kernel_range() IPIs.
>> >> >>
>> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> >> range, these IPIs could be deferred until their next kernel entry.
>> >> >>
>> >> >> Deferral vs early entry danger zone
>> >> >> ===================================
>> >> >>
>> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> >> and then accessed in early entry code.
>> >> >
>> >> > In other words, it needs a guarantee that no vmalloc allocations that
>> >> > have been created in the vmalloc region while the CPU was idle can
>> >> > then be accessed during early entry, right?
>> >>
>> >> I'm not sure if that would be a problem (not an mm expert, please do
>> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> >> deferred anyway.
>> >
>> > flush_cache_vmap() is about stuff like flushing data caches on
>> > architectures with virtually indexed caches; that doesn't do TLB
>> > maintenance. When you look for its definition on x86 or arm64, you'll
>> > see that they use the generic implementation which is simply an empty
>> > inline function.
>> >
>> >> So after vmapping something, I wouldn't expect isolated CPUs to have
>> >> invalid TLB entries for the newly vmapped page.
>> >>
>> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> >> stale TLB entries can and will remain on isolated CPUs, up until they
>> >> execute the deferred flush themselves (IOW for the entire duration of the
>> >> "danger zone").
>> >>
>> >> Does that make sense?
>> >
>> > The design idea wrt TLB flushes in the vmap code is that you don't do
>> > TLB flushes when you unmap stuff or when you map stuff, because doing
>> > TLB flushes across the entire system on every vmap/vunmap would be a
>> > bit costly; instead you just do batched TLB flushes in between, in
>> > __purge_vmap_area_lazy().
>> >
>> > In other words, the basic idea is that you can keep calling vmap() and
>> > vunmap() a bunch of times without ever doing TLB flushes until you run
>> > out of virtual memory in the vmap region; then you do one big TLB
>> > flush, and afterwards you can reuse the free virtual address space for
>> > new allocations again.
>> >
>> > So if you "defer" that batched TLB flush for CPUs that are not
>> > currently running in the kernel, I think the consequence is that those
>> > CPUs may end up with incoherent TLB state after a reallocation of the
>> > virtual address space.
>> >
>>
>> Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
>> that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
>> CPU accesses it during early entry;
>
> So the issue is:
>
> CPU1: unmaps vmalloc page X which was previously mapped to physical page
> P1.
>
> CPU2: does a whole bunch of vmalloc and vfree eventually crossing some lazy
> threshold and sending out IPIs. It then goes ahead and does an allocation
> that maps the same virtual page X to physical page P2.
>
> CPU3 is isolated and executes some early entry code before receiving said IPIs
> which are supposedly deferred by Valentin's patches.
>
> It does not receive the IPI because it is deferred, thus access by early
> entry code to page X on this CPU results in a UAF access to P1.
>
> Is that the issue?
>

Pretty much so yeah. That is, *if* there is such a vmalloc'd address access in
early entry code - testing says it's not the case, but I haven't found a
way to verify this through instrumentation.
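
For illustration, a rough sketch of the shape such a deferral could take -
the helpers ct_cpu_in_user() and ct_set_deferred_flush() are names made up
for this sketch, this is not the actual series code:

  /*
   * Sketch only: defer the kernel-range flush for CPUs currently running
   * userspace, IPI the rest.  ct_cpu_in_user() and ct_set_deferred_flush()
   * are hypothetical helpers.
   */
  static void flush_tlb_kernel_range_deferrable(void)
  {
          cpumask_var_t ipi_mask;
          int cpu;

          if (!zalloc_cpumask_var(&ipi_mask, GFP_ATOMIC))
                  return; /* a real version would fall back to IPI'ing everyone */

          for_each_online_cpu(cpu) {
                  if (ct_cpu_in_user(cpu))
                          ct_set_deferred_flush(cpu); /* runs at next user->kernel entry */
                  else
                          cpumask_set_cpu(cpu, ipi_mask);
          }

          /* Only CPUs currently executing in the kernel get the IPI. */
          on_each_cpu_mask(ipi_mask, do_flush_tlb_all, NULL, 1);
          free_cpumask_var(ipi_mask);
  }

Joel's scenario is then the case where the freed virtual range gets reused
before the isolated CPU ever runs its deferred flush.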


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-19 16:18             ` Valentin Schneider
@ 2025-02-19 17:08               ` Joel Fernandes
  2025-02-19 20:32                 ` Dave Hansen
  0 siblings, 1 reply; 86+ messages in thread
From: Joel Fernandes @ 2025-02-19 17:08 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Jann Horn, linux-kernel, x86, virtualization, linux-arm-kernel,
	loongarch, linux-riscv, linux-perf-users, xen-devel, kvm,
	linux-arch, rcu, linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang



On 2/19/2025 11:18 AM, Valentin Schneider wrote:
> On 19/02/25 10:05, Joel Fernandes wrote:
>> On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote:
>>> On 17/01/25 16:52, Jann Horn wrote:
>>>> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
>>>>> On 14/01/25 19:16, Jann Horn wrote:
>>>>>> On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>>>>>>> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>>>>>>> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>>>>>>> flush_tlb_kernel_range() IPIs.
>>>>>>>
>>>>>>> Given that CPUs executing in userspace do not access data in the vmalloc
>>>>>>> range, these IPIs could be deferred until their next kernel entry.
>>>>>>>
>>>>>>> Deferral vs early entry danger zone
>>>>>>> ===================================
>>>>>>>
>>>>>>> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>>>>>>> and then accessed in early entry code.
>>>>>>
>>>>>> In other words, it needs a guarantee that no vmalloc allocations that
>>>>>> have been created in the vmalloc region while the CPU was idle can
>>>>>> then be accessed during early entry, right?
>>>>>
>>>>> I'm not sure if that would be a problem (not an mm expert, please do
>>>>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>>>>> deferred anyway.
>>>>
>>>> flush_cache_vmap() is about stuff like flushing data caches on
>>>> architectures with virtually indexed caches; that doesn't do TLB
>>>> maintenance. When you look for its definition on x86 or arm64, you'll
>>>> see that they use the generic implementation which is simply an empty
>>>> inline function.
>>>>
>>>>> So after vmapping something, I wouldn't expect isolated CPUs to have
>>>>> invalid TLB entries for the newly vmapped page.
>>>>>
>>>>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>>>>> stale TLB entries can and will remain on isolated CPUs, up until they
>>>>> execute the deferred flush themselves (IOW for the entire duration of the
>>>>> "danger zone").
>>>>>
>>>>> Does that make sense?
>>>>
>>>> The design idea wrt TLB flushes in the vmap code is that you don't do
>>>> TLB flushes when you unmap stuff or when you map stuff, because doing
>>>> TLB flushes across the entire system on every vmap/vunmap would be a
>>>> bit costly; instead you just do batched TLB flushes in between, in
>>>> __purge_vmap_area_lazy().
>>>>
>>>> In other words, the basic idea is that you can keep calling vmap() and
>>>> vunmap() a bunch of times without ever doing TLB flushes until you run
>>>> out of virtual memory in the vmap region; then you do one big TLB
>>>> flush, and afterwards you can reuse the free virtual address space for
>>>> new allocations again.
>>>>
>>>> So if you "defer" that batched TLB flush for CPUs that are not
>>>> currently running in the kernel, I think the consequence is that those
>>>> CPUs may end up with incoherent TLB state after a reallocation of the
>>>> virtual address space.
>>>>
>>>
>>> Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
>>> that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
>>> CPU accesses it during early entry;
>>
>> So the issue is:
>>
>> CPU1: unmaps vmalloc page X which was previously mapped to physical page
>> P1.
>>
>> CPU2: does a whole bunch of vmalloc and vfree eventually crossing some lazy
>> threshold and sending out IPIs. It then goes ahead and does an allocation
>> that maps the same virtual page X to physical page P2.
>>
>> CPU3 is isolated and executes some early entry code before receiving said IPIs
>> which are supposedly deferred by Valentin's patches.
>>
>> It does not receive the IPI because it is deferred, thus access by early
>> entry code to page X on this CPU results in a UAF access to P1.
>>
>> Is that the issue?
>>
> 
> Pretty much so yeah. That is, *if* there is such a vmalloc'd address access in
> early entry code - testing says it's not the case, but I haven't found a
> way to verify this through instrumentation.
Ok, thanks for confirming. Maybe there is an address sanitizer way of verifying,
but yeah it is subtle and there could be more than one way of solving it. Too
much 'fun' ;)

 - Joel



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-19 15:13                     ` Valentin Schneider
@ 2025-02-19 20:25                       ` Dave Hansen
  2025-02-20 17:10                         ` Valentin Schneider
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Hansen @ 2025-02-19 20:25 UTC (permalink / raw)
  To: Valentin Schneider, Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 2/19/25 07:13, Valentin Schneider wrote:
>> Maybe I missed part of the discussion though. Is VMEMMAP your only
>> concern? I would have guessed that the more generic vmalloc()
>> functionality would be harder to pin down.
> Urgh, that'll teach me to send emails that late - I did indeed mean the
> vmalloc() range, not at all VMEMMAP. IIUC *neither* are present in the user
> kPTI page table and AFAICT the page table swap is done before the actual vmap'd
> stack (CONFIG_VMAP_STACK=y) gets used.

OK, so rewriting your question... ;)

> So what if the vmalloc() range *isn't* in the CR3 tree when a CPU is
> executing in userspace?

The LDT and maybe the PEBS buffers are the only implicit supervisor
accesses to vmalloc()'d memory that I can think of. But those are both
handled specially and shouldn't ever get zapped while in use. The LDT
replacement has its own IPIs separate from TLB flushing.

But I'm actually not all that worried about accesses while actually
running userspace. It's that "danger zone" in the kernel between entry
and when the TLB might have dangerous garbage in it.

BTW, I hope this whole thing is turned off on 32-bit. There, we can
actually take and handle faults on the vmalloc() area. If you get one of
those faults in your "danger zone", it'll start running page fault code
which will branch out to god-knows-where and certainly isn't noinstr.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-19 17:08               ` Joel Fernandes
@ 2025-02-19 20:32                 ` Dave Hansen
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2025-02-19 20:32 UTC (permalink / raw)
  To: Joel Fernandes, Valentin Schneider
  Cc: Jann Horn, linux-kernel, x86, virtualization, linux-arm-kernel,
	loongarch, linux-riscv, linux-perf-users, xen-devel, kvm,
	linux-arch, rcu, linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 2/19/25 09:08, Joel Fernandes wrote:
>> Pretty much so yeah. That is, *if* there is such a vmalloc'd address access in
>> early entry code - testing says it's not the case, but I haven't found a
>> way to verify this through instrumentation.
> Ok, thanks for confirming. Maybe there is an address sanitizer way of verifying,
> but yeah it is subtle and there could be more than one way of solving it. Too
> much 'fun' 😉
For debugging, you could just make a copy of part or all of the page
tables and run the NOHZ_FULL tasks from those while they're in
userspace. Then, instead of flushing the TLB in the deferred work, you
switch over to the "real" page tables.

That would _behave_ like a CPU with a big TLB and really old, crusty TLB
entries from the last time the kernel ran.
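
As a very rough sketch of that idea - names like nohz_dbg_pgd and
deferred_flush_pending() are made up, and PCID/global-page details are
deliberately ignored:

  /* Debug-only sketch: frozen copy of the task's page tables, per CPU. */
  static DEFINE_PER_CPU(pgd_t *, nohz_dbg_pgd);

  static void nohz_dbg_return_to_user(void)
  {
          /* Userspace (and early entry) runs against the stale copy... */
          write_cr3(__pa(this_cpu_read(nohz_dbg_pgd)));
  }

  static void nohz_dbg_run_deferred_work(void)
  {
          /*
           * ...and instead of a TLB flush, switch back to the live page
           * tables.  Everything early entry touched went through mappings
           * as old as the last kernel->user transition, i.e. the "big TLB
           * full of crusty entries" behaviour being simulated above.
           */
          if (deferred_flush_pending())   /* hypothetical */
                  write_cr3(__pa(current->mm->pgd));
  }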

BTW, the other option for all of this is just to say that if you want
IPI-free TLB flushing, you need to go buy some hardware with it, as
opposed to all of this complexity.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-19 20:25                       ` Dave Hansen
@ 2025-02-20 17:10                         ` Valentin Schneider
  2025-02-20 17:38                           ` Dave Hansen
  0 siblings, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-02-20 17:10 UTC (permalink / raw)
  To: Dave Hansen, Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 19/02/25 12:25, Dave Hansen wrote:
> On 2/19/25 07:13, Valentin Schneider wrote:
>>> Maybe I missed part of the discussion though. Is VMEMMAP your only
>>> concern? I would have guessed that the more generic vmalloc()
>>> functionality would be harder to pin down.
>> Urgh, that'll teach me to send emails that late - I did indeed mean the
>> vmalloc() range, not at all VMEMMAP. IIUC *neither* are present in the user
>> kPTI page table and AFAICT the page table swap is done before the actual vmap'd
>> stack (CONFIG_VMAP_STACK=y) gets used.
>
> OK, so rewriting your question... ;)
>
>> So what if the vmalloc() range *isn't* in the CR3 tree when a CPU is
>> executing in userspace?
>
> The LDT and maybe the PEBS buffers are the only implicit supervisor
> accesses to vmalloc()'d memory that I can think of. But those are both
> handled specially and shouldn't ever get zapped while in use. The LDT
> replacement has its own IPIs separate from TLB flushing.
>
> But I'm actually not all that worried about accesses while actually
> running userspace. It's that "danger zone" in the kernel between entry
> and when the TLB might have dangerous garbage in it.
>

So say we have kPTI, thus no vmalloc() mapped in CR3 when running
userspace, and do a full TLB flush right before switching to userspace -
could the TLB still end up with vmalloc()-range-related entries when we're
back in the kernel and going through the danger zone?

> BTW, I hope this whole thing is turned off on 32-bit. There, we can
> actually take and handle faults on the vmalloc() area. If you get one of
> those faults in your "danger zone", it'll start running page fault code
> which will branch out to god-knows-where and certainly isn't noinstr.

Sounds... Fun. Thanks for pointing out the landmines.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-20 17:10                         ` Valentin Schneider
@ 2025-02-20 17:38                           ` Dave Hansen
  2025-02-26 16:52                             ` Valentin Schneider
  2025-03-25 17:52                             ` Valentin Schneider
  0 siblings, 2 replies; 86+ messages in thread
From: Dave Hansen @ 2025-02-20 17:38 UTC (permalink / raw)
  To: Valentin Schneider, Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 2/20/25 09:10, Valentin Schneider wrote:
>> The LDT and maybe the PEBS buffers are the only implicit supervisor
>> accesses to vmalloc()'d memory that I can think of. But those are both
>> handled specially and shouldn't ever get zapped while in use. The LDT
>> replacement has its own IPIs separate from TLB flushing.
>>
>> But I'm actually not all that worried about accesses while actually
>> running userspace. It's that "danger zone" in the kernel between entry
>> and when the TLB might have dangerous garbage in it.
>>
> So say we have kPTI, thus no vmalloc() mapped in CR3 when running
> userspace, and do a full TLB flush right before switching to userspace -
> could the TLB still end up with vmalloc()-range-related entries when we're
> back in the kernel and going through the danger zone?

Yes, because the danger zone includes the switch back to the kernel CR3
with vmalloc() fully mapped. All bets are off about what's in the TLB
the moment that CR3 write occurs.

Actually, you could probably use that.

If a mapping is in the PTI user page table, you can't defer the flushes
for it. Basically the same rule for text poking in the danger zone.

If there's a deferred flush pending, make sure that all of the
SWITCH_TO_KERNEL_CR3's fully flush the TLB. You'd need something similar
to user_pcid_flush_mask.
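
In C-ish terms (the real SWITCH_TO_KERNEL_CR3 lives in the entry assembly,
and the flag below is an invented name), the suggestion is roughly:

  /* Sketch only: per-CPU "a kernel TLB flush was deferred" flag. */
  static DEFINE_PER_CPU(bool, kernel_tlb_flush_pending);

  static void defer_kernel_tlb_flush(int cpu)
  {
          per_cpu(kernel_tlb_flush_pending, cpu) = true;
  }

  /* Conceptually what SWITCH_TO_KERNEL_CR3 would have to check on entry. */
  static void switch_to_kernel_cr3_check(void)
  {
          if (this_cpu_read(kernel_tlb_flush_pending)) {
                  this_cpu_write(kernel_tlb_flush_pending, false);
                  __flush_tlb_all();      /* full flush, globals included */
          }
  }

i.e. the same pattern as user_pcid_flush_mask, only applied on the way into
the kernel rather than on the way out.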

But, honestly, I'm still not sure this is worth all the trouble. If
folks want to avoid IPIs for TLB flushes, there are hardware features
that *DO* that. Just get new hardware instead of adding this complicated
pile of software that we have to maintain forever. In 10 years, we'll
still have this software *and* 95% of our hardware has the hardware
feature too.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-20 17:38                           ` Dave Hansen
@ 2025-02-26 16:52                             ` Valentin Schneider
  2025-03-25 17:52                             ` Valentin Schneider
  1 sibling, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-02-26 16:52 UTC (permalink / raw)
  To: Dave Hansen, Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 20/02/25 09:38, Dave Hansen wrote:
> On 2/20/25 09:10, Valentin Schneider wrote:
>>> The LDT and maybe the PEBS buffers are the only implicit supervisor
>>> accesses to vmalloc()'d memory that I can think of. But those are both
>>> handled specially and shouldn't ever get zapped while in use. The LDT
>>> replacement has its own IPIs separate from TLB flushing.
>>>
>>> But I'm actually not all that worried about accesses while actually
>>> running userspace. It's that "danger zone" in the kernel between entry
>>> and when the TLB might have dangerous garbage in it.
>>>
>> So say we have kPTI, thus no vmalloc() mapped in CR3 when running
>> userspace, and do a full TLB flush right before switching to userspace -
>> could the TLB still end up with vmalloc()-range-related entries when we're
>> back in the kernel and going through the danger zone?
>
> Yes, because the danger zone includes the switch back to the kernel CR3
> with vmalloc() fully mapped. All bets are off about what's in the TLB
> the moment that CR3 write occurs.
>
> Actually, you could probably use that.
>
> If a mapping is in the PTI user page table, you can't defer the flushes
> for it. Basically the same rule for text poking in the danger zone.
>
> If there's a deferred flush pending, make sure that all of the
> SWITCH_TO_KERNEL_CR3's fully flush the TLB. You'd need something similar
> to user_pcid_flush_mask.
>

Right, that's what I (roughly) had in mind...

> But, honestly, I'm still not sure this is worth all the trouble. If
> folks want to avoid IPIs for TLB flushes, there are hardware features
> that *DO* that. Just get new hardware instead of adding this complicated
> pile of software that we have to maintain forever. In 10 years, we'll
> still have this software *and* 95% of our hardware has the hardware
> feature too.

... But yeah, it pretty much circumvents arch_context_tracking_work, or at
the very least adds an early(er) flushing of the context tracking
work... Urgh.

Thank you for grounding my wild ideas into reality. I'll try to think some
more see if I see any other way out (other than "buy hardware that does
what you want and ditch the one that doesn't").


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-02-20 17:38                           ` Dave Hansen
  2025-02-26 16:52                             ` Valentin Schneider
@ 2025-03-25 17:52                             ` Valentin Schneider
  2025-03-25 18:41                               ` Jann Horn
  1 sibling, 1 reply; 86+ messages in thread
From: Valentin Schneider @ 2025-03-25 17:52 UTC (permalink / raw)
  To: Dave Hansen, Jann Horn
  Cc: linux-kernel, x86, virtualization, linux-arm-kernel, loongarch,
	linux-riscv, linux-perf-users, xen-devel, kvm, linux-arch, rcu,
	linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 20/02/25 09:38, Dave Hansen wrote:
> But, honestly, I'm still not sure this is worth all the trouble. If
> folks want to avoid IPIs for TLB flushes, there are hardware features
> that *DO* that. Just get new hardware instead of adding this complicated
> pile of software that we have to maintain forever. In 10 years, we'll
> still have this software *and* 95% of our hardware has the hardware
> feature too.

Sorry, you're going to have to deal with my ignorance a little bit longer...

Were you thinking x86 hardware specifically, or something else?
AIUI things like arm64's TLBIVMALLE1IS can do what is required without any
IPI:

C5.5.78
"""
The invalidation applies to all PEs in the same Inner Shareable shareability domain as the PE that
executes this System instruction.
"""

But for (at least) these architectures:

  alpha
  x86
  loongarch
  mips
  (non-freescale 8xx) powerpc
  riscv
  xtensa

flush_tlb_kernel_range() has a path with a hardcoded use of on_each_cpu(),
so AFAICT for these the IPIs will be sent no matter the hardware.
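
For reference, x86's flush_tlb_kernel_range() boils down to something like
the below (simplified paraphrase of arch/x86/mm/tlb.c, not verbatim) - both
branches end in on_each_cpu(), i.e. an IPI to every online CPU regardless
of what the hardware could do:

  void flush_tlb_kernel_range(unsigned long start, unsigned long end)
  {
          /* Big ranges: flush everything, everywhere. */
          if (end == TLB_FLUSH_ALL ||
              (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
                  on_each_cpu(do_flush_tlb_all, NULL, 1);
          } else {
                  /* Small ranges: per-page flush, still on every CPU. */
                  struct flush_tlb_info info = { .start = start, .end = end };

                  on_each_cpu(do_kernel_range_flush, &info, 1);
          }
  }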



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-03-25 17:52                             ` Valentin Schneider
@ 2025-03-25 18:41                               ` Jann Horn
  2025-03-26  8:56                                 ` Valentin Schneider
  0 siblings, 1 reply; 86+ messages in thread
From: Jann Horn @ 2025-03-25 18:41 UTC (permalink / raw)
  To: Valentin Schneider, Rik van Riel
  Cc: Dave Hansen, linux-kernel, x86, virtualization, linux-arm-kernel,
	loongarch, linux-riscv, linux-perf-users, xen-devel, kvm,
	linux-arch, rcu, linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On Tue, Mar 25, 2025 at 6:52 PM Valentin Schneider <vschneid@redhat.com> wrote:
> On 20/02/25 09:38, Dave Hansen wrote:
> > But, honestly, I'm still not sure this is worth all the trouble. If
> > folks want to avoid IPIs for TLB flushes, there are hardware features
> > that *DO* that. Just get new hardware instead of adding this complicated
> > pile of software that we have to maintain forever. In 10 years, we'll
> > still have this software *and* 95% of our hardware has the hardware
> > feature too.
>
> Sorry, you're going to have to deal with my ignorance a little bit longer...
>
> Were you thinking x86 hardware specifically, or something else?
> AIUI things like arm64's TLBIVMALLE1IS can do what is required without any
> IPI:
>
> C5.5.78
> """
> The invalidation applies to all PEs in the same Inner Shareable shareability domain as the PE that
> executes this System instruction.
> """
>
> But for (at least) these architectures:
>
>   alpha
>   x86
>   loongarch
>   mips
>   (non-freescale 8xx) powerpc
>   riscv
>   xtensa
>
> flush_tlb_kernel_range() has a path with a hardcoded use of on_each_cpu(),
> so AFAICT for these the IPIs will be sent no matter the hardware.

On X86, both AMD and Intel have some fairly recently introduced CPU
features that can shoot down TLBs remotely.

The patch series
<https://lore.kernel.org/all/20250226030129.530345-1-riel@surriel.com/>
adds support for the AMD flavor; that series landed in the current
merge window (it's present in the mainline git repository now and should
be part of 6.15). I think support for the Intel flavor has not yet
been implemented, but the linked patch series mentions a plan to look
at the Intel flavor next.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2025-03-25 18:41                               ` Jann Horn
@ 2025-03-26  8:56                                 ` Valentin Schneider
  0 siblings, 0 replies; 86+ messages in thread
From: Valentin Schneider @ 2025-03-26  8:56 UTC (permalink / raw)
  To: Jann Horn, Rik van Riel
  Cc: Dave Hansen, linux-kernel, x86, virtualization, linux-arm-kernel,
	loongarch, linux-riscv, linux-perf-users, xen-devel, kvm,
	linux-arch, rcu, linux-hardening, linux-mm, linux-kselftest, bpf,
	bcm-kernel-feedback-list, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta,
	Sean Christopherson, Paolo Bonzini, Andy Lutomirski,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams, Yair Podemsky,
	Tomas Glozar, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
	Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
	Mike Rapoport (Microsoft), Samuel Holland, Rong Xu,
	Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
	Kirill A. Shutemov, Masami Hiramatsu (Google), Jinghao Jia,
	Luis Chamberlain, Randy Dunlap, Tiezhu Yang

On 25/03/25 19:41, Jann Horn wrote:
> On Tue, Mar 25, 2025 at 6:52 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> On 20/02/25 09:38, Dave Hansen wrote:
>> > But, honestly, I'm still not sure this is worth all the trouble. If
>> > folks want to avoid IPIs for TLB flushes, there are hardware features
>> > that *DO* that. Just get new hardware instead of adding this complicated
>> > pile of software that we have to maintain forever. In 10 years, we'll
>> > still have this software *and* 95% of our hardware has the hardware
>> > feature too.
>>
>> Sorry, you're going to have to deal with my ignorance a little bit longer...
>>
>> Were you thinking x86 hardware specifically, or something else?
>> AIUI things like arm64's TLBIVMALLE1IS can do what is required without any
>> IPI:
>>
>> C5.5.78
>> """
>> The invalidation applies to all PEs in the same Inner Shareable shareability domain as the PE that
>> executes this System instruction.
>> """
>>
>> But for (at least) these architectures:
>>
>>   alpha
>>   x86
>>   loongarch
>>   mips
>>   (non-freescale 8xx) powerpc
>>   riscv
>>   xtensa
>>
>> flush_tlb_kernel_range() has a path with a hardcoded use of on_each_cpu(),
>> so AFAICT for these the IPIs will be sent no matter the hardware.
>
> On X86, both AMD and Intel have some fairly recently introduced CPU
> features that can shoot down TLBs remotely.
>
> The patch series
> <https://lore.kernel.org/all/20250226030129.530345-1-riel@surriel.com/>
> adds support for the AMD flavor; that series landed in the current
> merge window (it's present in the mainline git repository now and should
> be part of 6.15). I think support for the Intel flavor has not yet
> been implemented, but the linked patch series mentions a plan to look
> at the Intel flavor next.

Thanks for the info!


^ permalink raw reply	[flat|nested] 86+ messages in thread

end of thread, other threads:[~2025-03-26  8:56 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-14 17:51 [PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 01/30] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 02/30] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 03/30] rcu: Add a small-width RCU watching counter debug option Valentin Schneider
2025-01-21 13:56   ` Frederic Weisbecker
2025-01-14 17:51 ` [PATCH v4 04/30] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE Valentin Schneider
2025-01-21 14:00   ` Frederic Weisbecker
2025-01-14 17:51 ` [PATCH v4 05/30] jump_label: Add annotations for validating noinstr usage Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 06/30] static_call: Add read-only-after-init static calls Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 07/30] x86/paravirt: Mark pv_sched_clock static call as __ro_after_init Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 08/30] x86/idle: Mark x86_idle " Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 09/30] x86/paravirt: Mark pv_steal_clock " Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 10/30] riscv/paravirt: " Valentin Schneider
2025-01-14 18:29   ` Andrew Jones
2025-01-14 17:51 ` [PATCH v4 11/30] loongarch/paravirt: " Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 12/30] arm64/paravirt: " Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 13/30] arm/paravirt: " Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 14/30] perf/x86/amd: Mark perf_lopwr_cb " Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 15/30] sched/clock: Mark sched_clock_running key " Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 16/30] x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 17/30] sched/clock, x86: Mark __sched_clock_stable " Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 18/30] x86/kvm/vmx: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys " Valentin Schneider
2025-01-14 21:19   ` Sean Christopherson
2025-01-17  9:50     ` Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 19/30] stackleack: Mark stack_erasing_bypass key " Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 20/30] objtool: Add noinstr validation for static branches/calls Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 21/30] context_tracking: Explicitely use CT_STATE_KERNEL where it is missing Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 22/30] context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry Valentin Schneider
2025-01-22  0:22   ` Frederic Weisbecker
2025-01-22  1:04     ` Sean Christopherson
2025-01-27 11:17     ` Valentin Schneider
2025-02-07 17:06       ` Valentin Schneider
2025-02-07 18:37         ` Frederic Weisbecker
2025-02-10 17:36           ` Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 23/30] context_tracking: Turn CT_STATE_* into bits Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 24/30] context-tracking: Introduce work deferral infrastructure Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
2025-01-14 21:13   ` Sean Christopherson
2025-01-14 21:48     ` Sean Christopherson
2025-01-17  9:54       ` Valentin Schneider
2025-01-17  9:47     ` Valentin Schneider
2025-01-17 17:15       ` Sean Christopherson
2025-01-20 13:53         ` Valentin Schneider
2025-01-14 21:26   ` Sean Christopherson
2025-01-24 10:48   ` K Prateek Nayak
2025-01-14 17:51 ` [PATCH v4 26/30] x86,tlb: Make __flush_tlb_global() noinstr-compliant Valentin Schneider
2025-01-14 21:45   ` Dave Hansen
2025-01-17 13:44     ` Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 27/30] x86/tlb: Make __flush_tlb_local() noinstr-compliant Valentin Schneider
2025-01-14 21:24   ` Sean Christopherson
2025-01-14 17:51 ` [PATCH v4 28/30] x86/tlb: Make __flush_tlb_all() noinstr Valentin Schneider
2025-01-14 17:51 ` [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
2025-01-14 18:16   ` Jann Horn
2025-01-17 15:25     ` Valentin Schneider
2025-01-17 15:52       ` Jann Horn
2025-01-17 16:53         ` Valentin Schneider
2025-02-19 15:05           ` Joel Fernandes
2025-02-19 16:18             ` Valentin Schneider
2025-02-19 17:08               ` Joel Fernandes
2025-02-19 20:32                 ` Dave Hansen
2025-01-27 15:51         ` Will Deacon
2025-02-10 18:36         ` Valentin Schneider
2025-02-10 22:08           ` Jann Horn
2025-02-11 13:33             ` Valentin Schneider
2025-02-11 14:03               ` Mark Rutland
2025-02-11 16:09                 ` Valentin Schneider
2025-02-11 14:22               ` Dave Hansen
2025-02-11 16:10                 ` Valentin Schneider
2025-02-18 22:40                 ` Valentin Schneider
2025-02-19  0:39                   ` Dave Hansen
2025-02-19 15:13                     ` Valentin Schneider
2025-02-19 20:25                       ` Dave Hansen
2025-02-20 17:10                         ` Valentin Schneider
2025-02-20 17:38                           ` Dave Hansen
2025-02-26 16:52                             ` Valentin Schneider
2025-03-25 17:52                             ` Valentin Schneider
2025-03-25 18:41                               ` Jann Horn
2025-03-26  8:56                                 ` Valentin Schneider
2025-01-17 16:11       ` Uladzislau Rezki
2025-01-17 17:00         ` Valentin Schneider
2025-01-20 11:15           ` Uladzislau Rezki
2025-01-20 16:09             ` Valentin Schneider
2025-01-21 17:00               ` Uladzislau Rezki
2025-01-24 15:22                 ` Valentin Schneider
2025-01-27 10:36                   ` Uladzislau Rezki
2025-01-14 17:51 ` [PATCH v4 30/30] context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE Valentin Schneider

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).