* [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition
@ 2026-03-24  9:47 Valentin Schneider
From: Valentin Schneider @ 2026-03-24  9:47 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
	Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
	Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
	Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
	Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
	Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik,
	Shrikanth Hegde

Context
=======

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

  64359.052209596    NetworkManager       0    1405     smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
    smp_call_function_many_cond+0x1
    smp_call_function+0x39
    on_each_cpu+0x2a
    flush_tlb_kernel_range+0x7b
    __purge_vmap_area_lazy+0x70
    _vm_unmap_aliases.part.42+0xdf
    change_page_attr_set_clr+0x16a
    set_memory_ro+0x26
    bpf_int_jit_compile+0x2f9
    bpf_prog_select_runtime+0xc6
    bpf_prepare_filter+0x523
    sk_attach_filter+0x13
    sock_setsockopt+0x92c
    __sys_setsockopt+0x16a
    __x64_sys_setsockopt+0x20
    do_syscall_64+0x87
    entry_SYSCALL_64_after_hwframe+0x65

The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.

The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.

Deferral approach
=================

Previous versions would assign IPIs a "type" and have a mapping of IPI type to
callback, leveraged upon kernel entry via the context_tracking framework.

This version gets rid of all that, and instead goes with an
"unconditionally run a catch-up sequence at kernel entry" approach - as was
suggested at LPC 2025 [3].

Another point made during LPC25 (sorry I didn't get your name!) was that when
kPTI is in use, the use of global pages is very limited and thus a CR4 write may
not be warranted for a kernel TLB flush. That means the existing CR3 RMW used to
switch between kernel and user page tables can be used as the unconditional TLB
flush, meaning I could get rid of my CR4 dance.

In the same spirit, turns out a CR3 RMW is a serializing instruction:

  SDM vol2 chapter 4.3 - Move to/from control registers:
  ```
  MOV CR* instructions, except for MOV CR8, are serializing instructions.
  ```
That means I don't need to do anything extra on kernel entry to handle deferred
sync_core() IPIs sent from text_poke().
  
So long story short, the CR3 RMW that is executed for every user <-> kernel
transition when kPTI is enabled does everything I need to defer kernel TLB flush
and kernel text update IPIs. 

From that, I've completely nuked the context_tracking deferral faff.
The added x86-specific code is now "just" about having a software signal
to figure out which CR3 a CPU is using - easier said than done, details in
the individual changelogs.
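
As a rough illustration of the software signal idea, here is a userspace toy
model, not the code from the patches. Only the kernel_cr3_loaded name comes
from this series; the sim_* helpers, NR_CPUS value and entry_flushes counter
are made up for the example, and the exit-side ordering is an assumption. The
model also ignores the races the real entry code has to deal with.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4	/* arbitrary size for the toy model */

/* Stand-in for the per-CPU "kernel CR3 is loaded" software signal. */
static atomic_bool kernel_cr3_loaded[NR_CPUS];
/* Counts simulated kernel entries, i.e. simulated CR3 writes. */
static unsigned int entry_flushes[NR_CPUS];

/* Kernel entry: publish the signal first, then "write CR3". */
static void sim_kernel_entry(size_t cpu)
{
	assert(cpu < NR_CPUS);
	atomic_store(&kernel_cr3_loaded[cpu], true);
	/* Models the CR3 write that also flushes and serializes. */
	entry_flushes[cpu]++;
}

/* Kernel exit: the CPU goes back to the user page tables. */
static void sim_kernel_exit(size_t cpu)
{
	assert(cpu < NR_CPUS);
	atomic_store(&kernel_cr3_loaded[cpu], false);
}

/*
 * Sender side: only CPUs currently on the kernel CR3 need a real IPI.
 * The others catch up through their next simulated kernel entry.
 */
static size_t sim_count_ipi_targets(void)
{
	size_t n = 0;

	for (size_t cpu = 0; cpu < NR_CPUS; cpu++)
		if (atomic_load(&kernel_cr3_loaded[cpu]))
			n++;
	return n;
}
```

With CPU 0 in the kernel and every other CPU in userspace,
sim_count_ipi_targets() returns 1: the userspace CPUs are skipped and rely on
the CR3 write of their next entry instead.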

Kernel entry vs execution of the deferred operation
===================================================

This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].

There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can itself be executed (before we start getting into
context_tracking.c proper), i.e.:

  idtentry
    idtentry_body
      error_entry
        SWITCH_TO_KERNEL_CR3

This danger zone used to be much wider in v7 and earlier (from kernel entry all
the way down to ct_kernel_enter_state()). The objtool instrumentation thus now
targets .entry.text rather than .noinstr as a whole.

Show me numbers
===============

Xeon E5-2699 system with SMT off, NOHZ_FULL, 26 isolated CPUs.
RHEL10 userspace.

Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:

$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
                   -R "stacktrace if cpu & CPUS{$ISOL_CPUS}" \
                   -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
                   -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
                   rteval --onlyload --loads-cpulist=$HK_CPUS \
                   --hackbench-runlowmem=True --duration=$DURATION

This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 6 hours.

v6.19
o ~6000 IPIs received, so about 230 interfering IPIs per isolated CPU
o About one interfering IPI every 1 minute 30 seconds on each isolated CPU
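
The per-CPU figures follow from the totals by simple division. A small sketch
of that arithmetic, with helper names that are made up for the example:

```c
#include <assert.h>

/* Average number of IPIs received by each isolated CPU. */
static double ipis_per_cpu(double total_ipis, unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	return total_ipis / nr_cpus;
}

/* Average gap, in seconds, between two IPIs hitting the same CPU. */
static double seconds_between_ipis(double duration_s, double ipis_on_cpu)
{
	assert(ipis_on_cpu > 0);
	return duration_s / ipis_on_cpu;
}
```

Feeding in ~6000 IPIs over 26 isolated CPUs gives roughly 230 IPIs per CPU, and
6 hours (21600 seconds) divided by that gives a gap of a bit over a minute and
a half.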

v6.19 + patches
o Zilch... With some caveats

  I still get some TLB flush IPIs sent to seemingly still-in-userspace CPUs,
  about one per ~3h for /some/ runs. I haven't seen any in the last cumulative
  24h of testing...

  pcpu_balance_work also sometimes shows up, and isn't covered by the deferral
  faff. Again, sometimes it shows up, sometimes it doesn't and hasn't for a
  while now.

Patches
=======

o Patches 1-4 are standalone objtool cleanups.

o Patches 5-6 add infrastructure for annotating static keys that may be used in
  entry code (courtesy of Josh). 

o Patch 7 adds ASM support for static keys.

o Patches 8-10 add the deferral mechanism.

Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v8

Acknowledgements
================

Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o Dave Hansen for patiently educating me about mm
o All of the folks who attended various (too many?) talks about this and
  provided precious feedback.  

Links
=====

[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://lpc.events/event/19/contributions/2219/
[4]: https://lpc.events/event/18/contributions/1889/

Revisions
=========

v7 -> v8
++++++++

o Rebased onto v6.19

o Fixed objtool --uaccess validation preventing --noinstr validation of
  unwind hints
o Added more objtool --noinstr warning fixes
o Reduced objtool noinstr static key validation to just .entry.text

o Moved the kernel_cr3_loaded signal update to before writing to CR3

o Ditched context_tracking based deferral
o Ditched the (additional) unconditional TLB flush upon kernel entry

v6 -> v7
++++++++

o Rebased onto latest v6.18-rc5 (6fa9041b7177f)
o Collected Acks (Sean, Frederic)

o Fixed <asm/context_tracking_work.h> include (Shrikanth)
o Fixed ct_set_cpu_work() CT_RCU_WATCHING logic (Frederic)

o Wrote more verbose comments about NOINSTR static keys and calls (Petr)

o [NEW PATCH] Instrumented one more static key: cpu_buf_vm_clear
o [NEW PATCH] added ASM-accessible static key helpers to gate NO_HZ_FULL logic
  in early entry code (Frederic)

v5 -> v6
++++++++

o Rebased onto v6.17
o Small conflict fixes with cpu_buf_idle_clear smp_text_poke() renaming

o Added the TLB flush craziness

v4 -> v5
++++++++

o Rebased onto v6.15-rc3
o Collected Reviewed-by

o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such as
  KVM early entry (Sean Christopherson)

o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or
  CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI
  entry from idle (thanks to Frederic!)

o Ditched the vmap TLB flush deferral (for now)  
  

RFCv3 -> v4
+++++++++++

o Rebased onto v6.13-rc6

o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)

o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups

o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ

RFCv2 -> RFCv3
++++++++++++++

o Rebased onto v6.12-rc6

o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral


RFCv1 -> RFCv2
++++++++++++++

o Rebased onto v6.5-rc1

o Updated the trace filter patches (Steven)

o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
  
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
  rcutorture case for a low-size counter (Paul) 

o Fixed flush_tlb_kernel_range_deferrable() definition

Josh Poimboeuf (1):
  objtool: Add .entry.text validation for static branches

Valentin Schneider (9):
  objtool: Make validate_call() recognize indirect calls to pv_ops[]
  objtool: Flesh out warning related to pv_ops[] calls
  objtool: Always pass a section to validate_unwind_hints()
  x86/retpoline: Make warn_thunk_thunk .noinstr
  sched/isolation: Mark housekeeping_overridden key as __ro_after_init
  x86/jump_label: Add ASM support for static_branch_likely()
  x86/mm/pti: Introduce a kernel/user CR3 software signal
  context_tracking,x86: Defer kernel text patching IPIs when tracking
    CR3 switches
  x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs when tracking CR3
    switches

 arch/x86/Kconfig                        |  14 +++
 arch/x86/entry/calling.h                |  13 +++
 arch/x86/entry/entry.S                  |   3 +-
 arch/x86/entry/syscall_64.c             |   4 +
 arch/x86/include/asm/jump_label.h       |  33 +++++++-
 arch/x86/include/asm/text-patching.h    |   5 ++
 arch/x86/include/asm/tlbflush.h         |   4 +
 arch/x86/kernel/alternative.c           |  34 ++++++--
 arch/x86/kernel/cpu/bugs.c              |   2 +-
 arch/x86/kernel/kprobes/core.c          |   4 +-
 arch/x86/kernel/kprobes/opt.c           |   4 +-
 arch/x86/kernel/module.c                |   2 +-
 arch/x86/mm/pti.c                       |  36 +++++---
 arch/x86/mm/tlb.c                       |  34 ++++++--
 include/linux/jump_label.h              |  11 ++-
 include/linux/objtool.h                 |  16 ++++
 kernel/sched/isolation.c                |   2 +-
 mm/vmalloc.c                            |  30 +++++--
 tools/objtool/Documentation/objtool.txt |  12 +++
 tools/objtool/check.c                   | 108 ++++++++++++++++++++----
 tools/objtool/include/objtool/check.h   |   2 +
 tools/objtool/include/objtool/elf.h     |   3 +-
 tools/objtool/include/objtool/special.h |   1 +
 tools/objtool/special.c                 |  15 +++-
 24 files changed, 331 insertions(+), 61 deletions(-)

--
2.52.0


