* [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition
@ 2026-03-24 9:47 Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 01/10] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
` (10 more replies)
0 siblings, 11 replies; 14+ messages in thread
From: Valentin Schneider @ 2026-03-24 9:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik,
Shrikanth Hegde
Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the idea that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, provided it can be executed "early
enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Previous versions would assign IPIs a "type" and have a mapping of IPI type to
callback, leveraged upon kernel entry via the context_tracking framework.
This version now gets rid of all that, and instead goes with an
"unconditionally run a catch-up sequence at kernel entry" approach - as was
suggested at LPC 2025 [3].
Another point made during LPC25 (sorry I didn't get your name!) was that when
kPTI is in use, the use of global pages is very limited, and thus a CR4 write may
not be warranted for a kernel TLB flush. That means the existing CR3 RMW used to
switch between kernel and user page tables can serve as the unconditional TLB
flush, meaning I could get rid of my CR4 dance.
In the same spirit, it turns out a CR3 RMW is a serializing instruction:
SDM vol2 chapter 4.3 - Move to/from control registers:
```
MOV CR* instructions, except for MOV CR8, are serializing instructions.
```
That means I don't need to do anything extra on kernel entry to handle deferred
sync_core() IPIs sent from text_poke().
So long story short, the CR3 RMW that is executed for every user <-> kernel
transition when kPTI is enabled does everything I need to defer kernel TLB flush
and kernel text update IPIs.
From that, I've completely nuked the context_tracking deferral faff.
The added x86-specific code is now "just" about having a software signal
to figure out which CR3 a CPU is using - easier said than done, details in
the individual changelogs.
Kernel entry vs execution of the deferred operation
===================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can be itself executed (before we start getting into
context_tracking.c proper), i.e.:
idtentry
idtentry_body
error_entry
SWITCH_TO_KERNEL_CR3
This danger zone used to be much wider in v7 and earlier (from kernel entry all
the way down to ct_kernel_enter_state()). The objtool instrumentation thus now
targets .entry.text rather than .noinstr as a whole.
Show me numbers
===============
Xeon E5-2699 system with SMToff, NOHZ_FULL, 26 isolated CPUs.
RHEL10 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
-R "stacktrace if cpu & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
rteval --onlyload --loads-cpulist=$HK_CPUS \
--hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 6 hours.
v6.19
o ~6000 IPIs received, i.e. ~230 interfering IPIs per isolated CPU
o About one interfering IPI roughly every 1 minute 30 seconds
v6.19 + patches
o Zilch... With some caveats
I still get some TLB flush IPIs sent to seemingly still-in-userspace CPUs,
about one per ~3h for /some/ runs. I haven't seen any in the last cumulative
24h of testing...
pcpu_balance_work also sometimes shows up, and isn't covered by the deferral
faff. Again, sometimes it shows up, sometimes it doesn't and hasn't for a
while now.
Patches
=======
o Patches 1-4 are standalone objtool cleanups.
o Patches 5-6 add infrastructure for annotating static keys that may be used in
entry code (courtesy of Josh).
o Patch 7 adds ASM support for static keys
o Patches 8-10 add the deferral mechanism.
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v8
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o Dave Hansen for patiently educating me about mm
o All of the folks who attended various (too many?) talks about this and
provided precious feedback.
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://lpc.events/event/19/contributions/2219/
[4]: https://lpc.events/event/18/contributions/1889/
Revisions
=========
v7 -> v8
++++++++
o Rebased onto v6.19
o Fixed objtool --uaccess validation preventing --noinstr validation of
unwind hints
o Added more objtool --noinstr warning fixes
o Reduced objtool noinstr static key validation to just .entry.text
o Moved the kernel_cr3_loaded signal update to before writing to CR3
o Ditched context_tracking based deferral
o Ditched the (additional) unconditional TLB flush upon kernel entry
v6 -> v7
++++++++
o Rebased onto latest v6.18-rc5 (6fa9041b7177f)
o Collected Acks (Sean, Frederic)
o Fixed <asm/context_tracking_work.h> include (Shrikanth)
o Fixed ct_set_cpu_work() CT_RCU_WATCHING logic (Frederic)
o Wrote more verbose comments about NOINSTR static keys and calls (Petr)
o [NEW PATCH] Instrumented one more static key: cpu_bf_vm_clear
o [NEW PATCH] added ASM-accessible static key helpers to gate NO_HZ_FULL logic
in early entry code (Frederic)
v5 -> v6
++++++++
o Rebased onto v6.17
o Small conflict fixes with cpu_buf_idle_clear smp_text_poke() renaming
o Added the TLB flush craziness
v4 -> v5
++++++++
o Rebased onto v6.15-rc3
o Collected Reviewed-by
o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such as
KVM early entry (Sean Christopherson)
o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or
CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI
entry from idle (thanks to Frederic!)
o Ditched the vmap TLB flush deferral (for now)
RFCv3 -> v4
+++++++++++
o Rebased onto v6.13-rc6
o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)
o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups
o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc6
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Josh Poimboeuf (1):
objtool: Add .entry.text validation for static branches
Valentin Schneider (9):
objtool: Make validate_call() recognize indirect calls to pv_ops[]
objtool: Flesh out warning related to pv_ops[] calls
objtool: Always pass a section to validate_unwind_hints()
x86/retpoline: Make warn_thunk_thunk .noinstr
sched/isolation: Mark housekeeping_overridden key as __ro_after_init
x86/jump_label: Add ASM support for static_branch_likely()
x86/mm/pti: Introduce a kernel/user CR3 software signal
context_tracking,x86: Defer kernel text patching IPIs when tracking
CR3 switches
x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs when tracking CR3
switches
arch/x86/Kconfig | 14 +++
arch/x86/entry/calling.h | 13 +++
arch/x86/entry/entry.S | 3 +-
arch/x86/entry/syscall_64.c | 4 +
arch/x86/include/asm/jump_label.h | 33 +++++++-
arch/x86/include/asm/text-patching.h | 5 ++
arch/x86/include/asm/tlbflush.h | 4 +
arch/x86/kernel/alternative.c | 34 ++++++--
arch/x86/kernel/cpu/bugs.c | 2 +-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/mm/pti.c | 36 +++++---
arch/x86/mm/tlb.c | 34 ++++++--
include/linux/jump_label.h | 11 ++-
include/linux/objtool.h | 16 ++++
kernel/sched/isolation.c | 2 +-
mm/vmalloc.c | 30 +++++--
tools/objtool/Documentation/objtool.txt | 12 +++
tools/objtool/check.c | 108 ++++++++++++++++++++----
tools/objtool/include/objtool/check.h | 2 +
tools/objtool/include/objtool/elf.h | 3 +-
tools/objtool/include/objtool/special.h | 1 +
tools/objtool/special.c | 15 +++-
24 files changed, 331 insertions(+), 61 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC PATCH v8 01/10] objtool: Make validate_call() recognize indirect calls to pv_ops[]
2026-03-24 9:47 [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
@ 2026-03-24 9:47 ` Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 02/10] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
` (9 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2026-03-24 9:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
Frederic Weisbecker, Paul E. McKenney, Jason Baron,
Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik,
Shrikanth Hegde
call_dest_name() does not get passed the file pointer of validate_call(),
which means its invocation of insn_reloc() will always return NULL. Make it
take a file pointer.
While at it, make sure call_dest_name() uses arch_dest_reloc_offset(),
otherwise it gets the pv_ops[] offset wrong.
Fabricating an intentional warning shows the change; previously:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
now:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[1]() leaves .noinstr.text section
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
tools/objtool/check.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 3fd98c5b6e1a8..7c82934247484 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3388,7 +3388,7 @@ static inline bool func_uaccess_safe(struct symbol *func)
return false;
}
-static inline const char *call_dest_name(struct instruction *insn)
+static inline const char *call_dest_name(struct objtool_file *file, struct instruction *insn)
{
static char pvname[19];
struct reloc *reloc;
@@ -3397,9 +3397,9 @@ static inline const char *call_dest_name(struct instruction *insn)
if (insn_call_dest(insn))
return insn_call_dest(insn)->name;
- reloc = insn_reloc(NULL, insn);
+ reloc = insn_reloc(file, insn);
if (reloc && !strcmp(reloc->sym->name, "pv_ops")) {
- idx = (reloc_addend(reloc) / sizeof(void *));
+ idx = arch_insn_adjusted_addend(insn, reloc) / sizeof(void *);
snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx);
return pvname;
}
@@ -3478,17 +3478,19 @@ static int validate_call(struct objtool_file *file,
{
if (state->noinstr && state->instr <= 0 &&
!noinstr_call_dest(file, insn, insn_call_dest(insn))) {
- WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(insn));
+ WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn));
return 1;
}
if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) {
- WARN_INSN(insn, "call to %s() with UACCESS enabled", call_dest_name(insn));
+ WARN_INSN(insn, "call to %s() with UACCESS enabled",
+ call_dest_name(file, insn));
return 1;
}
if (state->df) {
- WARN_INSN(insn, "call to %s() with DF set", call_dest_name(insn));
+ WARN_INSN(insn, "call to %s() with DF set",
+ call_dest_name(file, insn));
return 1;
}
--
2.52.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [RFC PATCH v8 02/10] objtool: Flesh out warning related to pv_ops[] calls
2026-03-24 9:47 [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 01/10] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
@ 2026-03-24 9:47 ` Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 03/10] objtool: Always pass a section to validate_unwind_hints() Valentin Schneider
` (8 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2026-03-24 9:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
Frederic Weisbecker, Paul E. McKenney, Jason Baron,
Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik,
Shrikanth Hegde
I had to look into objtool itself to understand what this warning was
about; make it more explicit.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
tools/objtool/check.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 7c82934247484..418dce921e48d 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3426,7 +3426,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
if (!target->sec->noinstr) {
- WARN("pv_ops[%d]: %s", idx, target->name);
+ WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
file->pv_ops[idx].clean = false;
}
}
--
2.52.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [RFC PATCH v8 03/10] objtool: Always pass a section to validate_unwind_hints()
2026-03-24 9:47 [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 01/10] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 02/10] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
@ 2026-03-24 9:47 ` Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 04/10] x86/retpoline: Make warn_thunk_thunk .noinstr Valentin Schneider
` (7 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2026-03-24 9:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik,
Shrikanth Hegde
When passing a NULL @sec to validate_unwind_hints(), it is unable to
properly initialize the insn_state->noinstr passed down during
validation. This means we lose noinstr validation of the hints.
That validation currently happens when 'opts.noinstr' is true but
'validate_branch_enabled()' isn't.
In other words, this will run noinstr validation of hints:
$ objtool --noinstr --link [...]
but this won't:
$ objtool --noinstr --link --uaccess [...]
Always pass a valid section to validate_unwind_hints(), so that noinstr
validation of hints happens regardless of the value of
validate_branch_enabled().
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
tools/objtool/check.c | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 418dce921e48d..b6e63d5beecc3 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -4064,13 +4064,8 @@ static int validate_unwind_hints(struct objtool_file *file, struct section *sec)
init_insn_state(file, &state, sec);
- if (sec) {
- sec_for_each_insn(file, sec, insn)
- warnings += validate_unwind_hint(file, insn, &state);
- } else {
- for_each_insn(file, insn)
- warnings += validate_unwind_hint(file, insn, &state);
- }
+ sec_for_each_insn(file, sec, insn)
+ warnings += validate_unwind_hint(file, insn, &state);
return warnings;
}
@@ -4567,6 +4562,21 @@ static int validate_functions(struct objtool_file *file)
return warnings;
}
+static int validate_file_unwind_hints(struct objtool_file *file)
+{
+ struct section *sec;
+ int warnings = 0;
+
+ for_each_sec(file->elf, sec) {
+ if (!is_text_sec(sec))
+ continue;
+
+ warnings += validate_unwind_hints(file, sec);
+ }
+
+ return warnings;
+}
+
static void mark_endbr_used(struct instruction *insn)
{
if (!list_empty(&insn->call_node))
@@ -4976,7 +4986,8 @@ int check(struct objtool_file *file)
int w = 0;
w += validate_functions(file);
- w += validate_unwind_hints(file, NULL);
+ w += validate_file_unwind_hints(file);
+
if (!w)
w += validate_reachable_instructions(file);
--
2.52.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [RFC PATCH v8 04/10] x86/retpoline: Make warn_thunk_thunk .noinstr
2026-03-24 9:47 [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (2 preceding siblings ...)
2026-03-24 9:47 ` [RFC PATCH v8 03/10] objtool: Always pass a section to validate_unwind_hints() Valentin Schneider
@ 2026-03-24 9:47 ` Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 05/10] sched/isolation: Mark housekeeping_overridden key as __ro_after_init Valentin Schneider
` (6 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2026-03-24 9:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik,
Shrikanth Hegde
Objtool now warns about it:
vmlinux.o: warning: objtool: .altinstr_replacement+0x28e1: call to warn_thunk_thunk() leaves .noinstr.text section
Mark it noinstr.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
arch/x86/entry/entry.S | 3 ++-
arch/x86/kernel/cpu/bugs.c | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/entry.S b/arch/x86/entry/entry.S
index 6ba2b3adcef0f..e76560f86b332 100644
--- a/arch/x86/entry/entry.S
+++ b/arch/x86/entry/entry.S
@@ -40,6 +40,8 @@ SYM_FUNC_START(__WARN_trap)
SYM_FUNC_END(__WARN_trap)
EXPORT_SYMBOL(__WARN_trap)
+THUNK warn_thunk_thunk, __warn_thunk
+
.popsection
/*
@@ -60,7 +62,6 @@ EXPORT_SYMBOL_FOR_KVM(x86_verw_sel);
.popsection
-THUNK warn_thunk_thunk, __warn_thunk
/*
* Clang's implementation of TLS stack cookies requires the variable in
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index d0a2847a4bb05..1ddf9355a37af 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -3732,7 +3732,7 @@ ssize_t cpu_show_vmscape(struct device *dev, struct device_attribute *attr, char
}
#endif
-void __warn_thunk(void)
+void noinstr __warn_thunk(void)
{
WARN_ONCE(1, "Unpatched return thunk in use. This should not happen!\n");
}
--
2.52.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [RFC PATCH v8 05/10] sched/isolation: Mark housekeeping_overridden key as __ro_after_init
2026-03-24 9:47 [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (3 preceding siblings ...)
2026-03-24 9:47 ` [RFC PATCH v8 04/10] x86/retpoline: Make warn_thunk_thunk .noinstr Valentin Schneider
@ 2026-03-24 9:47 ` Valentin Schneider
2026-03-24 15:17 ` Shrikanth Hegde
2026-03-24 9:47 ` [RFC PATCH v8 06/10] objtool: Add .entry.text validation for static branches Valentin Schneider
` (5 subsequent siblings)
10 siblings, 1 reply; 14+ messages in thread
From: Valentin Schneider @ 2026-03-24 9:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik,
Shrikanth Hegde
housekeeping_overridden is only ever enabled in the __init function
housekeeping_init(), and is never disabled. Mark it __ro_after_init.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
kernel/sched/isolation.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 3ad0d6df6a0a2..54d1d93cdeea5 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -16,7 +16,7 @@ enum hk_flags {
HK_FLAG_KERNEL_NOISE = BIT(HK_TYPE_KERNEL_NOISE),
};
-DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
+DEFINE_STATIC_KEY_FALSE_RO(housekeeping_overridden);
EXPORT_SYMBOL_GPL(housekeeping_overridden);
struct housekeeping {
--
2.52.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [RFC PATCH v8 06/10] objtool: Add .entry.text validation for static branches
2026-03-24 9:47 [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (4 preceding siblings ...)
2026-03-24 9:47 ` [RFC PATCH v8 05/10] sched/isolation: Mark housekeeping_overridden key as __ro_after_init Valentin Schneider
@ 2026-03-24 9:47 ` Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 07/10] x86/jump_label: Add ASM support for static_branch_likely() Valentin Schneider
` (4 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2026-03-24 9:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
Arnaldo Carvalho de Melo, Paolo Bonzini, Arnd Bergmann,
Frederic Weisbecker, Paul E. McKenney, Jason Baron,
Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik,
Shrikanth Hegde
From: Josh Poimboeuf <jpoimboe@kernel.org>
Warn about static branches in entry text, unless the corresponding key is
RO-after-init.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
[Reduced to only .entry.text rather than .noinstr]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
include/linux/jump_label.h | 11 +++--
include/linux/objtool.h | 16 ++++++
tools/objtool/Documentation/objtool.txt | 12 +++++
tools/objtool/check.c | 65 ++++++++++++++++++++++++-
tools/objtool/include/objtool/check.h | 2 +
tools/objtool/include/objtool/elf.h | 3 +-
tools/objtool/include/objtool/special.h | 1 +
tools/objtool/special.c | 15 +++++-
8 files changed, 118 insertions(+), 7 deletions(-)
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index fdb79dd1ebd8c..9f05338a2f798 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -76,6 +76,7 @@
#include <linux/types.h>
#include <linux/compiler.h>
#include <linux/cleanup.h>
+#include <linux/objtool.h>
extern bool static_key_initialized;
@@ -376,8 +377,9 @@ struct static_key_false {
#define DEFINE_STATIC_KEY_TRUE(name) \
struct static_key_true name = STATIC_KEY_TRUE_INIT
-#define DEFINE_STATIC_KEY_TRUE_RO(name) \
- struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT
+#define DEFINE_STATIC_KEY_TRUE_RO(name) \
+ struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT; \
+ ANNOTATE_ENTRY_ALLOWED(name)
#define DECLARE_STATIC_KEY_TRUE(name) \
extern struct static_key_true name
@@ -385,8 +387,9 @@ struct static_key_false {
#define DEFINE_STATIC_KEY_FALSE(name) \
struct static_key_false name = STATIC_KEY_FALSE_INIT
-#define DEFINE_STATIC_KEY_FALSE_RO(name) \
- struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT
+#define DEFINE_STATIC_KEY_FALSE_RO(name) \
+ struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT; \
+ ANNOTATE_ENTRY_ALLOWED(name)
#define DECLARE_STATIC_KEY_FALSE(name) \
extern struct static_key_false name
diff --git a/include/linux/objtool.h b/include/linux/objtool.h
index 9a00e701454c5..d738450897b3b 100644
--- a/include/linux/objtool.h
+++ b/include/linux/objtool.h
@@ -34,6 +34,19 @@
static void __used __section(".discard.func_stack_frame_non_standard") \
*__func_stack_frame_non_standard_##func = func
+#define __ANNOTATE_ENTRY_ALLOWED(key) \
+ static void __used __section(".discard.entry_allowed") \
+ *__annotate_entry_allowed_##key = &key
+
+/*
+ * This is used to tell objtool that a given static key is safe to be used
+ * within .noinstr code, and it doesn't need to generate a warning about it.
+ *
+ * For more information, see tools/objtool/Documentation/objtool.txt,
+ * "non-RO static key usage in entry code"
+ */
+#define ANNOTATE_ENTRY_ALLOWED(key) __ANNOTATE_ENTRY_ALLOWED(key)
+
/*
* STACK_FRAME_NON_STANDARD_FP() is a frame-pointer-specific function ignore
* for the case where a function is intentionally missing frame pointer setup,
@@ -111,6 +124,9 @@
#define UNWIND_HINT(type, sp_reg, sp_offset, signal) "\n\t"
#define STACK_FRAME_NON_STANDARD(func)
#define STACK_FRAME_NON_STANDARD_FP(func)
+#define __ASM_ANNOTATE(label, type) ""
+#define ASM_ANNOTATE(type)
+#define ANNOTATE_ENTRY_ALLOWED(key)
#else
.macro UNWIND_HINT type:req sp_reg=0 sp_offset=0 signal=0
.endm
diff --git a/tools/objtool/Documentation/objtool.txt b/tools/objtool/Documentation/objtool.txt
index 9e97fc25b2d8a..72fd8cbf56abc 100644
--- a/tools/objtool/Documentation/objtool.txt
+++ b/tools/objtool/Documentation/objtool.txt
@@ -456,6 +456,18 @@ the objtool maintainers.
these special names and does not use module_init() / module_exit()
macros to create them.
+vmlinux.o: warning: objtool: entry_SYSCALL_64+0x108: housekeeping_overridden: non-RO static key usage in entry code
+
+13. file.o: warning: func()+0x2a: key: non-RO static key usage in entry code
+
+ This means that .entry.text function func() uses a static key named 'key'
+ which can be modified at runtime. This is discouraged because the jump
location may be accessed before a serializing operation has been
+ executed.
+
+ Check whether the static key/call in question is only modified
+ during init. If so, define it as read-only-after-init with
+ DEFINE_STATIC_KEY_*_RO().
If the error doesn't seem to make sense, it could be a bug in objtool.
Feel free to ask objtool maintainers for help.
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index b6e63d5beecc3..a76364eb8a4f5 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -327,8 +327,10 @@ static void init_insn_state(struct objtool_file *file, struct insn_state *state,
memset(state, 0, sizeof(*state));
init_cfi_state(&state->cfi);
- if (opts.noinstr && sec)
+ if (opts.noinstr && sec) {
state->noinstr = sec->noinstr;
+ state->entry = sec->entry;
+ }
}
static struct cfi_state *cfi_alloc(void)
@@ -433,6 +435,9 @@ static int decode_instructions(struct objtool_file *file)
!strncmp(sec->name, ".text..__x86.", 13))
sec->noinstr = true;
+ if (!strcmp(sec->name, ".entry.text"))
+ sec->entry = true;
+
/*
* .init.text code is ran before userspace and thus doesn't
* strictly need retpolines, except for modules which are
@@ -1035,6 +1040,45 @@ static int create_sym_checksum_section(struct objtool_file *file)
static int create_sym_checksum_section(struct objtool_file *file) { return -EINVAL; }
#endif
+static int read_entry_allowed(struct objtool_file *file)
+{
+ struct section *rsec;
+ struct symbol *sym;
+ struct reloc *reloc;
+
+ rsec = find_section_by_name(file->elf, ".rela.discard.entry_allowed");
+ if (!rsec)
+ return 0;
+
+ for_each_reloc(rsec, reloc) {
+ switch (reloc->sym->type) {
+ case STT_OBJECT:
+ case STT_FUNC:
+ sym = reloc->sym;
+ break;
+
+ case STT_SECTION:
+ sym = find_symbol_by_offset(reloc->sym->sec,
+ reloc_addend(reloc));
+ if (!sym) {
+ WARN_FUNC(reloc->sym->sec, reloc_addend(reloc),
+ "can't find static key/call symbol");
+ return -1;
+ }
+ break;
+
+ default:
+ WARN("unexpected relocation symbol type in %s: %d",
+ rsec->name, reloc->sym->type);
+ return -1;
+ }
+
+ sym->entry_allowed = 1;
+ }
+
+ return 0;
+}
+
/*
* Warnings shouldn't be reported for ignored functions.
*/
@@ -1878,6 +1922,8 @@ static int handle_jump_alt(struct objtool_file *file,
return -1;
}
+ orig_insn->key = special_alt->key;
+
if (opts.hack_jump_label && special_alt->key_addend & 2) {
struct reloc *reloc = insn_reloc(file, orig_insn);
@@ -2660,6 +2706,9 @@ static int decode_sections(struct objtool_file *file)
if (read_annotate(file, __annotate_late))
return -1;
+ if (read_entry_allowed(file))
+ return -1;
+
return 0;
}
@@ -3544,6 +3593,17 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct
return 0;
}
+static int validate_static_key(struct instruction *insn, struct insn_state *state)
+{
+ if (state->entry && !insn->key->entry_allowed) {
+ WARN_INSN(insn, "%s: non-RO static key usage in entry code",
+ insn->key->name);
+ return 1;
+ }
+
+ return 0;
+}
+
static struct instruction *next_insn_to_validate(struct objtool_file *file,
struct instruction *insn)
{
@@ -3807,6 +3867,9 @@ static int validate_insn(struct objtool_file *file, struct symbol *func,
if (handle_insn_ops(insn, next_insn, statep))
return 1;
+ if (insn->key)
+ validate_static_key(insn, statep);
+
switch (insn->type) {
case INSN_RETURN:
diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h
index 2e1346ad5e926..78bf8191be18d 100644
--- a/tools/objtool/include/objtool/check.h
+++ b/tools/objtool/include/objtool/check.h
@@ -16,6 +16,7 @@ struct insn_state {
bool uaccess;
bool df;
bool noinstr;
+ bool entry;
s8 instr;
};
@@ -97,6 +98,7 @@ struct instruction {
struct symbol *sym;
struct stack_op *stack_ops;
struct cfi_state *cfi;
+ struct symbol *key;
};
static inline struct symbol *insn_func(struct instruction *insn)
diff --git a/tools/objtool/include/objtool/elf.h b/tools/objtool/include/objtool/elf.h
index e12c516bd3200..9d12f7132311a 100644
--- a/tools/objtool/include/objtool/elf.h
+++ b/tools/objtool/include/objtool/elf.h
@@ -51,7 +51,7 @@ struct section {
Elf_Data *data;
const char *name;
int idx;
- bool _changed, text, rodata, noinstr, init, truncate;
+ bool _changed, text, rodata, noinstr, init, truncate, entry;
struct reloc *relocs;
unsigned long nr_alloc_relocs;
struct section *twin;
@@ -89,6 +89,7 @@ struct symbol {
u8 changed : 1;
u8 included : 1;
u8 klp : 1;
+ u8 entry_allowed : 1;
struct list_head pv_target;
struct reloc *relocs;
struct section *group_sec;
diff --git a/tools/objtool/include/objtool/special.h b/tools/objtool/include/objtool/special.h
index 121c3761899c1..2298586a75479 100644
--- a/tools/objtool/include/objtool/special.h
+++ b/tools/objtool/include/objtool/special.h
@@ -18,6 +18,7 @@ struct special_alt {
bool group;
bool jump_or_nop;
u8 key_addend;
+ struct symbol *key;
struct section *orig_sec;
unsigned long orig_off;
diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index 2a533afbc69aa..adec1d0d8a5fe 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -111,13 +111,26 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
if (entry->key) {
struct reloc *key_reloc;
+ struct symbol *key;
+ s64 key_addend;
key_reloc = find_reloc_by_dest(elf, sec, offset + entry->key);
if (!key_reloc) {
ERROR_FUNC(sec, offset + entry->key, "can't find key reloc");
return -1;
}
- alt->key_addend = reloc_addend(key_reloc);
+
+ key = key_reloc->sym;
+ key_addend = reloc_addend(key_reloc);
+
+ if (key->type == STT_SECTION)
+ key = find_symbol_by_offset(key->sec, key_addend & ~3);
+
+ /* embedded keys not supported */
+ if (key) {
+ alt->key = key;
+ alt->key_addend = key_addend;
+ }
}
return 0;
--
2.52.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [RFC PATCH v8 07/10] x86/jump_label: Add ASM support for static_branch_likely()
2026-03-24 9:47 [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (5 preceding siblings ...)
2026-03-24 9:47 ` [RFC PATCH v8 06/10] objtool: Add .entry.text validation for static branches Valentin Schneider
@ 2026-03-24 9:47 ` Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 08/10] x86/mm/pti: Introduce a kernel/user CR3 software signal Valentin Schneider
` (3 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2026-03-24 9:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Lutomirski,
Peter Zijlstra, Arnaldo Carvalho de Melo, Josh Poimboeuf,
Paolo Bonzini, Arnd Bergmann, Paul E. McKenney, Jason Baron,
Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik,
Shrikanth Hegde
A later commit will add some early entry code that only needs to be
executed if nohz_full is present on the cmdline, not just if
CONFIG_NO_HZ_FULL is compiled in. Add an ASM-callable static branch macro.
Note that I haven't found a way to express unlikely (i.e. out-of-line)
static branches in ASM macros without using extra jumps, which kind of
defeats the purpose. Consider:
.macro FOOBAR
// Key enabled: JMP .Ldostuff_\@
// Key disabled: NOP
STATIC_BRANCH_UNLIKELY key, .Ldostuff_\@ // Patched to JMP if enabled
jmp .Lend_\@
.Ldostuff_\@:
<dostuff>
.Lend_\@:
.endm
Instead, this should be expressed as a likely (i.e. in-line) static key:
.macro FOOBAR
// Key enabled: NOP
// Key disabled: JMP .Lend_\@
STATIC_BRANCH_LIKELY key, .Lend_\@ // Patched to NOP if enabled
<dostuff>
.Lend_\@:
.endm
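For reference, this matches how the generic jump-label core picks the
instruction at a site: jump_label_type() in kernel/jump_label.c computes
`enabled ^ branch`. A minimal C model of that truth table (identifier names
are mine, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of jump_label_type() from kernel/jump_label.c:
 *   branch == true  -> site emitted in the likely/inline form (default NOP)
 *   branch == false -> site emitted in the unlikely form (default JMP)
 * The site holds a JMP exactly when the key state and the emitted form
 * disagree.
 */
enum site_insn { SITE_NOP = 0, SITE_JMP = 1 };

static enum site_insn site_type(bool key_enabled, bool branch)
{
	return (key_enabled ^ branch) ? SITE_JMP : SITE_NOP;
}
```

With branch == true (the only form the LIKELY macros above can express), an
enabled key yields a NOP and a disabled key a JMP over the guarded code,
exactly as the comments in the macro example state.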
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
arch/x86/include/asm/jump_label.h | 33 ++++++++++++++++++++++++++++++-
1 file changed, 32 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 05b16299588d5..ea587598abe7c 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -7,7 +7,38 @@
#include <asm/asm.h>
#include <asm/nops.h>
-#ifndef __ASSEMBLER__
+#ifdef __ASSEMBLER__
+
+/*
+ * There isn't a neat way to craft unlikely static branches in ASM, so they
+ * all have to be expressed as likely (inline) static branches. This macro
+ * thus assumes a "likely" usage.
+ */
+.macro ARCH_STATIC_BRANCH_LIKELY_ASM key, label, jump, hack
+1:
+.if \jump || \hack
+ jmp \label
+.else
+ .byte BYTES_NOP5
+.endif
+ .pushsection __jump_table, "aw"
+ _ASM_ALIGN
+ .long 1b - .
+ .long \label - .
+ /* LIKELY so bit0=1, bit1=hack */
+ _ASM_PTR \key + 1 + (\hack << 1) - .
+ .popsection
+.endm
+
+.macro STATIC_BRANCH_TRUE_LIKELY key, label
+ ARCH_STATIC_BRANCH_LIKELY_ASM \key, \label, 0, IS_ENABLED(CONFIG_HAVE_JUMP_LABEL_HACK)
+.endm
+
+.macro STATIC_BRANCH_FALSE_LIKELY key, label
+ ARCH_STATIC_BRANCH_LIKELY_ASM \key, \label, 1, 0
+.endm
+
+#else /* !__ASSEMBLER__ */
#include <linux/stringify.h>
#include <linux/types.h>
--
2.52.0
* [RFC PATCH v8 08/10] x86/mm/pti: Introduce a kernel/user CR3 software signal
2026-03-24 9:47 [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (6 preceding siblings ...)
2026-03-24 9:47 ` [RFC PATCH v8 07/10] x86/jump_label: Add ASM support for static_branch_likely() Valentin Schneider
@ 2026-03-24 9:47 ` Valentin Schneider
2026-03-24 9:48 ` [RFC PATCH v8 09/10] context_tracking,x86: Defer kernel text patching IPIs when tracking CR3 switches Valentin Schneider
` (2 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2026-03-24 9:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik,
Shrikanth Hegde
Later commits will rely on being able to check whether a remote CPU is
using the kernel or the user CR3.
This software signal needs to be updated before the actual CR3 write, IOW
it always immediately precedes it:
KERNEL_CR3_LOADED := 1
SWITCH_TO_KERNEL_CR3
[...]
KERNEL_CR3_LOADED := 0
SWITCH_TO_USER_CR3
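Because the store precedes the CR3 write in both directions, the only windows
where the signal and the register disagree are spent in .entry.text, which the
earlier objtool patches validate to be free of runtime-patchable instructions.
A toy C walk-through of the two sequences above, checking that invariant at
every step (field and function names are mine, purely illustrative):

```c
#include <assert.h>
#include <stdbool.h>

enum which_cr3 { USER_CR3, KERNEL_CR3 };

struct cpu {
	bool kernel_cr3_loaded;	/* the software signal */
	enum which_cr3 cr3;	/* the actual register */
	bool in_entry_text;	/* executing (validated) .entry.text */
};

/* Patchable (non-entry) kernel text must only ever run with the signal set. */
static void check(const struct cpu *c)
{
	assert(c->in_entry_text || c->cr3 != KERNEL_CR3 || c->kernel_cr3_loaded);
}

static void enter_kernel(struct cpu *c)
{
	c->in_entry_text = true;	check(c); /* HW vectors to entry text */
	c->kernel_cr3_loaded = true;	check(c); /* KERNEL_CR3_LOADED := 1   */
	c->cr3 = KERNEL_CR3;		check(c); /* SWITCH_TO_KERNEL_CR3     */
	c->in_entry_text = false;	check(c); /* into patchable code      */
}

static void exit_to_user(struct cpu *c)
{
	c->in_entry_text = true;	check(c); /* back in entry text       */
	c->kernel_cr3_loaded = false;	check(c); /* KERNEL_CR3_LOADED := 0   */
	c->cr3 = USER_CR3;		check(c); /* SWITCH_TO_USER_CR3       */
	c->in_entry_text = false;	check(c); /* userspace                */
}
```

Swapping either pair of statements makes check() fire mid-transition, which is
why the store must always immediately precede the CR3 write.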
The variable also gets mapped into the user space visible pages.
I tried really hard not to do that, and at some point had something mostly
working that aliased it through the cpu_entry_area, accessed like so before
the switch to the kernel CR3:
subq $10, %rsp
sgdt (%rsp)
movq 2(%rsp), \scratch_reg /* GDT address */
addq $10, %rsp
movl $1, CPU_ENTRY_AREA_kernel_cr3(\scratch_reg)
however this explodes when running 64-bit user code that invokes SYSCALL,
since the scratch reg is %rsp itself, and I figured this was enough headaches.
This is only really useful for NOHZ_FULL CPUs, but it should be cheaper to
unconditionally update a per-CPU variable living in its own cacheline (and
never read otherwise) than to check a shared cpumask such as
  housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
at every entry.
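On the observer side, the resulting send-or-defer decision (used by later
patches in this series, cf. do_sync_core_defer_cond()) reduces to a
disjunction; a minimal C sketch, with plain bools standing in for
housekeeping_cpu() and the remote per-CPU read:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the IPI condition used later in the series: the IPI may be
 * deferred only for an isolated (non-housekeeping) CPU that is currently
 * running with the user CR3 loaded.
 */
static bool must_send_ipi(bool is_housekeeping, bool kernel_cr3_loaded)
{
	return is_housekeeping || kernel_cr3_loaded;
}
```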
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
arch/x86/Kconfig | 14 +++++++++++++
arch/x86/entry/calling.h | 13 ++++++++++++
arch/x86/entry/syscall_64.c | 4 ++++
arch/x86/include/asm/tlbflush.h | 3 +++
arch/x86/mm/pti.c | 36 ++++++++++++++++++++++-----------
5 files changed, 58 insertions(+), 12 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 80527299f859a..f680e83cd5962 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2192,6 +2192,20 @@ config ADDRESS_MASKING
The capability can be used for efficient address sanitizers (ASAN)
implementation and for optimizations in JITs.
+config TRACK_CR3
+ def_bool n
+ prompt "Track which CR3 is in use"
+ depends on X86_64 && MITIGATION_PAGE_TABLE_ISOLATION && NO_HZ_FULL
+ help
+ This option adds a software signal that allows checking remotely
+ whether a CPU is using the user or the kernel page table.
+
+ This allows further optimizations for NOHZ_FULL CPUs.
+
+ This obviously makes the user<->kernel transition overhead even worse.
+
+ If unsure, say N.
+
config HOTPLUG_CPU
def_bool y
depends on SMP
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 77e2d920a6407..4099b7d86efd9 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -9,6 +9,7 @@
#include <asm/ptrace-abi.h>
#include <asm/msr.h>
#include <asm/nospec-branch.h>
+#include <asm/jump_label.h>
/*
@@ -170,8 +171,17 @@ For 32-bit we have the following conventions - kernel is built with
andq $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
.endm
+.macro NOTE_CR3_SWITCH scratch_reg:req in_kernel:req
+#ifdef CONFIG_TRACK_CR3
+ STATIC_BRANCH_FALSE_LIKELY housekeeping_overridden, .Lend_\@
+ movl \in_kernel, PER_CPU_VAR(kernel_cr3_loaded)
+.Lend_\@:
+#endif // CONFIG_TRACK_CR3
+.endm
+
.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+ NOTE_CR3_SWITCH \scratch_reg $1
mov %cr3, \scratch_reg
ADJUST_KERNEL_CR3 \scratch_reg
mov \scratch_reg, %cr3
@@ -182,6 +192,7 @@ For 32-bit we have the following conventions - kernel is built with
PER_CPU_VAR(cpu_tlbstate + TLB_STATE_user_pcid_flush_mask)
.macro SWITCH_TO_USER_CR3 scratch_reg:req scratch_reg2:req
+ NOTE_CR3_SWITCH \scratch_reg $0
mov %cr3, \scratch_reg
ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
@@ -229,6 +240,7 @@ For 32-bit we have the following conventions - kernel is built with
.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
+ NOTE_CR3_SWITCH \scratch_reg $1
movq %cr3, \scratch_reg
movq \scratch_reg, \save_reg
/*
@@ -257,6 +269,7 @@ For 32-bit we have the following conventions - kernel is built with
bt $PTI_USER_PGTABLE_BIT, \save_reg
jnc .Lend_\@
+ NOTE_CR3_SWITCH \scratch_reg $0
ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
/*
diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index b6e68ea98b839..7583f71978856 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -83,6 +83,10 @@ static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
return false;
}
+#ifdef CONFIG_TRACK_CR3
+DEFINE_PER_CPU_PAGE_ALIGNED(bool, kernel_cr3_loaded) = true;
+#endif
+
/* Returns true to return using SYSRET, or false to use IRET */
__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
{
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 00daedfefc1b0..3b3aceee701e6 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -17,6 +17,9 @@
#include <asm/pgtable.h>
DECLARE_PER_CPU(u64, tlbstate_untag_mask);
+#ifdef CONFIG_TRACK_CR3
+DECLARE_PER_CPU_PAGE_ALIGNED(bool, kernel_cr3_loaded);
+#endif
void __flush_tlb_all(void);
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index f7546e9e8e896..e75450cabd3a6 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -440,6 +440,18 @@ static void __init pti_clone_p4d(unsigned long addr)
*user_p4d = *kernel_p4d;
}
+static void __init pti_clone_percpu(unsigned long va)
+{
+ phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
+ pte_t *target_pte;
+
+ target_pte = pti_user_pagetable_walk_pte(va, false);
+ if (WARN_ON(!target_pte))
+ return;
+
+ *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+}
+
/*
* Clone the CPU_ENTRY_AREA and associated data into the user space visible
* page table.
@@ -450,25 +462,25 @@ static void __init pti_clone_user_shared(void)
pti_clone_p4d(CPU_ENTRY_AREA_BASE);
+ /*
+ * This is done for all possible CPUs during boot to ensure that it's
+ * propagated to all mms.
+ */
for_each_possible_cpu(cpu) {
/*
* The SYSCALL64 entry code needs one word of scratch space
* in which to spill a register. It lives in the sp2 slot
* of the CPU's TSS.
- *
- * This is done for all possible CPUs during boot to ensure
- * that it's propagated to all mms.
*/
+ pti_clone_percpu((unsigned long)&per_cpu(cpu_tss_rw, cpu));
- unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
- phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
- pte_t *target_pte;
-
- target_pte = pti_user_pagetable_walk_pte(va, false);
- if (WARN_ON(!target_pte))
- return;
-
- *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+#ifdef CONFIG_TRACK_CR3
+ /*
+ * The entry code needs access to the @kernel_cr3_loaded percpu
+ * variable before the kernel CR3 is loaded.
+ */
+ pti_clone_percpu((unsigned long)&per_cpu(kernel_cr3_loaded, cpu));
+#endif
}
}
--
2.52.0
* [RFC PATCH v8 09/10] context_tracking,x86: Defer kernel text patching IPIs when tracking CR3 switches
2026-03-24 9:47 [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (7 preceding siblings ...)
2026-03-24 9:47 ` [RFC PATCH v8 08/10] x86/mm/pti: Introduce a kernel/user CR3 software signal Valentin Schneider
@ 2026-03-24 9:48 ` Valentin Schneider
2026-03-24 9:48 ` [RFC PATCH v8 10/10] x86/mm, mm/vmalloc: Defer kernel TLB flush " Valentin Schneider
2026-03-24 15:01 ` [syzbot ci] Re: context_tracking,x86: Defer some IPIs until a user->kernel transition syzbot ci
10 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2026-03-24 9:48 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: Peter Zijlstra (Intel), Nicolas Saenz Julienne, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Andy Lutomirski, Arnaldo Carvalho de Melo, Josh Poimboeuf,
Paolo Bonzini, Arnd Bergmann, Frederic Weisbecker,
Paul E. McKenney, Jason Baron, Steven Rostedt, Ard Biesheuvel,
Sami Tolvanen, David S. Miller, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers,
Mel Gorman, Andrew Morton, Masahiro Yamada, Han Shen,
Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov, Juri Lelli,
Clark Williams, Tomas Glozar, Yair Podemsky, Marcelo Tosatti,
Daniel Wagner, Petr Tesarik, Shrikanth Hegde
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
them vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.
As the synchronization IPIs are sent using a blocking call, returning from
text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is deferred.
In other words, to safely defer this synchronization, any kernel
instruction leading to the execution of the deferred instruction
sync must *not* be mutable (patchable) at runtime.
This means we must pay attention to mutable instructions in the early entry
code:
- alternatives
- static keys
- static calls
- all sorts of probes (kprobes/ftrace/bpf/???)
The early entry code is noinstr, which gets rid of the probes.
Alternatives are safe, because it's boot-time patching (before SMP is
even brought up) which is before any IPI deferral can happen.
This leaves us with static keys and static calls. Any static key used in
early entry code should be only forever-enabled at boot time, IOW
__ro_after_init (pretty much like alternatives). Exceptions to that will
now be caught by objtool.
The instruction sync for the deferred case is provided by the CR3 RMW done
as part of kPTI when switching to the kernel page table:
SDM vol2 chapter 4.3 - Move to/from control registers:
```
MOV CR* instructions, except for MOV CR8, are serializing instructions.
```
Leverage the new kernel_cr3_loaded signal and the kPTI CR3 RMW to defer
sync_core() IPIs targeting NOHZ_FULL CPUs.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
arch/x86/include/asm/text-patching.h | 5 ++++
arch/x86/kernel/alternative.c | 34 +++++++++++++++++++++++-----
arch/x86/kernel/kprobes/core.c | 4 ++--
arch/x86/kernel/kprobes/opt.c | 4 ++--
arch/x86/kernel/module.c | 2 +-
5 files changed, 38 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index f2d142a0a862e..628e80f8318cd 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -33,6 +33,11 @@ extern void text_poke_apply_relocation(u8 *buf, const u8 * const instr, size_t i
*/
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern void smp_text_poke_sync_each_cpu(void);
+#ifdef CONFIG_TRACK_CR3
+extern void smp_text_poke_sync_each_cpu_deferrable(void);
+#else
+#define smp_text_poke_sync_each_cpu_deferrable smp_text_poke_sync_each_cpu
+#endif
extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
#define text_poke_copy text_poke_copy
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 28518371d8bf3..f3af77d7c533c 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -6,6 +6,7 @@
#include <linux/vmalloc.h>
#include <linux/memory.h>
#include <linux/execmem.h>
+#include <linux/sched/isolation.h>
#include <asm/text-patching.h>
#include <asm/insn.h>
@@ -13,6 +14,7 @@
#include <asm/ibt.h>
#include <asm/set_memory.h>
#include <asm/nmi.h>
+#include <asm/tlbflush.h>
int __read_mostly alternatives_patched;
@@ -2706,11 +2708,29 @@ static void do_sync_core(void *info)
sync_core();
}
+static void __smp_text_poke_sync_each_cpu(smp_cond_func_t cond_func)
+{
+ on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
void smp_text_poke_sync_each_cpu(void)
{
- on_each_cpu(do_sync_core, NULL, 1);
+ __smp_text_poke_sync_each_cpu(NULL);
+}
+
+#ifdef CONFIG_TRACK_CR3
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+ return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
+ per_cpu(kernel_cr3_loaded, cpu);
}
+void smp_text_poke_sync_each_cpu_deferrable(void)
+{
+ __smp_text_poke_sync_each_cpu(do_sync_core_defer_cond);
+}
+#endif
+
/*
* NOTE: crazy scheme to allow patching Jcc.d32 but not increase the size of
* this thing. When len == 6 everything is prefixed with 0x0f and we map
@@ -2914,11 +2934,13 @@ void smp_text_poke_batch_finish(void)
* First step: add a INT3 trap to the address that will be patched.
*/
for (i = 0; i < text_poke_array.nr_entries; i++) {
- text_poke_array.vec[i].old = *(u8 *)text_poke_addr(&text_poke_array.vec[i]);
- text_poke(text_poke_addr(&text_poke_array.vec[i]), &int3, INT3_INSN_SIZE);
+ void *addr = text_poke_addr(&text_poke_array.vec[i]);
+
+ text_poke_array.vec[i].old = *((u8 *)addr);
+ text_poke(addr, &int3, INT3_INSN_SIZE);
}
- smp_text_poke_sync_each_cpu();
+ smp_text_poke_sync_each_cpu_deferrable();
/*
* Second step: update all but the first byte of the patched range.
@@ -2980,7 +3002,7 @@ void smp_text_poke_batch_finish(void)
* not necessary and we'd be safe even without it. But
* better safe than sorry (plus there's not only Intel).
*/
- smp_text_poke_sync_each_cpu();
+ smp_text_poke_sync_each_cpu_deferrable();
}
/*
@@ -3001,7 +3023,7 @@ void smp_text_poke_batch_finish(void)
}
if (do_sync)
- smp_text_poke_sync_each_cpu();
+ smp_text_poke_sync_each_cpu_deferrable();
/*
* Remove and wait for refs to be zero.
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index c1fac3a9fecc2..61a93ba30f255 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -789,7 +789,7 @@ void arch_arm_kprobe(struct kprobe *p)
u8 int3 = INT3_INSN_OPCODE;
text_poke(p->addr, &int3, 1);
- smp_text_poke_sync_each_cpu();
+ smp_text_poke_sync_each_cpu_deferrable();
perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
}
@@ -799,7 +799,7 @@ void arch_disarm_kprobe(struct kprobe *p)
perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
text_poke(p->addr, &p->opcode, 1);
- smp_text_poke_sync_each_cpu();
+ smp_text_poke_sync_each_cpu_deferrable();
}
void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 6f826a00eca29..3b3be66da320c 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -509,11 +509,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
JMP32_INSN_SIZE - INT3_INSN_SIZE);
text_poke(addr, new, INT3_INSN_SIZE);
- smp_text_poke_sync_each_cpu();
+ smp_text_poke_sync_each_cpu_deferrable();
text_poke(addr + INT3_INSN_SIZE,
new + INT3_INSN_SIZE,
JMP32_INSN_SIZE - INT3_INSN_SIZE);
- smp_text_poke_sync_each_cpu();
+ smp_text_poke_sync_each_cpu_deferrable();
perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
}
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 11c45ce42694c..0894b1f38de77 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -209,7 +209,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
write, apply);
if (!early) {
- smp_text_poke_sync_each_cpu();
+ smp_text_poke_sync_each_cpu_deferrable();
mutex_unlock(&text_mutex);
}
--
2.52.0
* [RFC PATCH v8 10/10] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs when tracking CR3 switches
2026-03-24 9:47 [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (8 preceding siblings ...)
2026-03-24 9:48 ` [RFC PATCH v8 09/10] context_tracking,x86: Defer kernel text patching IPIs when tracking CR3 switches Valentin Schneider
@ 2026-03-24 9:48 ` Valentin Schneider
2026-03-24 15:01 ` [syzbot ci] Re: context_tracking,x86: Defer some IPIs until a user->kernel transition syzbot ci
10 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2026-03-24 9:48 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik,
Shrikanth Hegde
Previous commits have added a software signal that tracks which CR3 (kernel
or user) is in use for any given CPU.
Combined with:
o the CR3 switch itself being a flush for non-global mappings
o global mappings under kPTI being limited to the CEA and entry text
we now have a way to safely defer (kernel) TLB flush IPIs targeting
NOHZ_FULL CPUs executing in userspace (i.e. with the user CR3 loaded).
When sending a kernel TLB flush IPI to a NOHZ_FULL CPU, check whether it is
using the user CR3, and if it is, do not interrupt it and instead rely on
the CR3 write that happens when switching to the kernel CR3.
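The reason that CR3 write can stand in for the flush IPI: a MOV to CR3
(absent the PCID noflush bit) invalidates all non-global TLB entries, per the
SDM (Vol. 3, "Invalidation of TLBs and Paging-Structure Caches"), and under
kPTI the only global kernel mappings are the cpu_entry_area and entry text,
which are never vunmap()'d. A toy model of that architectural rule (purely
illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy TLB entry: a cached translation, possibly marked _PAGE_GLOBAL. */
struct tlb_entry {
	bool global;
	bool valid;
};

/*
 * Model of a MOV to CR3 without a noflush/PCID hint: every non-global
 * cached translation is dropped; global ones survive.
 */
static void mov_to_cr3(struct tlb_entry *tlb, int n)
{
	for (int i = 0; i < n; i++)
		if (!tlb[i].global)
			tlb[i].valid = false;
}
```

A stale vmalloc translation in this model is a non-global entry, so it cannot
survive the user->kernel CR3 switch that ends the deferral window.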
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
arch/x86/include/asm/tlbflush.h | 1 +
arch/x86/mm/tlb.c | 34 ++++++++++++++++++++++++++-------
mm/vmalloc.c | 30 ++++++++++++++++++++++++-----
3 files changed, 53 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3b3aceee701e6..8bae150206665 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -22,6 +22,7 @@ DECLARE_PER_CPU_PAGE_ALIGNED(bool, kernel_cr3_loaded);
#endif
void __flush_tlb_all(void);
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
#define TLB_FLUSH_ALL -1UL
#define TLB_GENERATION_INVALID 0
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f5b93e01e3472..e08f16474f074 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -13,6 +13,7 @@
#include <linux/mmu_notifier.h>
#include <linux/mmu_context.h>
#include <linux/kvm_types.h>
+#include <linux/sched/isolation.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -1530,23 +1531,24 @@ static void do_kernel_range_flush(void *info)
flush_tlb_one_kernel(addr);
}
-static void kernel_tlb_flush_all(struct flush_tlb_info *info)
+static void kernel_tlb_flush_all(smp_cond_func_t cond, struct flush_tlb_info *info)
{
if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
invlpgb_flush_all();
else
- on_each_cpu(do_flush_tlb_all, NULL, 1);
+ on_each_cpu_cond(cond, do_flush_tlb_all, NULL, 1);
}
-static void kernel_tlb_flush_range(struct flush_tlb_info *info)
+static void kernel_tlb_flush_range(smp_cond_func_t cond, struct flush_tlb_info *info)
{
if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
invlpgb_kernel_range_flush(info);
else
- on_each_cpu(do_kernel_range_flush, info, 1);
+ on_each_cpu_cond(cond, do_kernel_range_flush, info, 1);
}
-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond, unsigned long start, unsigned long end)
{
struct flush_tlb_info *info;
@@ -1556,13 +1558,31 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
TLB_GENERATION_INVALID);
if (info->end == TLB_FLUSH_ALL)
- kernel_tlb_flush_all(info);
+ kernel_tlb_flush_all(cond, info);
else
- kernel_tlb_flush_range(info);
+ kernel_tlb_flush_range(cond, info);
put_flush_tlb_info();
}
+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+ __flush_tlb_kernel_range(NULL, start, end);
+}
+
+#ifdef CONFIG_TRACK_CR3
+static bool flush_tlb_kernel_cond(int cpu, void *info)
+{
+ return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
+ per_cpu(kernel_cr3_loaded, cpu);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+ __flush_tlb_kernel_range(flush_tlb_kernel_cond, start, end);
+}
+#endif
+
/*
* This can be used from process context to figure out what the value of
* CR3 is without needing to do a (slow) __read_cr3().
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e286c2d2068cb..55b7bafe26016 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -501,6 +501,26 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
__vunmap_range_noflush(start, end);
}
+/*
+ * !!! BIG FAT WARNING !!!
+ *
+ * The CPU is free to cache any part of the paging hierarchy it wants at any
+ * time. It's also free to set accessed and dirty bits at any time, even for
+ * instructions that may never execute architecturally.
+ *
+ * This means that deferring a TLB flush affecting freed page-table-pages (IOW,
+ * keeping them in a CPU's paging hierarchy cache) is a recipe for disaster.
+ *
+ * This isn't a problem for deferral of TLB flushes in vmalloc, because
+ * page-table-pages used for vmap() mappings are never freed - see how
+ * __vunmap_range_noflush() walks the whole mapping but only clears the leaf PTEs.
+ * If this ever changes, TLB flush deferral will cause misery.
+ */
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+ flush_tlb_kernel_range(start, end);
+}
+
/**
* vunmap_range - unmap kernel virtual addresses
* @addr: start of the VM area to unmap
@@ -514,7 +534,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
{
flush_cache_vunmap(addr, end);
vunmap_range_noflush(addr, end);
- flush_tlb_kernel_range(addr, end);
+ flush_tlb_kernel_range_deferrable(addr, end);
}
static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -2366,7 +2386,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
nr_purge_nodes = cpumask_weight(&purge_nodes);
if (nr_purge_nodes > 0) {
- flush_tlb_kernel_range(start, end);
+ flush_tlb_kernel_range_deferrable(start, end);
/* One extra worker is per a lazy_max_pages() full set minus one. */
nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
@@ -2469,7 +2489,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
flush_cache_vunmap(va->va_start, va->va_end);
vunmap_range_noflush(va->va_start, va->va_end);
if (debug_pagealloc_enabled_static())
- flush_tlb_kernel_range(va->va_start, va->va_end);
+ flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);
free_vmap_area_noflush(va);
}
@@ -2916,7 +2936,7 @@ static void vb_free(unsigned long addr, unsigned long size)
vunmap_range_noflush(addr, addr + size);
if (debug_pagealloc_enabled_static())
- flush_tlb_kernel_range(addr, addr + size);
+ flush_tlb_kernel_range_deferrable(addr, addr + size);
spin_lock(&vb->lock);
@@ -2981,7 +3001,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
free_purged_blocks(&purge_list);
if (!__purge_vmap_area_lazy(start, end, false) && flush)
- flush_tlb_kernel_range(start, end);
+ flush_tlb_kernel_range_deferrable(start, end);
mutex_unlock(&vmap_purge_lock);
}
--
2.52.0
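The deferral pattern this patch relies on can be sketched in plain user-space C: instead of IPI'ing an isolated CPU, record the pending kernel range and let that CPU flush it on its next user->kernel transition. All names below (`struct deferred_flush`, `flush_deferrable()`, `run_pending_flush()`) are hypothetical illustrations, not the series' actual implementation, which drives the deferral from a per-CPU CR3 software signal.

```c
#include <stdbool.h>

/* Toy model of deferring a kernel TLB range flush: rather than
 * interrupting an isolated CPU, remember one coalesced range and
 * "flush" it when that CPU next enters the kernel. */
struct deferred_flush {
	bool pending;
	unsigned long start, end;
};

/* The deferring variant: widen the recorded range to cover the
 * new request instead of flushing immediately. */
static void flush_deferrable(struct deferred_flush *df,
			     unsigned long start, unsigned long end)
{
	if (!df->pending) {
		df->start = start;
		df->end = end;
		df->pending = true;
		return;
	}
	if (start < df->start)
		df->start = start;
	if (end > df->end)
		df->end = end;
}

/* Run at the user->kernel transition; returns true if a flush
 * actually had to be performed. */
static bool run_pending_flush(struct deferred_flush *df)
{
	if (!df->pending)
		return false;
	/* the real code would call flush_tlb_kernel_range(start, end) */
	df->pending = false;
	return true;
}
```

Note the trade-off the cover letter describes: coalescing into a single covering range over-flushes compared to issuing each range separately, which is acceptable precisely because vmap page-table pages are never freed.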
* [syzbot ci] Re: context_tracking,x86: Defer some IPIs until a user->kernel transition
2026-03-24 9:47 [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (9 preceding siblings ...)
2026-03-24 9:48 ` [RFC PATCH v8 10/10] x86/mm, mm/vmalloc: Defer kernel TLB flush " Valentin Schneider
@ 2026-03-24 15:01 ` syzbot ci
10 siblings, 0 replies; 14+ messages in thread
From: syzbot ci @ 2026-03-24 15:01 UTC (permalink / raw)
To: acme, akpm, ardb, arnd, boqun.feng, bp, dan.carpenter,
dave.hansen, davem, dwagner, frederic, hpa, jannh, jbaron,
joelagnelf, josh, jpoimboe, juri.lelli, linux-kernel, linux-mm,
luto, masahiroy, mathieu.desnoyers, mgorman, mingo, mtosatti,
neeraj.upadhyay, nsaenzju, oleg, paulmck, pbonzini, peterz,
ptesarik, riel, rostedt, samitolvanen, shenhan, sshegde, tglozar,
tglx, urezki, vschneid, williams, x86, ypodemsk
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v8] context_tracking,x86: Defer some IPIs until a user->kernel transition
https://lore.kernel.org/all/20260324094801.3092968-1-vschneid@redhat.com
* [RFC PATCH v8 01/10] objtool: Make validate_call() recognize indirect calls to pv_ops[]
* [RFC PATCH v8 02/10] objtool: Flesh out warning related to pv_ops[] calls
* [RFC PATCH v8 03/10] objtool: Always pass a section to validate_unwind_hints()
* [RFC PATCH v8 04/10] x86/retpoline: Make warn_thunk_thunk .noinstr
* [RFC PATCH v8 05/10] sched/isolation: Mark housekeeping_overridden key as __ro_after_init
* [RFC PATCH v8 06/10] objtool: Add .entry.text validation for static branches
* [RFC PATCH v8 07/10] x86/jump_label: Add ASM support for static_branch_likely()
* [RFC PATCH v8 08/10] x86/mm/pti: Introduce a kernel/user CR3 software signal
* [RFC PATCH v8 09/10] context_tracking,x86: Defer kernel text patching IPIs when tracking CR3 switches
* [RFC PATCH v8 10/10] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs when tracking CR3 switches
and found the following issues:
* KASAN: slab-out-of-bounds Read in __dynamic_pr_debug
* KASAN: slab-use-after-free Read in __dynamic_dev_dbg
Full report is available here:
https://ci.syzbot.org/series/e1f9c661-db83-4882-8439-ab6d1b3ffe07
***
KASAN: slab-out-of-bounds Read in __dynamic_pr_debug
tree: linux-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next
base: f9d6fc9557e68b48253818870d002dc4784cb2f1
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/c63282ed-913a-4673-b0e4-cd21246874b2/config
C repro: https://ci.syzbot.org/findings/5b3af331-6a29-4205-911e-8924b9c54449/c_repro
syz repro: https://ci.syzbot.org/findings/5b3af331-6a29-4205-911e-8924b9c54449/syz_repro
==================================================================
BUG: KASAN: slab-out-of-bounds in string_nocheck lib/vsprintf.c:654 [inline]
BUG: KASAN: slab-out-of-bounds in string+0x231/0x2b0 lib/vsprintf.c:736
Read of size 1 at addr ffff8881663cbca1 by task syz.0.17/5964
CPU: 0 UID: 0 PID: 5964 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xba/0x230 mm/kasan/report.c:482
kasan_report+0x117/0x150 mm/kasan/report.c:595
string_nocheck lib/vsprintf.c:654 [inline]
string+0x231/0x2b0 lib/vsprintf.c:736
vsnprintf+0x739/0xee0 lib/vsprintf.c:2947
va_format lib/vsprintf.c:1722 [inline]
pointer+0x9b7/0x11f0 lib/vsprintf.c:2568
vsnprintf+0x614/0xee0 lib/vsprintf.c:2951
vprintk_store+0x371/0xd50 kernel/printk/printk.c:2255
vprintk_emit+0x192/0x560 kernel/printk/printk.c:2402
_printk+0xdd/0x130 kernel/printk/printk.c:2451
__dynamic_pr_debug+0x1a2/0x260 lib/dynamic_debug.c:879
nfc_llcp_wks_sap net/nfc/llcp_core.c:344 [inline]
nfc_llcp_get_sdp_ssap+0x3a5/0x440 net/nfc/llcp_core.c:420
llcp_sock_bind+0x3d6/0x780 net/nfc/llcp_sock.c:114
__sys_bind_socket net/socket.c:1874 [inline]
__sys_bind+0x2e3/0x410 net/socket.c:1905
__do_sys_bind net/socket.c:1910 [inline]
__se_sys_bind net/socket.c:1908 [inline]
__x64_sys_bind+0x7a/0x90 net/socket.c:1908
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xe2/0xf40 arch/x86/entry/syscall_64.c:98
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5158f9c799
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffe7ca57bc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000031
RAX: ffffffffffffffda RBX: 00007f5159215fa0 RCX: 00007f5158f9c799
RDX: 0000000000000060 RSI: 0000200000000080 RDI: 0000000000000004
RBP: 00007f5159032c99 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f5159215fac R14: 00007f5159215fa0 R15: 00007f5159215fa0
</TASK>
Allocated by task 5964:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
__kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
kasan_kmalloc include/linux/kasan.h:263 [inline]
__do_kmalloc_node mm/slub.c:5657 [inline]
__kmalloc_node_track_caller_noprof+0x558/0x7f0 mm/slub.c:5768
kmemdup_noprof+0x2b/0x70 mm/util.c:138
kmemdup_noprof include/linux/fortify-string.h:765 [inline]
llcp_sock_bind+0x392/0x780 net/nfc/llcp_sock.c:107
__sys_bind_socket net/socket.c:1874 [inline]
__sys_bind+0x2e3/0x410 net/socket.c:1905
__do_sys_bind net/socket.c:1910 [inline]
__se_sys_bind net/socket.c:1908 [inline]
__x64_sys_bind+0x7a/0x90 net/socket.c:1908
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xe2/0xf40 arch/x86/entry/syscall_64.c:98
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff8881663cbca0
which belongs to the cache kmalloc-8 of size 8
The buggy address is located 0 bytes to the right of
allocated 1-byte region [ffff8881663cbca0, ffff8881663cbca1)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff8881663cb7e0 pfn:0x1663cb
anon flags: 0x57ff00000000000(node=1|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 057ff00000000000 ffff888100041500 0000000000000000 dead000000000001
raw: ffff8881663cb7e0 0000000080800078 00000000f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 1, tgid 1 (swapper/0), ts 3460727604, free_ts 3423843770
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x228/0x280 mm/page_alloc.c:1884
prep_new_page mm/page_alloc.c:1892 [inline]
get_page_from_freelist+0x24dc/0x2580 mm/page_alloc.c:3945
__alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5240
alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2486
alloc_slab_page mm/slub.c:3075 [inline]
allocate_slab+0x86/0x3a0 mm/slub.c:3248
new_slab mm/slub.c:3302 [inline]
___slab_alloc+0xd82/0x1760 mm/slub.c:4656
__slab_alloc+0x65/0x100 mm/slub.c:4779
__slab_alloc_node mm/slub.c:4855 [inline]
slab_alloc_node mm/slub.c:5251 [inline]
__do_kmalloc_node mm/slub.c:5656 [inline]
__kmalloc_noprof+0x46c/0x7e0 mm/slub.c:5669
kmalloc_noprof include/linux/slab.h:961 [inline]
kzalloc_noprof include/linux/slab.h:1094 [inline]
acpi_ns_internalize_name+0x2c9/0x3e0 drivers/acpi/acpica/nsutils.c:331
acpi_ns_get_node_unlocked+0x186/0x480 drivers/acpi/acpica/nsutils.c:666
acpi_ns_get_node+0x76/0xc0 drivers/acpi/acpica/nsutils.c:726
acpi_ns_evaluate+0x283/0x1230 drivers/acpi/acpica/nseval.c:62
acpi_evaluate_object+0x657/0xd50 drivers/acpi/acpica/nsxfeval.c:354
acpi_get_physical_device_location+0xa0/0x2d0 drivers/acpi/utils.c:504
acpi_store_pld_crc drivers/acpi/scan.c:728 [inline]
acpi_device_add+0x6c4/0x940 drivers/acpi/scan.c:787
acpi_add_single_object+0x1621/0x1b70 drivers/acpi/scan.c:1910
page last free pid 1 tgid 1 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1433 [inline]
__free_frozen_pages+0xbf8/0xd70 mm/page_alloc.c:2973
discard_slab mm/slub.c:3346 [inline]
__put_partials+0x146/0x170 mm/slub.c:3886
__slab_free+0x294/0x320 mm/slub.c:5956
qlink_free mm/kasan/quarantine.c:163 [inline]
qlist_free_all+0x97/0x100 mm/kasan/quarantine.c:179
kasan_quarantine_remove_cache+0x1ca/0x360 mm/kasan/quarantine.c:364
kmem_cache_shrink+0xd/0x20 mm/slab_common.c:564
acpi_os_purge_cache+0x15/0x20 drivers/acpi/osl.c:1605
acpi_purge_cached_objects+0xd5/0x100 drivers/acpi/acpica/utxface.c:240
acpi_initialize_objects+0x2e/0xb0 drivers/acpi/acpica/utxfinit.c:250
acpi_bus_init+0xaf/0x570 drivers/acpi/bus.c:1367
acpi_init+0xa1/0x1f0 drivers/acpi/bus.c:1456
do_one_initcall+0x250/0x840 init/main.c:1378
do_initcall_level+0x104/0x190 init/main.c:1440
do_initcalls+0x59/0xa0 init/main.c:1456
kernel_init_freeable+0x2a6/0x3d0 init/main.c:1688
kernel_init+0x1d/0x1d0 init/main.c:1578
Memory state around the buggy address:
ffff8881663cbb80: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
ffff8881663cbc00: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
>ffff8881663cbc80: fa fc fc fc 01 fc fc fc 00 fc fc fc fa fc fc fc
^
ffff8881663cbd00: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
ffff8881663cbd80: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
==================================================================
***
KASAN: slab-use-after-free Read in __dynamic_dev_dbg
tree: linux-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next
base: f9d6fc9557e68b48253818870d002dc4784cb2f1
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/c63282ed-913a-4673-b0e4-cd21246874b2/config
syz repro: https://ci.syzbot.org/findings/a49a383b-f0f8-498a-9415-6a927fc7d4b7/syz_repro
==================================================================
BUG: KASAN: slab-use-after-free in dev_driver_string+0x35/0xd0 drivers/base/core.c:2406
Read of size 8 at addr ffff8881137960e0 by task syz.0.17/6043
CPU: 1 UID: 0 PID: 6043 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xba/0x230 mm/kasan/report.c:482
kasan_report+0x117/0x150 mm/kasan/report.c:595
dev_driver_string+0x35/0xd0 drivers/base/core.c:2406
__dynamic_dev_dbg+0x1ae/0x2e0 lib/dynamic_debug.c:906
display_close+0x1f9/0x240 drivers/media/rc/imon.c:576
__fput+0x44f/0xa70 fs/file_table.c:469
task_work_run+0x1d9/0x270 kernel/task_work.c:233
exit_task_work include/linux/task_work.h:40 [inline]
do_exit+0x69b/0x2310 kernel/exit.c:971
do_group_exit+0x21b/0x2d0 kernel/exit.c:1112
get_signal+0x1284/0x1330 kernel/signal.c:3034
arch_do_signal_or_restart+0xbc/0x830 arch/x86/kernel/signal.c:337
__exit_to_user_mode_loop kernel/entry/common.c:41 [inline]
exit_to_user_mode_loop+0x86/0x480 kernel/entry/common.c:75
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode_work include/linux/entry-common.h:159 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:194 [inline]
do_syscall_64+0x2b7/0xf40 arch/x86/entry/syscall_64.c:104
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb43639c799
Code: Unable to access opcode bytes at 0x7fb43639c76f.
RSP: 002b:00007fb4372890e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00007fb436616188 RCX: 00007fb43639c799
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007fb436616188
RBP: 00007fb436616180 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fb436616218 R14: 00007ffeece49910 R15: 00007ffeece499f8
</TASK>
Allocated by task 5990:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
__kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
kasan_kmalloc include/linux/kasan.h:263 [inline]
__kmalloc_cache_noprof+0x3d1/0x6e0 mm/slub.c:5780
kmalloc_noprof include/linux/slab.h:957 [inline]
kzalloc_noprof include/linux/slab.h:1094 [inline]
usb_set_configuration+0x3c9/0x2110 drivers/usb/core/message.c:2037
usb_generic_driver_probe+0x8d/0x150 drivers/usb/core/generic.c:250
usb_probe_device+0x1c4/0x3b0 drivers/usb/core/driver.c:291
call_driver_probe drivers/base/dd.c:-1 [inline]
really_probe+0x267/0xaf0 drivers/base/dd.c:661
__driver_probe_device+0x18c/0x320 drivers/base/dd.c:803
driver_probe_device+0x4f/0x240 drivers/base/dd.c:833
__device_attach_driver+0x279/0x430 drivers/base/dd.c:961
bus_for_each_drv+0x258/0x2f0 drivers/base/bus.c:500
__device_attach+0x2c5/0x450 drivers/base/dd.c:1033
device_initial_probe+0xa1/0xd0 drivers/base/dd.c:1088
bus_probe_device+0x12a/0x220 drivers/base/bus.c:574
device_add+0x7b6/0xb70 drivers/base/core.c:3689
usb_new_device+0xa08/0x16f0 drivers/usb/core/hub.c:2695
hub_port_connect drivers/usb/core/hub.c:5567 [inline]
hub_port_connect_change drivers/usb/core/hub.c:5707 [inline]
port_event drivers/usb/core/hub.c:5871 [inline]
hub_event+0x2a1c/0x4f30 drivers/usb/core/hub.c:5953
process_one_work kernel/workqueue.c:3257 [inline]
process_scheduled_works+0xaec/0x17a0 kernel/workqueue.c:3340
worker_thread+0xda6/0x1360 kernel/workqueue.c:3421
kthread+0x726/0x8b0 kernel/kthread.c:463
ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
Freed by task 9:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
poison_slab_object mm/kasan/common.c:253 [inline]
__kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:2540 [inline]
slab_free mm/slub.c:6674 [inline]
kfree+0x1be/0x650 mm/slub.c:6886
device_release+0x9e/0x1d0 drivers/base/core.c:-1
kobject_cleanup lib/kobject.c:689 [inline]
kobject_release lib/kobject.c:720 [inline]
kref_put include/linux/kref.h:65 [inline]
kobject_put+0x228/0x560 lib/kobject.c:737
usb_disable_device+0x611/0x8d0 drivers/usb/core/message.c:1425
usb_disconnect+0x32f/0x990 drivers/usb/core/hub.c:2345
hub_port_connect drivers/usb/core/hub.c:5407 [inline]
hub_port_connect_change drivers/usb/core/hub.c:5707 [inline]
port_event drivers/usb/core/hub.c:5871 [inline]
hub_event+0x1cc9/0x4f30 drivers/usb/core/hub.c:5953
process_one_work kernel/workqueue.c:3257 [inline]
process_scheduled_works+0xaec/0x17a0 kernel/workqueue.c:3340
worker_thread+0xda6/0x1360 kernel/workqueue.c:3421
kthread+0x726/0x8b0 kernel/kthread.c:463
ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
The buggy address belongs to the object at ffff888113796000
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 224 bytes inside of
freed 2048-byte region [ffff888113796000, ffff888113796800)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x113790
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x17ff00000000040(head|node=0|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 017ff00000000040 ffff888100042000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000080008 00000000f5000000 0000000000000000
head: 017ff00000000040 ffff888100042000 dead000000000122 0000000000000000
head: 0000000000000000 0000000000080008 00000000f5000000 0000000000000000
head: 017ff00000000003 ffffea00044de401 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 6019, tgid 6019 (kworker/0:5), ts 128207296876, free_ts 125356693648
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x228/0x280 mm/page_alloc.c:1884
prep_new_page mm/page_alloc.c:1892 [inline]
get_page_from_freelist+0x24dc/0x2580 mm/page_alloc.c:3945
__alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5240
alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2486
alloc_slab_page mm/slub.c:3075 [inline]
allocate_slab+0x86/0x3a0 mm/slub.c:3248
new_slab mm/slub.c:3302 [inline]
___slab_alloc+0xd82/0x1760 mm/slub.c:4656
__slab_alloc+0x65/0x100 mm/slub.c:4779
__slab_alloc_node mm/slub.c:4855 [inline]
slab_alloc_node mm/slub.c:5251 [inline]
__do_kmalloc_node mm/slub.c:5656 [inline]
__kmalloc_node_track_caller_noprof+0x5b7/0x7f0 mm/slub.c:5768
kmalloc_reserve+0x136/0x290 net/core/skbuff.c:608
__alloc_skb+0x204/0x390 net/core/skbuff.c:690
alloc_skb include/linux/skbuff.h:1383 [inline]
mld_newpack+0x14c/0xc90 net/ipv6/mcast.c:1775
add_grhead+0x5a/0x2a0 net/ipv6/mcast.c:1886
add_grec+0x1452/0x1740 net/ipv6/mcast.c:2025
mld_send_cr net/ipv6/mcast.c:2148 [inline]
mld_ifc_work+0x6e6/0xe70 net/ipv6/mcast.c:2693
process_one_work kernel/workqueue.c:3257 [inline]
process_scheduled_works+0xaec/0x17a0 kernel/workqueue.c:3340
worker_thread+0xda6/0x1360 kernel/workqueue.c:3421
page last free pid 5243 tgid 5243 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1433 [inline]
__free_frozen_pages+0xbf8/0xd70 mm/page_alloc.c:2973
discard_slab mm/slub.c:3346 [inline]
__put_partials+0x146/0x170 mm/slub.c:3886
__slab_free+0x294/0x320 mm/slub.c:5956
qlink_free mm/kasan/quarantine.c:163 [inline]
qlist_free_all+0x97/0x100 mm/kasan/quarantine.c:179
kasan_quarantine_reduce+0x148/0x160 mm/kasan/quarantine.c:286
__kasan_slab_alloc+0x22/0x80 mm/kasan/common.c:350
kasan_slab_alloc include/linux/kasan.h:253 [inline]
slab_post_alloc_hook mm/slub.c:4953 [inline]
slab_alloc_node mm/slub.c:5263 [inline]
kmem_cache_alloc_node_noprof+0x427/0x6f0 mm/slub.c:5315
__alloc_skb+0x1d7/0x390 net/core/skbuff.c:679
alloc_skb include/linux/skbuff.h:1383 [inline]
alloc_skb_with_frags+0xca/0x890 net/core/skbuff.c:6715
sock_alloc_send_pskb+0x878/0x990 net/core/sock.c:2995
unix_dgram_sendmsg+0x460/0x18e0 net/unix/af_unix.c:2130
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
__sys_sendto+0x709/0x7a0 net/socket.c:2206
__do_sys_sendto net/socket.c:2213 [inline]
__se_sys_sendto net/socket.c:2209 [inline]
__x64_sys_sendto+0xde/0x100 net/socket.c:2209
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xe2/0xf40 arch/x86/entry/syscall_64.c:98
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Memory state around the buggy address:
ffff888113795f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888113796000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888113796080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888113796100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888113796180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
* Re: [RFC PATCH v8 05/10] sched/isolation: Mark housekeeping_overridden key as __ro_after_init
2026-03-24 9:47 ` [RFC PATCH v8 05/10] sched/isolation: Mark housekeeping_overridden key as __ro_after_init Valentin Schneider
@ 2026-03-24 15:17 ` Shrikanth Hegde
2026-03-24 19:46 ` Valentin Schneider
0 siblings, 1 reply; 14+ messages in thread
From: Shrikanth Hegde @ 2026-03-24 15:17 UTC (permalink / raw)
To: Valentin Schneider, linux-kernel, linux-mm, x86
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik
On 3/24/26 3:17 PM, Valentin Schneider wrote:
> housekeeping_overridden is only ever enabled in the __init function
> housekeeping_init(), and is never disabled. Mark it __ro_after_init.
>
what about housekeeping_update(), which can be called via isolated_cpus_update()
when creating isolated cpusets?
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> kernel/sched/isolation.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 3ad0d6df6a0a2..54d1d93cdeea5 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -16,7 +16,7 @@ enum hk_flags {
> HK_FLAG_KERNEL_NOISE = BIT(HK_TYPE_KERNEL_NOISE),
> };
>
> -DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
> +DEFINE_STATIC_KEY_FALSE_RO(housekeeping_overridden);
> EXPORT_SYMBOL_GPL(housekeeping_overridden);
>
> struct housekeeping {
* Re: [RFC PATCH v8 05/10] sched/isolation: Mark housekeeping_overridden key as __ro_after_init
2026-03-24 15:17 ` Shrikanth Hegde
@ 2026-03-24 19:46 ` Valentin Schneider
0 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2026-03-24 19:46 UTC (permalink / raw)
To: Shrikanth Hegde, linux-kernel, linux-mm, x86
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini,
Arnd Bergmann, Frederic Weisbecker, Paul E. McKenney, Jason Baron,
Steven Rostedt, Ard Biesheuvel, Sami Tolvanen, David S. Miller,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Mel Gorman, Andrew Morton,
Masahiro Yamada, Han Shen, Rik van Riel, Jann Horn, Dan Carpenter,
Oleg Nesterov, Juri Lelli, Clark Williams, Tomas Glozar,
Yair Podemsky, Marcelo Tosatti, Daniel Wagner, Petr Tesarik
On 24/03/26 20:47, Shrikanth Hegde wrote:
> On 3/24/26 3:17 PM, Valentin Schneider wrote:
>> housekeeping_overridden is only ever enabled in the __init function
>> housekeeping_init(), and is never disabled. Mark it __ro_after_init.
>>
>
> what about housekeeping_update(), which can be called via isolated_cpus_update()
> when creating isolated cpusets?
>
Doh, I'd even seen the patches but forgot to make a note of em. So yeah,
that's not __init and the key is now flippable at runtime.
I suppose I could resurrect:
https://lore.kernel.org/lkml/20251114150133.1056710-6-vschneid@redhat.com/
+ the is_kernel_noinstr_text() thing from
https://lore.kernel.org/lkml/20251114151428.1064524-5-vschneid@redhat.com/
and have the IPI associated with flipping that key never be deferred, so if
we suddenly have some CPU isolation, every CPU gets poked.
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> ---
>> kernel/sched/isolation.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
>> index 3ad0d6df6a0a2..54d1d93cdeea5 100644
>> --- a/kernel/sched/isolation.c
>> +++ b/kernel/sched/isolation.c
>> @@ -16,7 +16,7 @@ enum hk_flags {
>> HK_FLAG_KERNEL_NOISE = BIT(HK_TYPE_KERNEL_NOISE),
>> };
>>
>> -DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
>> +DEFINE_STATIC_KEY_FALSE_RO(housekeeping_overridden);
>> EXPORT_SYMBOL_GPL(housekeeping_overridden);
>>
>> struct housekeeping {
Thread overview: 14+ messages
2026-03-24 9:47 [RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 01/10] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 02/10] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 03/10] objtool: Always pass a section to validate_unwind_hints() Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 04/10] x86/retpoline: Make warn_thunk_thunk .noinstr Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 05/10] sched/isolation: Mark housekeeping_overridden key as __ro_after_init Valentin Schneider
2026-03-24 15:17 ` Shrikanth Hegde
2026-03-24 19:46 ` Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 06/10] objtool: Add .entry.text validation for static branches Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 07/10] x86/jump_label: Add ASM support for static_branch_likely() Valentin Schneider
2026-03-24 9:47 ` [RFC PATCH v8 08/10] x86/mm/pti: Introduce a kernel/user CR3 software signal Valentin Schneider
2026-03-24 9:48 ` [RFC PATCH v8 09/10] context_tracking,x86: Defer kernel text patching IPIs when tracking CR3 switches Valentin Schneider
2026-03-24 9:48 ` [RFC PATCH v8 10/10] x86/mm, mm/vmalloc: Defer kernel TLB flush " Valentin Schneider
2026-03-24 15:01 ` [syzbot ci] Re: context_tracking,x86: Defer some IPIs until a user->kernel transition syzbot ci