* [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition
@ 2024-11-19 15:34 Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 01/15] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
` (15 more replies)
0 siblings, 16 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:34 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, provided it can be executed "early
enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, like a secondary call_single_queue, turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
What about IPIs whose callback takes a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.
This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
Kernel entry vs execution of the deferred operation
===================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can itself be executed (i.e. before we start getting into
context_tracking.c proper):
idtentry_func_foo() <--- we're in the kernel
irqentry_enter()
enter_from_user_mode()
__ct_user_exit()
ct_kernel_enter_state()
ct_work_flush() <--- deferred operation is executed here
This means one must take extra care as to what can happen in the early entry code,
and ensure that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core(). Patches doing the actual deferral have
more detail on this.
Patches
=======
o Patches 1-3 are standalone cleanups.
o Patches 4-5 add an RCU testing feature.
o Patches 6-8 add a new type of jump label for static keys that will not have
their IPI be deferred.
o Patch 9 adds objtool verification of static keys vs their text_poke IPI
deferral.
o Patches 10-14 add the actual IPI deferrals.
o Patch 15 is a freebie to enable the deferral feature for NO_HZ_IDLE.
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v3
RFC status
==========
Things I'd like to get comments on and/or that are a bit WIPish; they're called
out in the individual changelogs:
o "forceful" jump label naming which I don't particularly like
o objtool usage of 'offset_of(static_key.type)' and JUMP_TYPE_FORCEFUL. I've
hardcoded them but it could do with being shoved in a kernel header objtool
can include directly
o The noinstr variant of __flush_tlb_all() doesn't have a paravirt variant, does
it need one?
Testing
=======
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL9 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
rteval --onlyload --loads-cpulist=$HK_CPUS \
--hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 1hr.
v6.12-rc4
# This is the actual IPI count
$ trace-cmd report trace-base.dat | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
1782 callback=generic_smp_call_function_single_interrupt+0x0
73 callback=0x0
# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
22048 func=tlb_remove_table_smp_sync
16536 func=do_sync_core
2262 func=do_flush_tlb_all
182 func=do_kernel_range_flush
144 func=rcu_exp_handler
60 func=sched_ttwu_pending
v6.12-rc4 + patches:
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
1168 callback=generic_smp_call_function_single_interrupt+0x0
74 callback=0x0
# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
23686 func=tlb_remove_table_smp_sync
192 func=rcu_exp_handler
65 func=sched_ttwu_pending
Interestingly, tlb_remove_table_smp_sync() started showing up on this machine,
while it didn't during testing for v2 on the very same machine. Yair had a
series addressing this [5] which, per these results, would be worth revisiting.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for his guidance regarding objtool and hinting at the
.data..ro_after_init section.
o All of the folks who attended various talks about this and provided precious
feedback.
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/
Revisions
=========
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc7
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
o Overall changelog & comments cleanup
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Valentin Schneider (15):
objtool: Make validate_call() recognize indirect calls to pv_ops[]
objtool: Flesh out warning related to pv_ops[] calls
sched/clock: Make sched_clock_running __ro_after_init
rcu: Add a small-width RCU watching counter debug option
rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
jump_label: Add forceful jump label type
x86/speculation/mds: Make mds_idle_clear forceful
sched/clock, x86: Make __sched_clock_stable forceful
objtool: Warn about non __ro_after_init static key usage in .noinstr
x86/alternatives: Record text_poke's of JUMP_TYPE_FORCEFUL labels
context-tracking: Introduce work deferral infrastructure
context_tracking,x86: Defer kernel text patching IPIs
context_tracking,x86: Add infrastructure to defer kernel TLBI
x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL
CPUs
context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE
arch/Kconfig | 9 +++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/context_tracking_work.h | 20 +++++++
arch/x86/include/asm/special_insns.h | 1 +
arch/x86/include/asm/text-patching.h | 13 ++++-
arch/x86/include/asm/tlbflush.h | 17 +++++-
arch/x86/kernel/alternative.c | 49 ++++++++++++----
arch/x86/kernel/cpu/bugs.c | 2 +-
arch/x86/kernel/cpu/common.c | 6 +-
arch/x86/kernel/jump_label.c | 7 ++-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/mm/tlb.c | 49 ++++++++++++++--
include/linux/context_tracking.h | 21 +++++++
include/linux/context_tracking_state.h | 54 ++++++++++++++---
include/linux/context_tracking_work.h | 28 +++++++++
include/linux/jump_label.h | 26 ++++++---
kernel/context_tracking.c | 46 ++++++++++++++-
kernel/rcu/Kconfig.debug | 14 +++++
kernel/sched/clock.c | 4 +-
kernel/time/Kconfig | 19 ++++++
mm/vmalloc.c | 35 +++++++++--
tools/objtool/Documentation/objtool.txt | 13 +++++
tools/objtool/check.c | 58 ++++++++++++++++---
tools/objtool/include/objtool/check.h | 1 +
tools/objtool/include/objtool/special.h | 2 +
tools/objtool/special.c | 3 +
.../selftests/rcutorture/configs/rcu/TREE04 | 1 +
29 files changed, 450 insertions(+), 59 deletions(-)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h
--
2.43.0
^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC PATCH v3 01/15] objtool: Make validate_call() recognize indirect calls to pv_ops[]
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
@ 2024-11-19 15:34 ` Valentin Schneider
2024-11-19 20:38 ` Josh Poimboeuf
2024-11-19 15:34 ` [RFC PATCH v3 02/15] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
` (14 subsequent siblings)
15 siblings, 1 reply; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:34 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
call_dest_name() does not get passed the file pointer of validate_call(),
which means its invocation of insn_reloc() will always return NULL. Make it
take a file pointer.
While at it, make sure call_dest_name() uses arch_dest_reloc_offset(),
otherwise it gets the pv_ops[] offset wrong.
Fabricating an intentional warning shows the change; previously:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
now:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[1]() leaves .noinstr.text section
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
tools/objtool/check.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 6604f5d038aad..5f1d0f95fc04b 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3448,7 +3448,7 @@ static inline bool func_uaccess_safe(struct symbol *func)
return false;
}
-static inline const char *call_dest_name(struct instruction *insn)
+static inline const char *call_dest_name(struct objtool_file *file, struct instruction *insn)
{
static char pvname[19];
struct reloc *reloc;
@@ -3457,9 +3457,9 @@ static inline const char *call_dest_name(struct instruction *insn)
if (insn_call_dest(insn))
return insn_call_dest(insn)->name;
- reloc = insn_reloc(NULL, insn);
+ reloc = insn_reloc(file, insn);
if (reloc && !strcmp(reloc->sym->name, "pv_ops")) {
- idx = (reloc_addend(reloc) / sizeof(void *));
+ idx = (arch_dest_reloc_offset(reloc_addend(reloc)) / sizeof(void *));
snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx);
return pvname;
}
@@ -3538,17 +3538,20 @@ static int validate_call(struct objtool_file *file,
{
if (state->noinstr && state->instr <= 0 &&
!noinstr_call_dest(file, insn, insn_call_dest(insn))) {
- WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(insn));
+ WARN_INSN(insn, "call to %s() leaves .noinstr.text section",
+ call_dest_name(file, insn));
return 1;
}
if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) {
- WARN_INSN(insn, "call to %s() with UACCESS enabled", call_dest_name(insn));
+ WARN_INSN(insn, "call to %s() with UACCESS enabled",
+ call_dest_name(file, insn));
return 1;
}
if (state->df) {
- WARN_INSN(insn, "call to %s() with DF set", call_dest_name(insn));
+ WARN_INSN(insn, "call to %s() with DF set",
+ call_dest_name(file, insn));
return 1;
}
--
2.43.0
* [RFC PATCH v3 02/15] objtool: Flesh out warning related to pv_ops[] calls
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 01/15] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
@ 2024-11-19 15:34 ` Valentin Schneider
2024-11-19 20:38 ` Josh Poimboeuf
2024-11-19 15:34 ` [RFC PATCH v3 03/15] sched/clock: Make sched_clock_running __ro_after_init Valentin Schneider
` (13 subsequent siblings)
15 siblings, 1 reply; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:34 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
I had to look into objtool itself to understand what this warning was
about; make it more explicit.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
tools/objtool/check.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 5f1d0f95fc04b..00e25492f5065 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3486,7 +3486,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
if (!target->sec->noinstr) {
- WARN("pv_ops[%d]: %s", idx, target->name);
+ WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
file->pv_ops[idx].clean = false;
}
}
--
2.43.0
* [RFC PATCH v3 03/15] sched/clock: Make sched_clock_running __ro_after_init
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 01/15] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 02/15] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
@ 2024-11-19 15:34 ` Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 04/15] rcu: Add a small-width RCU watching counter debug option Valentin Schneider
` (12 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:34 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
sched_clock_running is only ever enabled in the __init functions
sched_clock_init() and sched_clock_init_late(), and is never disabled. Mark
it __ro_after_init.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
kernel/sched/clock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index a09655b481402..200e5568b9894 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -66,7 +66,7 @@ notrace unsigned long long __weak sched_clock(void)
}
EXPORT_SYMBOL_GPL(sched_clock);
-static DEFINE_STATIC_KEY_FALSE(sched_clock_running);
+static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
/*
--
2.43.0
* [RFC PATCH v3 04/15] rcu: Add a small-width RCU watching counter debug option
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (2 preceding siblings ...)
2024-11-19 15:34 ` [RFC PATCH v3 03/15] sched/clock: Make sched_clock_running __ro_after_init Valentin Schneider
@ 2024-11-19 15:34 ` Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 05/15] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE Valentin Schneider
` (11 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:34 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Paul E. McKenney, Steven Rostedt, Masami Hiramatsu,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
A later commit will reduce the size of the RCU watching counter to free up
some bits for another purpose. Paul suggested adding a config option to
test the extreme case where the counter is reduced to its minimum usable
width for rcutorture to poke at, so do that.
Make it only configurable under RCU_EXPERT. While at it, add a comment to
explain the layout of context_tracking->state.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
include/linux/context_tracking_state.h | 44 ++++++++++++++++++++++----
kernel/rcu/Kconfig.debug | 14 ++++++++
2 files changed, 51 insertions(+), 7 deletions(-)
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 7b8433d5a8efe..0b81248aa03e2 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -18,12 +18,6 @@ enum ctx_state {
CT_STATE_MAX = 4,
};
-/* Odd value for watching, else even. */
-#define CT_RCU_WATCHING CT_STATE_MAX
-
-#define CT_STATE_MASK (CT_STATE_MAX - 1)
-#define CT_RCU_WATCHING_MASK (~CT_STATE_MASK)
-
struct context_tracking {
#ifdef CONFIG_CONTEXT_TRACKING_USER
/*
@@ -44,9 +38,45 @@ struct context_tracking {
#endif
};
+/*
+ * We cram two different things within the same atomic variable:
+ *
+ * CT_RCU_WATCHING_START CT_STATE_START
+ * | |
+ * v v
+ * MSB [ RCU watching counter ][ context_state ] LSB
+ * ^ ^
+ * | |
+ * CT_RCU_WATCHING_END CT_STATE_END
+ *
+ * Bits are used from the LSB upwards, so unused bits (if any) will always be in
+ * upper bits of the variable.
+ */
#ifdef CONFIG_CONTEXT_TRACKING
+#define CT_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
+
+#define CT_STATE_WIDTH bits_per(CT_STATE_MAX - 1)
+#define CT_STATE_START 0
+#define CT_STATE_END (CT_STATE_START + CT_STATE_WIDTH - 1)
+
+#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH)
+#define CT_RCU_WATCHING_WIDTH (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH)
+#define CT_RCU_WATCHING_START (CT_STATE_END + 1)
+#define CT_RCU_WATCHING_END (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1)
+#define CT_RCU_WATCHING BIT(CT_RCU_WATCHING_START)
+
+#define CT_STATE_MASK GENMASK(CT_STATE_END, CT_STATE_START)
+#define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
+
+#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
+
+static_assert(CT_STATE_WIDTH +
+ CT_RCU_WATCHING_WIDTH +
+ CT_UNUSED_WIDTH ==
+ CT_SIZE);
+
DECLARE_PER_CPU(struct context_tracking, context_tracking);
-#endif
+#endif /* CONFIG_CONTEXT_TRACKING */
#ifdef CONFIG_CONTEXT_TRACKING_USER
static __always_inline int __ct_state(void)
diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 9b0b52e1836fa..8dc505d841f8d 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -168,4 +168,18 @@ config RCU_STRICT_GRACE_PERIOD
when looking for certain types of RCU usage bugs, for example,
too-short RCU read-side critical sections.
+
+config RCU_DYNTICKS_TORTURE
+ bool "Minimize RCU dynticks counter size"
+ depends on RCU_EXPERT
+ default n
+ help
+ This option controls the width of the dynticks counter.
+
+ Lower values will make overflows more frequent, which will increase
+ the likelihood of extending grace-periods. This option sets the width
+ to its minimum usable value.
+
+ This has no value for production and is only for testing.
+
endmenu # "RCU Debugging"
--
2.43.0
* [RFC PATCH v3 05/15] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (3 preceding siblings ...)
2024-11-19 15:34 ` [RFC PATCH v3 04/15] rcu: Add a small-width RCU watching counter debug option Valentin Schneider
@ 2024-11-19 15:34 ` Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 06/15] jump_label: Add forceful jump label type Valentin Schneider
` (10 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:34 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Paul E. McKenney, Steven Rostedt, Masami Hiramatsu,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
We now have an RCU_EXPERT config for testing a small-sized RCU dynticks
counter: CONFIG_RCU_DYNTICKS_TORTURE.
Modify scenario TREE04 to use this config in order to test a
ridiculously small counter (2 bits).
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3ad..67caf4276bb01 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
CONFIG_RCU_EQS_DEBUG=y
CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_TORTURE=y
--
2.43.0
* [RFC PATCH v3 06/15] jump_label: Add forceful jump label type
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (4 preceding siblings ...)
2024-11-19 15:34 ` [RFC PATCH v3 05/15] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE Valentin Schneider
@ 2024-11-19 15:34 ` Valentin Schneider
2024-11-19 23:39 ` Josh Poimboeuf
2024-11-20 0:05 ` Josh Poimboeuf
2024-11-19 15:34 ` [RFC PATCH v3 07/15] x86/speculation/mds: Make mds_idle_clear forceful Valentin Schneider
` (9 subsequent siblings)
15 siblings, 2 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:34 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
Later commits will cause objtool to warn about non __ro_after_init static
keys being used in .noinstr sections in order to safely defer instruction
patching IPIs targeted at NOHZ_FULL CPUs.
Two such keys currently exist: mds_idle_clear and __sched_clock_stable,
which can both be modified at runtime.
As discussed at LPC 2024 during the realtime micro-conference, modifying
these specific keys incurs additional interference (SMT hotplug) or can
even be downright incompatible with NOHZ_FULL (unstable TSC).
Not deferring the IPI associated with modifying such keys is thus a minor
concern wrt NOHZ_FULL interference, so add a jump label type that will be
leveraged by both the kernel (to know not to defer the IPI) and objtool (to
know not to generate the aforementioned warning).
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
include/linux/jump_label.h | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index f5a2727ca4a9a..93e729545b941 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -200,7 +200,8 @@ struct module;
#define JUMP_TYPE_FALSE 0UL
#define JUMP_TYPE_TRUE 1UL
#define JUMP_TYPE_LINKED 2UL
-#define JUMP_TYPE_MASK 3UL
+#define JUMP_TYPE_FORCEFUL 4UL
+#define JUMP_TYPE_MASK 7UL
static __always_inline bool static_key_false(struct static_key *key)
{
@@ -244,12 +245,15 @@ extern enum jump_label_type jump_label_init_type(struct jump_entry *entry);
* raw value, but have added a BUILD_BUG_ON() to catch any issues in
* jump_label_init() see: kernel/jump_label.c.
*/
-#define STATIC_KEY_INIT_TRUE \
- { .enabled = { 1 }, \
- { .type = JUMP_TYPE_TRUE } }
-#define STATIC_KEY_INIT_FALSE \
- { .enabled = { 0 }, \
- { .type = JUMP_TYPE_FALSE } }
+#define __STATIC_KEY_INIT(_true, force) \
+ { .enabled = { _true }, \
+ { .type = (_true ? JUMP_TYPE_TRUE : JUMP_TYPE_FALSE) | \
+ (force ? JUMP_TYPE_FORCEFUL : 0UL)} }
+
+#define STATIC_KEY_INIT_TRUE __STATIC_KEY_INIT(true, false)
+#define STATIC_KEY_INIT_FALSE __STATIC_KEY_INIT(false, false)
+#define STATIC_KEY_INIT_TRUE_FORCE __STATIC_KEY_INIT(true, true)
+#define STATIC_KEY_INIT_FALSE_FORCE __STATIC_KEY_INIT(false, true)
#else /* !CONFIG_JUMP_LABEL */
@@ -369,6 +373,8 @@ struct static_key_false {
#define STATIC_KEY_TRUE_INIT (struct static_key_true) { .key = STATIC_KEY_INIT_TRUE, }
#define STATIC_KEY_FALSE_INIT (struct static_key_false){ .key = STATIC_KEY_INIT_FALSE, }
+#define STATIC_KEY_TRUE_FORCE_INIT (struct static_key_true) { .key = STATIC_KEY_INIT_TRUE_FORCE, }
+#define STATIC_KEY_FALSE_FORCE_INIT (struct static_key_false){ .key = STATIC_KEY_INIT_FALSE_FORCE, }
#define DEFINE_STATIC_KEY_TRUE(name) \
struct static_key_true name = STATIC_KEY_TRUE_INIT
@@ -376,6 +382,9 @@ struct static_key_false {
#define DEFINE_STATIC_KEY_TRUE_RO(name) \
struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT
+#define DEFINE_STATIC_KEY_TRUE_FORCE(name) \
+ struct static_key_true name = STATIC_KEY_TRUE_FORCE_INIT
+
#define DECLARE_STATIC_KEY_TRUE(name) \
extern struct static_key_true name
@@ -385,6 +394,9 @@ struct static_key_false {
#define DEFINE_STATIC_KEY_FALSE_RO(name) \
struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT
+#define DEFINE_STATIC_KEY_FALSE_FORCE(name) \
+ struct static_key_false name = STATIC_KEY_FALSE_FORCE_INIT
+
#define DECLARE_STATIC_KEY_FALSE(name) \
extern struct static_key_false name
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC PATCH v3 07/15] x86/speculation/mds: Make mds_idle_clear forceful
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (5 preceding siblings ...)
2024-11-19 15:34 ` [RFC PATCH v3 06/15] jump_label: Add forceful jump label type Valentin Schneider
@ 2024-11-19 15:34 ` Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 08/15] sched/clock, x86: Make __sched_clock_stable forceful Valentin Schneider
` (8 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:34 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
Later commits will cause objtool to warn about non __ro_after_init static
keys being used in .noinstr sections in order to safely defer instruction
patching IPIs targeted at NOHZ_FULL CPUs.
mds_idle_clear is used in .noinstr code, and can be modified at
runtime (SMT hotplug). Suppressing the text_poke_sync() IPI has little
benefit for this key, as hotplug implies eventually going through
takedown_cpu() -> stop_machine_cpuslocked() which is going to cause
interference on all online CPUs anyway.
Mark it as forceful to let the kernel know to always send the
text_poke_sync() IPI for it, and to let objtool know not to warn about it.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
arch/x86/kernel/cpu/bugs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 47a01d4028f60..dcc4d5d6e3b95 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -114,7 +114,7 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
/* Control MDS CPU buffer clear before idling (halt, mwait) */
-DEFINE_STATIC_KEY_FALSE(mds_idle_clear);
+DEFINE_STATIC_KEY_FALSE_FORCE(mds_idle_clear);
EXPORT_SYMBOL_GPL(mds_idle_clear);
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC PATCH v3 08/15] sched/clock, x86: Make __sched_clock_stable forceful
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (6 preceding siblings ...)
2024-11-19 15:34 ` [RFC PATCH v3 07/15] x86/speculation/mds: Make mds_idle_clear forceful Valentin Schneider
@ 2024-11-19 15:34 ` Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 09/15] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
` (7 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:34 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
Later commits will cause objtool to warn about non __ro_after_init static
keys being used in .noinstr sections in order to safely defer instruction
patching IPIs targeted at NOHZ_FULL CPUs.
__sched_clock_stable is used in .noinstr code, and can be modified at
runtime (e.g. KVM module loading). Suppressing the text_poke_sync() IPI has
little benefit for this key, as NOHZ_FULL is incompatible with an unstable
TSC anyway.
Mark it as forceful to let the kernel know to always send the
text_poke_sync() IPI for it, and to let objtool know not to warn about it.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
kernel/sched/clock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 200e5568b9894..dc94b3717f5ce 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -76,7 +76,7 @@ static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
* Similarly we start with __sched_clock_stable_early, thereby assuming we
* will become stable, such that there's only a single 1 -> 0 transition.
*/
-static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
+static DEFINE_STATIC_KEY_FALSE_FORCE(__sched_clock_stable);
static int __sched_clock_stable_early = 1;
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC PATCH v3 09/15] objtool: Warn about non __ro_after_init static key usage in .noinstr
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (7 preceding siblings ...)
2024-11-19 15:34 ` [RFC PATCH v3 08/15] sched/clock, x86: Make __sched_clock_stable forceful Valentin Schneider
@ 2024-11-19 15:34 ` Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 10/15] x86/alternatives: Record text_poke's of JUMP_TYPE_FORCEFUL labels Valentin Schneider
` (6 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:34 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Josh Poimboeuf, Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
Later commits will disallow runtime-mutable text in .noinstr sections in
order to safely defer instruction patching IPIs.
All static keys used in .noinstr sections have now been checked as being
either flagged as __ro_after_init, or as forceful static keys. Any
occurrence of this new warning would be the result of a code change that
will need looking at.
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
offset_of(static_key.type) and JUMP_TYPE_FORCEFUL would need to be shoved
into a somewhat standalone header file that could be included by objtool
itself.
---
tools/objtool/Documentation/objtool.txt | 13 ++++++++
tools/objtool/check.c | 41 +++++++++++++++++++++++++
tools/objtool/include/objtool/check.h | 1 +
tools/objtool/include/objtool/special.h | 2 ++
tools/objtool/special.c | 3 ++
5 files changed, 60 insertions(+)
diff --git a/tools/objtool/Documentation/objtool.txt b/tools/objtool/Documentation/objtool.txt
index 7c3ee959b63c7..06fa285873387 100644
--- a/tools/objtool/Documentation/objtool.txt
+++ b/tools/objtool/Documentation/objtool.txt
@@ -447,6 +447,19 @@ the objtool maintainers.
names and does not use module_init() / module_exit() macros to create
them.
+13. file.o: warning: func()+0x2a: non __ro_after_init static key "key" in .noinstr section
+
+ This means that the noinstr function func() uses a static key that can be
+ modified at runtime. This is not allowed as noinstr functions rely on
+ containing stable instructions after init.
+
+ Check whether the static key in question can really be modified at runtime,
+ and if it is only enabled during init then mark it as __ro_after_init. If it
+ genuinely needs to be modified at runtime:
+
+ 1) Directly rely on the underlying atomic count of the key in the noinstr
+ functions.
+ 2) Mark the static key as forceful.
If the error doesn't seem to make sense, it could be a bug in objtool.
Feel free to ask the objtool maintainer for help.
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 00e25492f5065..c1fb02c326839 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -2056,6 +2056,9 @@ static int add_special_section_alts(struct objtool_file *file)
alt->next = orig_insn->alts;
orig_insn->alts = alt;
+ if (special_alt->key_sym)
+ orig_insn->key_sym = special_alt->key_sym;
+
list_del(&special_alt->list);
free(special_alt);
}
@@ -3605,6 +3608,41 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct
return 0;
}
+static bool static_key_is_forceful(struct symbol *key)
+{
+ if (!strcmp(key->sec->name, ".data")) {
+ unsigned long data_offset = key->sec->data->d_off;
+ unsigned long key_offset = key->sym.st_value;
+ char* data = key->sec->data->d_buf;
+
+ /*
+ * offset_of(static_key.type)
+ * v
+ * v JUMP_TYPE_FORCEFUL
+ * v v
+ */
+ return data[(key_offset + 8) - data_offset] & 0x4;
+ }
+
+ return false;
+}
+
+static int validate_static_key(struct instruction *insn, struct insn_state *state)
+{
+ if (state->noinstr && state->instr <= 0) {
+ if (static_key_is_forceful(insn->key_sym))
+ return 0;
+
+ if ((strcmp(insn->key_sym->sec->name, ".data..ro_after_init"))) {
+ WARN_INSN(insn, "non __ro_after_init static key \"%s\" in .noinstr section",
+ insn->key_sym->name);
+ return 1;
+ }
+ }
+
+ return 0;
+}
+
static struct instruction *next_insn_to_validate(struct objtool_file *file,
struct instruction *insn)
{
@@ -3766,6 +3804,9 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
if (handle_insn_ops(insn, next_insn, &state))
return 1;
+ if (insn->key_sym)
+ validate_static_key(insn, &state);
+
switch (insn->type) {
case INSN_RETURN:
diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h
index daa46f1f0965a..35dd21f8f41e1 100644
--- a/tools/objtool/include/objtool/check.h
+++ b/tools/objtool/include/objtool/check.h
@@ -77,6 +77,7 @@ struct instruction {
struct symbol *sym;
struct stack_op *stack_ops;
struct cfi_state *cfi;
+ struct symbol *key_sym;
};
static inline struct symbol *insn_func(struct instruction *insn)
diff --git a/tools/objtool/include/objtool/special.h b/tools/objtool/include/objtool/special.h
index 86d4af9c5aa9d..0e61f34fe3a28 100644
--- a/tools/objtool/include/objtool/special.h
+++ b/tools/objtool/include/objtool/special.h
@@ -27,6 +27,8 @@ struct special_alt {
struct section *new_sec;
unsigned long new_off;
+ struct symbol *key_sym;
+
unsigned int orig_len, new_len; /* group only */
};
diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index 097a69db82a0e..fefab2b471bf8 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -127,6 +127,9 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
return -1;
}
alt->key_addend = reloc_addend(key_reloc);
+
+ reloc_to_sec_off(key_reloc, &sec, &offset);
+ alt->key_sym = find_symbol_by_offset(sec, offset & ~3);
}
return 0;
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC PATCH v3 10/15] x86/alternatives: Record text_poke's of JUMP_TYPE_FORCEFUL labels
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (8 preceding siblings ...)
2024-11-19 15:34 ` [RFC PATCH v3 09/15] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
@ 2024-11-19 15:34 ` Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 11/15] context-tracking: Introduce work deferral infrastructure Valentin Schneider
` (5 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:34 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
Forceful static keys are used in early entry code where it is unsafe to
defer the sync_core() IPIs, and flagged as such via their ->type field.
Record that information when creating a text_poke_loc. The
text_poke_loc.old field is written to when first iterating a text_poke()
entry, and as such can be (ab)used to store this information at the start
of text_poke_bp_batch().
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
arch/x86/include/asm/text-patching.h | 12 ++++++++++--
arch/x86/kernel/alternative.c | 16 ++++++++++------
arch/x86/kernel/jump_label.c | 7 ++++---
3 files changed, 24 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 6259f1937fe77..e34de36cab61e 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -38,9 +38,17 @@ extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
extern void *text_poke_set(void *addr, int c, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
-extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
+extern void __text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate, bool force_ipi);
+static inline void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
+{
+ __text_poke_bp(addr, opcode, len, emulate, false);
+}
-extern void text_poke_queue(void *addr, const void *opcode, size_t len, const void *emulate);
+extern void __text_poke_queue(void *addr, const void *opcode, size_t len, const void *emulate, bool force_ipi);
+static inline void text_poke_queue(void *addr, const void *opcode, size_t len, const void *emulate)
+{
+ __text_poke_queue(addr, opcode, len, emulate, false);
+}
extern void text_poke_finish(void);
#define INT3_INSN_SIZE 1
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index d17518ca19b8b..954c4c0f7fc58 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2098,7 +2098,10 @@ struct text_poke_loc {
u8 opcode;
const u8 text[POKE_MAX_OPCODE_SIZE];
/* see text_poke_bp_batch() */
- u8 old;
+ union {
+ u8 old;
+ u8 force_ipi;
+ };
};
struct bp_patching_desc {
@@ -2385,7 +2388,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
}
static void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
- const void *opcode, size_t len, const void *emulate)
+ const void *opcode, size_t len, const void *emulate, bool force_ipi)
{
struct insn insn;
int ret, i = 0;
@@ -2402,6 +2405,7 @@ static void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
tp->rel_addr = addr - (void *)_stext;
tp->len = len;
tp->opcode = insn.opcode.bytes[0];
+ tp->force_ipi = force_ipi;
if (is_jcc32(&insn)) {
/*
@@ -2493,14 +2497,14 @@ void text_poke_finish(void)
text_poke_flush(NULL);
}
-void __ref text_poke_queue(void *addr, const void *opcode, size_t len, const void *emulate)
+void __ref __text_poke_queue(void *addr, const void *opcode, size_t len, const void *emulate, bool force_ipi)
{
struct text_poke_loc *tp;
text_poke_flush(addr);
tp = &tp_vec[tp_vec_nr++];
- text_poke_loc_init(tp, addr, opcode, len, emulate);
+ text_poke_loc_init(tp, addr, opcode, len, emulate, force_ipi);
}
/**
@@ -2514,10 +2518,10 @@ void __ref text_poke_queue(void *addr, const void *opcode, size_t len, const voi
* dynamically allocated memory. This function should be used when it is
* not possible to allocate memory.
*/
-void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
+void __ref __text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate, bool force_ipi)
{
struct text_poke_loc tp;
- text_poke_loc_init(&tp, addr, opcode, len, emulate);
+ text_poke_loc_init(&tp, addr, opcode, len, emulate, force_ipi);
text_poke_bp_batch(&tp, 1);
}
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index f5b8ef02d172c..e03a4f56b30fd 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -101,8 +101,8 @@ __jump_label_transform(struct jump_entry *entry,
text_poke_early((void *)jump_entry_code(entry), jlp.code, jlp.size);
return;
}
-
- text_poke_bp((void *)jump_entry_code(entry), jlp.code, jlp.size, NULL);
+ __text_poke_bp((void *)jump_entry_code(entry), jlp.code, jlp.size, NULL,
+ jump_entry_key(entry)->type & JUMP_TYPE_FORCEFUL);
}
static void __ref jump_label_transform(struct jump_entry *entry,
@@ -135,7 +135,8 @@ bool arch_jump_label_transform_queue(struct jump_entry *entry,
mutex_lock(&text_mutex);
jlp = __jump_label_patch(entry, type);
- text_poke_queue((void *)jump_entry_code(entry), jlp.code, jlp.size, NULL);
+ __text_poke_queue((void *)jump_entry_code(entry), jlp.code, jlp.size, NULL,
+ jump_entry_key(entry)->type & JUMP_TYPE_FORCEFUL);
mutex_unlock(&text_mutex);
return true;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC PATCH v3 11/15] context-tracking: Introduce work deferral infrastructure
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (9 preceding siblings ...)
2024-11-19 15:34 ` [RFC PATCH v3 10/15] x86/alternatives: Record text_poke's of JUMP_TYPE_FORCEFUL labels Valentin Schneider
@ 2024-11-19 15:34 ` Valentin Schneider
2024-11-20 10:54 ` Frederic Weisbecker
2024-11-19 15:34 ` [RFC PATCH v3 12/15] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
` (4 subsequent siblings)
15 siblings, 1 reply; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:34 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
smp_call_function() & friends have the unfortunate habit of sending IPIs to
isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online
CPUs.
Some callsites can be bent into doing the right thing, such as done by commit:
cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")
Unfortunately, not all SMP callbacks can be omitted in this
fashion. However, some of them only affect execution in kernelspace, which
means they don't have to be executed *immediately* if the target CPU is in
userspace: stashing the callback and executing it upon the next kernel entry
would suffice. x86 kernel instruction patching or kernel TLB invalidation
are prime examples.
Reduce the RCU dynticks counter width to free up some bits to be used as a
deferred callback bitmask. Add some build-time checks to validate that
setup.
Presence of CT_STATE_KERNEL in the ct_state prevents queuing deferred work.
Later commits introduce the bit:callback mappings.
Link: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
arch/Kconfig | 9 ++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/context_tracking_work.h | 14 ++++++
include/linux/context_tracking.h | 21 +++++++++
include/linux/context_tracking_state.h | 30 ++++++++-----
include/linux/context_tracking_work.h | 26 +++++++++++
kernel/context_tracking.c | 46 +++++++++++++++++++-
kernel/time/Kconfig | 5 +++
8 files changed, 140 insertions(+), 12 deletions(-)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h
diff --git a/arch/Kconfig b/arch/Kconfig
index bd9f095d69fa0..e7f3f797a34a4 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -912,6 +912,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK
- No use of instrumentation, unless instrumentation_begin() got
called.
+config HAVE_CONTEXT_TRACKING_WORK
+ bool
+ help
+ Architecture supports deferring work while not in kernel context.
+ This is especially useful on setups with isolated CPUs that might
+ want to avoid being interrupted to perform housekeeping tasks (for
+ ex. TLB invalidation or icache invalidation). The housekeeping
+ operations are performed upon re-entering the kernel.
+
config HAVE_TIF_NOHZ
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 16354dfa6d965..c703376dd326b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -213,6 +213,7 @@ config X86
select HAVE_CMPXCHG_LOCAL
select HAVE_CONTEXT_TRACKING_USER if X86_64
select HAVE_CONTEXT_TRACKING_USER_OFFSTACK if HAVE_CONTEXT_TRACKING_USER
+ select HAVE_CONTEXT_TRACKING_WORK if X86_64
select HAVE_C_RECORDMCOUNT
select HAVE_OBJTOOL_MCOUNT if HAVE_OBJTOOL
select HAVE_OBJTOOL_NOP_MCOUNT if HAVE_OBJTOOL_MCOUNT
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
new file mode 100644
index 0000000000000..5bc29e6b2ed38
--- /dev/null
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
+#define _ASM_X86_CONTEXT_TRACKING_WORK_H
+
+static __always_inline void arch_context_tracking_work(int work)
+{
+ switch (work) {
+ case CONTEXT_WORK_n:
+ // Do work...
+ break;
+ }
+}
+
+#endif
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index af9fe87a09225..16a2eb7525f1f 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -5,6 +5,7 @@
#include <linux/sched.h>
#include <linux/vtime.h>
#include <linux/context_tracking_state.h>
+#include <linux/context_tracking_work.h>
#include <linux/instrumentation.h>
#include <asm/ptrace.h>
@@ -137,6 +138,26 @@ static __always_inline unsigned long ct_state_inc(int incby)
return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state));
}
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static __always_inline unsigned long ct_state_inc_clear_work(int incby)
+{
+ struct context_tracking *ct = this_cpu_ptr(&context_tracking);
+ unsigned long new, old, state;
+
+ state = arch_atomic_read(&ct->state);
+ do {
+ old = state;
+ new = old & ~CONTEXT_WORK_MASK;
+ new += incby;
+ state = arch_atomic_cmpxchg(&ct->state, old, new);
+ } while (old != state);
+
+ return new;
+}
+#else
+#define ct_state_inc_clear_work(x) ct_state_inc(x)
+#endif
+
static __always_inline bool warn_rcu_enter(void)
{
bool ret = false;
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 0b81248aa03e2..d1d37dbdf7195 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -5,6 +5,7 @@
#include <linux/percpu.h>
#include <linux/static_key.h>
#include <linux/context_tracking_irq.h>
+#include <linux/context_tracking_work.h>
/* Offset to allow distinguishing irq vs. task-based idle entry/exit. */
#define CT_NESTING_IRQ_NONIDLE ((LONG_MAX / 2) + 1)
@@ -39,16 +40,19 @@ struct context_tracking {
};
/*
- * We cram two different things within the same atomic variable:
+ * We cram up to three different things within the same atomic variable:
*
- * CT_RCU_WATCHING_START CT_STATE_START
- * | |
- * v v
- * MSB [ RCU watching counter ][ context_state ] LSB
- * ^ ^
- * | |
- * CT_RCU_WATCHING_END CT_STATE_END
+ * CT_RCU_WATCHING_START CT_STATE_START
+ * | CT_WORK_START |
+ * | | |
+ * v v v
+ * MSB [ RCU watching counter ][ context work ][ context_state ] LSB
+ * ^ ^ ^
+ * | | |
+ * | CT_WORK_END |
+ * CT_RCU_WATCHING_END CT_STATE_END
*
+ * The [ context work ] region spans 0 bits if CONFIG_CONTEXT_TRACKING_WORK=n
* Bits are used from the LSB upwards, so unused bits (if any) will always be in
* upper bits of the variable.
*/
@@ -59,18 +63,24 @@ struct context_tracking {
#define CT_STATE_START 0
#define CT_STATE_END (CT_STATE_START + CT_STATE_WIDTH - 1)
-#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH)
+#define CT_WORK_WIDTH (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? CONTEXT_WORK_MAX_OFFSET : 0)
+#define CT_WORK_START (CT_STATE_END + 1)
+#define CT_WORK_END (CT_WORK_START + CT_WORK_WIDTH - 1)
+
+#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_WORK_WIDTH - CT_STATE_WIDTH)
#define CT_RCU_WATCHING_WIDTH (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH)
-#define CT_RCU_WATCHING_START (CT_STATE_END + 1)
+#define CT_RCU_WATCHING_START (CT_WORK_END + 1)
#define CT_RCU_WATCHING_END (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1)
#define CT_RCU_WATCHING BIT(CT_RCU_WATCHING_START)
#define CT_STATE_MASK GENMASK(CT_STATE_END, CT_STATE_START)
+#define CT_WORK_MASK GENMASK(CT_WORK_END, CT_WORK_START)
#define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
static_assert(CT_STATE_WIDTH +
+ CT_WORK_WIDTH +
CT_RCU_WATCHING_WIDTH +
CT_UNUSED_WIDTH ==
CT_SIZE);
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
new file mode 100644
index 0000000000000..fb74db8876dd2
--- /dev/null
+++ b/include/linux/context_tracking_work.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CONTEXT_TRACKING_WORK_H
+#define _LINUX_CONTEXT_TRACKING_WORK_H
+
+#include <linux/bitops.h>
+
+enum {
+ CONTEXT_WORK_n_OFFSET,
+ CONTEXT_WORK_MAX_OFFSET
+};
+
+enum ct_work {
+ CONTEXT_WORK_n = BIT(CONTEXT_WORK_n_OFFSET),
+ CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET)
+};
+
+#include <asm/context_tracking_work.h>
+
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+extern bool ct_set_cpu_work(unsigned int cpu, unsigned int work);
+#else
+static inline bool
+ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; }
+#endif
+
+#endif
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 938c48952d265..37b094ea56fb6 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -72,6 +72,47 @@ static __always_inline void rcu_task_trace_heavyweight_exit(void)
#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
}
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static noinstr void ct_work_flush(unsigned long seq)
+{
+ int bit;
+
+ seq = (seq & CT_WORK_MASK) >> CT_WORK_START;
+
+ /*
+ * arch_context_tracking_work() must be noinstr, non-blocking,
+ * and NMI safe.
+ */
+ for_each_set_bit(bit, &seq, CONTEXT_WORK_MAX)
+ arch_context_tracking_work(BIT(bit));
+}
+
+bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
+{
+ struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
+ unsigned int old;
+ bool ret = false;
+
+ preempt_disable();
+
+ old = atomic_read(&ct->state);
+ /*
+ * Try setting the work until either
+ * - the target CPU has entered kernelspace
+ * - the work has been set
+ */
+ do {
+ ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START));
+ } while (!ret && ((old & CT_STATE_MASK) != CT_STATE_KERNEL));
+
+ preempt_enable();
+ return ret;
+}
+#else
+static __always_inline void ct_work_flush(unsigned long work) { }
+static __always_inline void ct_work_clear(struct context_tracking *ct) { }
+#endif
+
/*
* Record entry into an extended quiescent state. This is only to be
* called when not already in an extended quiescent state, that is,
@@ -88,7 +129,7 @@ static noinstr void ct_kernel_exit_state(int offset)
* next idle sojourn.
*/
rcu_task_trace_heavyweight_enter(); // Before CT state update!
- seq = ct_state_inc(offset);
+ seq = ct_state_inc_clear_work(offset);
// RCU is no longer watching. Better be in extended quiescent state!
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & CT_RCU_WATCHING));
}
@@ -100,7 +141,7 @@ static noinstr void ct_kernel_exit_state(int offset)
*/
static noinstr void ct_kernel_enter_state(int offset)
{
- int seq;
+ unsigned long seq;
/*
* CPUs seeing atomic_add_return() must see prior idle sojourns,
@@ -108,6 +149,7 @@ static noinstr void ct_kernel_enter_state(int offset)
* critical section.
*/
seq = ct_state_inc(offset);
+ ct_work_flush(seq);
// RCU is now watching. Better not be in an extended quiescent state!
rcu_task_trace_heavyweight_exit(); // After CT state update!
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & CT_RCU_WATCHING));
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 8ebb6d5a106be..04efc2b605823 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -186,6 +186,11 @@ config CONTEXT_TRACKING_USER_FORCE
Say N otherwise, this option brings an overhead that you
don't want in production.
+config CONTEXT_TRACKING_WORK
+ bool
+ depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+ default y
+
config NO_HZ
bool "Old Idle dynticks config"
help
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC PATCH v3 12/15] context_tracking,x86: Defer kernel text patching IPIs
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (10 preceding siblings ...)
2024-11-19 15:34 ` [RFC PATCH v3 11/15] context-tracking: Introduce work deferral infrastructure Valentin Schneider
@ 2024-11-19 15:34 ` Valentin Schneider
2024-11-19 15:35 ` [RFC PATCH v3 13/15] context_tracking,x86: Add infrastructure to defer kernel TLBI Valentin Schneider
` (3 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:34 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Peter Zijlstra, Nicolas Saenz Julienne, Steven Rostedt,
Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
them vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.
As the synchronization IPIs are sent using a blocking call, returning from
text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is deferred.
In other words, to safely defer this synchronization, any kernel
instruction leading to the execution of the deferred instruction
sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
This means we must pay attention to mutable instructions in the early entry
code:
- alternatives
- static keys
- all sorts of probes (kprobes/ftrace/bpf/???)
The early entry code leading to ct_work_flush() is noinstr, which gets rid
of the probes.
Alternatives are safe, because it's boot-time patching (before SMP is
even brought up) which is before any IPI deferral can happen.
This leaves us with static keys. Any static key used in early entry code
should be only forever-enabled at boot time, IOW __ro_after_init (pretty
much like alternatives). Exceptions are marked as forceful and will always
generate an IPI when flipped. Objtool can now point at static keys that
don't respect this, and all static keys used in early entry code have been
verified to behave accordingly.
Leverage the new context_tracking infrastructure to defer sync_core() IPIs
to a target CPU's next kernel entry.
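The handshake this relies on (introduced by the deferral infrastructure in patch 11) can be sketched as a small userspace model. This is an illustration, not kernel code: the function names mirror the series, but the state layout is simplified and the per-CPU word is a plain C11 atomic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Simplified model of the per-CPU context-tracking word: one bit says
 * whether the CPU is "watching" (executing in the kernel), the low bits
 * hold deferred-work flags.
 */
#define CT_RCU_WATCHING   (1u << 31)
#define CONTEXT_WORK_SYNC (1u << 0)

static _Atomic unsigned int ct_state[2]; /* two model CPUs */

/*
 * Sender side: queue work only if the target CPU is in an extended
 * quiescent state; otherwise report failure so the caller IPIs it.
 */
static bool ct_set_cpu_work(int cpu, unsigned int work)
{
	unsigned int old = atomic_load(&ct_state[cpu]);

	do {
		if (old & CT_RCU_WATCHING)
			return false; /* in the kernel: must send the IPI */
	} while (!atomic_compare_exchange_weak(&ct_state[cpu], &old, old | work));

	return true;
}

/*
 * Entry side: flip into the kernel and return any work queued while the
 * CPU was out; the real code runs arch_context_tracking_work() on it.
 */
static unsigned int ct_kernel_enter(int cpu)
{
	unsigned int old = atomic_exchange(&ct_state[cpu], CT_RCU_WATCHING);

	return old & ~CT_RCU_WATCHING;
}

/* Exit side: leave the kernel, dropping the watching bit. */
static void ct_kernel_exit(int cpu)
{
	atomic_store(&ct_state[cpu], 0);
}
```

In this model, once a sender succeeds in setting CONTEXT_WORK_SYNC, the IPI is skipped and sync_core() runs at the target's next kernel entry instead of interrupting userspace.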
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
arch/x86/include/asm/context_tracking_work.h | 6 ++--
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/kernel/alternative.c | 33 +++++++++++++++++---
arch/x86/kernel/kprobes/core.c | 4 +--
arch/x86/kernel/kprobes/opt.c | 4 +--
arch/x86/kernel/module.c | 2 +-
include/linux/context_tracking_work.h | 4 +--
7 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 5bc29e6b2ed38..2c66687ce00e2 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -2,11 +2,13 @@
#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
#define _ASM_X86_CONTEXT_TRACKING_WORK_H
+#include <asm/sync_core.h>
+
static __always_inline void arch_context_tracking_work(int work)
{
switch (work) {
- case CONTEXT_WORK_n:
- // Do work...
+ case CONTEXT_WORK_SYNC:
+ sync_core();
break;
}
}
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e34de36cab61e..37344e95f738f 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -33,6 +33,7 @@ extern void apply_relocation(u8 *buf, const u8 * const instr, size_t instrlen, u
*/
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern void text_poke_sync(void);
+extern void text_poke_sync_deferrable(void);
extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 954c4c0f7fc58..4ce224d927b03 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
#include <linux/mmu_context.h>
#include <linux/bsearch.h>
#include <linux/sync_core.h>
+#include <linux/context_tracking.h>
#include <asm/text-patching.h>
#include <asm/alternative.h>
#include <asm/sections.h>
@@ -2080,9 +2081,24 @@ static void do_sync_core(void *info)
sync_core();
}
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+ return !ct_set_cpu_work(cpu, CONTEXT_WORK_SYNC);
+}
+
+static void __text_poke_sync(smp_cond_func_t cond_func)
+{
+ on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
void text_poke_sync(void)
{
- on_each_cpu(do_sync_core, NULL, 1);
+ __text_poke_sync(NULL);
+}
+
+void text_poke_sync_deferrable(void)
+{
+ __text_poke_sync(do_sync_core_defer_cond);
}
/*
@@ -2257,6 +2273,8 @@ static int tp_vec_nr;
static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
{
unsigned char int3 = INT3_INSN_OPCODE;
+ bool force_ipi = false;
+ void (*sync_fn)(void);
unsigned int i;
int do_sync;
@@ -2291,11 +2309,18 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
* First step: add a int3 trap to the address that will be patched.
*/
for (i = 0; i < nr_entries; i++) {
+ /*
+ * Record that we need to send the IPI if at least one location
+ * in the batch requires it.
+ */
+ force_ipi |= tp[i].force_ipi;
tp[i].old = *(u8 *)text_poke_addr(&tp[i]);
text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
}
- text_poke_sync();
+ sync_fn = force_ipi ? text_poke_sync : text_poke_sync_deferrable;
+
+ sync_fn();
/*
* Second step: update all but the first byte of the patched range.
@@ -2357,7 +2382,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
* not necessary and we'd be safe even without it. But
* better safe than sorry (plus there's not only Intel).
*/
- text_poke_sync();
+ sync_fn();
}
/*
@@ -2378,7 +2403,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
}
if (do_sync)
- text_poke_sync();
+ sync_fn();
/*
* Remove and wait for refs to be zero.
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 72e6a45e7ec24..c2fd2578fd5fc 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -817,7 +817,7 @@ void arch_arm_kprobe(struct kprobe *p)
u8 int3 = INT3_INSN_OPCODE;
text_poke(p->addr, &int3, 1);
- text_poke_sync();
+ text_poke_sync_deferrable();
perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
}
@@ -827,7 +827,7 @@ void arch_disarm_kprobe(struct kprobe *p)
perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
text_poke(p->addr, &p->opcode, 1);
- text_poke_sync();
+ text_poke_sync_deferrable();
}
void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 36d6809c6c9e1..b2ce4d9c3ba56 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -513,11 +513,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
JMP32_INSN_SIZE - INT3_INSN_SIZE);
text_poke(addr, new, INT3_INSN_SIZE);
- text_poke_sync();
+ text_poke_sync_deferrable();
text_poke(addr + INT3_INSN_SIZE,
new + INT3_INSN_SIZE,
JMP32_INSN_SIZE - INT3_INSN_SIZE);
- text_poke_sync();
+ text_poke_sync_deferrable();
perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
}
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 837450b6e882f..00e71ad30c01d 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -191,7 +191,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
write, apply);
if (!early) {
- text_poke_sync();
+ text_poke_sync_deferrable();
mutex_unlock(&text_mutex);
}
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index fb74db8876dd2..13fc97b395030 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -5,12 +5,12 @@
#include <linux/bitops.h>
enum {
- CONTEXT_WORK_n_OFFSET,
+ CONTEXT_WORK_SYNC_OFFSET,
CONTEXT_WORK_MAX_OFFSET
};
enum ct_work {
- CONTEXT_WORK_n = BIT(CONTEXT_WORK_n_OFFSET),
+ CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET),
CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET)
};
--
2.43.0
* [RFC PATCH v3 13/15] context_tracking,x86: Add infrastructure to defer kernel TLBI
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (11 preceding siblings ...)
2024-11-19 15:34 ` [RFC PATCH v3 12/15] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
@ 2024-11-19 15:35 ` Valentin Schneider
2024-11-19 15:35 ` [RFC PATCH v3 14/15] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
` (2 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:35 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
Kernel TLB invalidation IPIs are a common source of interference on
NOHZ_FULL CPUs. Given NOHZ_FULL CPUs executing in userspace are not
accessing any kernel addresses, these invalidations do not need to happen
immediately, and can be deferred until the next user->kernel transition.
Add a minimal, noinstr-compliant variant of __flush_tlb_all() that doesn't
try to leverage INVPCID. To achieve this:
o Add a noinstr variant of native_write_cr4(), keeping native_write_cr4() as
"only" __no_profile.
o Make invalidate_user_asid() __always_inline
XXX: what about paravirt? Should these be instead new operations made
available through pv_ops.mmu.*? x86_64_start_kernel() uses
__native_tlb_flush_global() regardless of paravirt, so I'm thinking the
paravirt variants are more optimizations than hard requirements?
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
arch/x86/include/asm/context_tracking_work.h | 4 +++
arch/x86/include/asm/special_insns.h | 1 +
arch/x86/include/asm/tlbflush.h | 16 ++++++++++--
arch/x86/kernel/cpu/common.c | 6 ++++-
arch/x86/mm/tlb.c | 26 ++++++++++++++++++--
include/linux/context_tracking_work.h | 2 ++
6 files changed, 50 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 2c66687ce00e2..9d4f021b5a45b 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
#define _ASM_X86_CONTEXT_TRACKING_WORK_H
#include <asm/sync_core.h>
+#include <asm/tlbflush.h>
static __always_inline void arch_context_tracking_work(int work)
{
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(int work)
case CONTEXT_WORK_SYNC:
sync_core();
break;
+ case CONTEXT_WORK_TLBI:
+ __flush_tlb_all_noinstr();
+ break;
}
}
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index aec6e2d3aa1d5..b97157a69d48e 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -74,6 +74,7 @@ static inline unsigned long native_read_cr4(void)
return val;
}
+void native_write_cr4_noinstr(unsigned long val);
void native_write_cr4(unsigned long val);
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 69e79fff41b80..a653b5f47f0e6 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -18,6 +18,7 @@
DECLARE_PER_CPU(u64, tlbstate_untag_mask);
void __flush_tlb_all(void);
+void noinstr __flush_tlb_all_noinstr(void);
#define TLB_FLUSH_ALL -1UL
#define TLB_GENERATION_INVALID 0
@@ -418,9 +419,20 @@ static inline void cpu_tlbstate_update_lam(unsigned long lam, u64 untag_mask)
#endif
#endif /* !MODULE */
+#define __NATIVE_TLB_FLUSH_GLOBAL(suffix, cr4) \
+ native_write_cr4##suffix(cr4 ^ X86_CR4_PGE); \
+ native_write_cr4##suffix(cr4)
+#define NATIVE_TLB_FLUSH_GLOBAL(cr4) __NATIVE_TLB_FLUSH_GLOBAL(, cr4)
+#define NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4) __NATIVE_TLB_FLUSH_GLOBAL(_noinstr, cr4)
+
static inline void __native_tlb_flush_global(unsigned long cr4)
{
- native_write_cr4(cr4 ^ X86_CR4_PGE);
- native_write_cr4(cr4);
+ NATIVE_TLB_FLUSH_GLOBAL(cr4);
}
+
+static inline void __native_tlb_flush_global_noinstr(unsigned long cr4)
+{
+ NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4);
+}
+
#endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a5f221ea56888..a84bb8511650b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -424,7 +424,7 @@ void native_write_cr0(unsigned long val)
}
EXPORT_SYMBOL(native_write_cr0);
-void __no_profile native_write_cr4(unsigned long val)
+noinstr void native_write_cr4_noinstr(unsigned long val)
{
unsigned long bits_changed = 0;
@@ -442,6 +442,10 @@ void __no_profile native_write_cr4(unsigned long val)
bits_changed);
}
}
+void native_write_cr4(unsigned long val)
+{
+ native_write_cr4_noinstr(val);
+}
#if IS_MODULE(CONFIG_LKDTM)
EXPORT_SYMBOL_GPL(native_write_cr4);
#endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 86593d1b787d8..973a4ab3f53b3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -256,7 +256,7 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
*
* See SWITCH_TO_USER_CR3.
*/
-static inline void invalidate_user_asid(u16 asid)
+static __always_inline void invalidate_user_asid(u16 asid)
{
/* There is no user ASID if address space separation is off */
if (!IS_ENABLED(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION))
@@ -1198,7 +1198,7 @@ STATIC_NOPV void native_flush_tlb_global(void)
/*
* Flush the entire current user mapping
*/
-STATIC_NOPV void native_flush_tlb_local(void)
+static noinstr void native_flush_tlb_local_noinstr(void)
{
/*
* Preemption or interrupts must be disabled to protect the access
@@ -1213,6 +1213,11 @@ STATIC_NOPV void native_flush_tlb_local(void)
native_write_cr3(__native_read_cr3());
}
+STATIC_NOPV void native_flush_tlb_local(void)
+{
+ native_flush_tlb_local_noinstr();
+}
+
void flush_tlb_local(void)
{
__flush_tlb_local();
@@ -1240,6 +1245,23 @@ void __flush_tlb_all(void)
}
EXPORT_SYMBOL_GPL(__flush_tlb_all);
+void noinstr __flush_tlb_all_noinstr(void)
+{
+ /*
+ * This is for invocation in early entry code that cannot be
+ * instrumented. A RMW to CR4 works for most cases, but relies on
+ * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
+ * would require also resetting CR3.PCID, so just try with CR4.PGE, else
+ * do the CR3 write.
+ *
+ * XXX: this gives paravirt the finger.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_PGE))
+ __native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
+ else
+ native_flush_tlb_local_noinstr();
+}
+
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
struct flush_tlb_info *info;
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 13fc97b395030..47d5ced39a43a 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -6,11 +6,13 @@
enum {
CONTEXT_WORK_SYNC_OFFSET,
+ CONTEXT_WORK_TLBI_OFFSET,
CONTEXT_WORK_MAX_OFFSET
};
enum ct_work {
CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET),
+ CONTEXT_WORK_TLBI = BIT(CONTEXT_WORK_TLBI_OFFSET),
CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET)
};
--
2.43.0
* [RFC PATCH v3 14/15] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (12 preceding siblings ...)
2024-11-19 15:35 ` [RFC PATCH v3 13/15] context_tracking,x86: Add infrastructure to defer kernel TLBI Valentin Schneider
@ 2024-11-19 15:35 ` Valentin Schneider
2024-11-19 15:35 ` [RFC PATCH v3 15/15] context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE Valentin Schneider
2024-11-19 16:45 ` [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Steven Rostedt
15 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:35 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
vunmap()s issued from housekeeping CPUs are a relatively common source of
interference for isolated NOHZ_FULL CPUs, as they are hit by the
flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc
range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone
===================================
This requires a guarantee that nothing in the vmalloc range can be vunmap'd
and then accessed in early entry code.
Vmalloc uses are, as reported by vmallocinfo:
$ cat /proc/vmallocinfo | awk '{ print $3 }' | sort | uniq
__pci_enable_msix_range+0x32b/0x560
acpi_os_map_iomem+0x22d/0x250
bpf_prog_alloc_no_stats+0x34/0x140
fork_idle+0x79/0x120
gen_pool_add_owner+0x3e/0xb0 ?
hpet_enable+0xbf/0x470
irq_init_percpu_irqstack+0x129/0x160
kernel_clone+0xab/0x3b0
memremap+0x164/0x330
n_tty_open+0x18/0xa0
pcpu_create_chunk+0x4e/0x1b0
pcpu_create_chunk+0x75/0x1b0
pcpu_get_vm_areas+0x0/0x1100
unpurged
vp_modern_map_capability+0x140/0x270
zisofs_init+0x16/0x30
I've categorized these as:
a) Device or percpu mappings
For these to be unmapped, the device (or CPU) has to be removed and any
associated IRQ freed. Even if the IRQ isn't freed, context tracking entry
happens before handling the IRQ itself, per irqentry_enter() usage.
__pci_enable_msix_range()
acpi_os_map_iomem()
irq_init_percpu_irqstack() (not even unmapped when CPU is hot-unplugged!)
memremap()
n_tty_open()
pcpu_create_chunk()
pcpu_get_vm_areas()
vp_modern_map_capability()
b) CONFIG_VMAP_STACK
fork_idle() & kernel_clone()
vmalloc'd kernel stacks are AFAICT a safe example, as a task running in
userspace needs to enter kernelspace to execute do_exit() before its
stack can be vfree'd.
c) Non-issues
bpf_prog_alloc_no_stats() - early entry is noinstr, no BPF!
hpet_enable() - hpet_clear_mapping() is only called if the __init function
fails, no runtime worries
zisofs_init() - used for zisofs block decompression, that's way past
context tracking entry
d) I'm not sure, have a look?
gen_pool_add_owner() - AIUI this is mainly for PCI / DMA stuff, which
again I wouldn't expect to be accessed before
context tracking entry.
Changes
=======
Blindly deferring any and all flush of the kernel mappings is a risky move,
so introduce a variant of flush_tlb_kernel_range() that explicitly allows
deferral. Use it for vunmap flushes.
Note that while flush_tlb_kernel_range() may end up issuing a full
flush (including user mappings), this only happens when reaching an
invalidation range threshold where it is cheaper to do a full flush than to
individually invalidate each page in the range via INVLPG. IOW, it doesn't
*require* invalidating user mappings, and thus remains safe to defer until
a later kernel entry.
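The threshold decision described above can be sketched as a standalone predicate. This is illustrative only: the ceiling value is an assumption here, and in the kernel tlb_single_page_flush_ceiling is a tunable.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT    12
#define TLB_FLUSH_ALL (~0UL)

/* Assumed default; the real kernel exposes this as a runtime tunable. */
static unsigned long tlb_single_page_flush_ceiling = 33;

/*
 * Returns true when flush_tlb_kernel_range() would fall back to a full
 * flush (which may also drop user mappings) because individually
 * invalidating each page in [start, end) via INVLPG would cost more.
 */
static bool want_full_flush(unsigned long start, unsigned long end)
{
	return end == TLB_FLUSH_ALL ||
	       (end - start) > (tlb_single_page_flush_ceiling << PAGE_SHIFT);
}
```

The point of the sketch: the full flush is a cost heuristic, never a correctness requirement, which is why the whole operation remains safe to defer to the next kernel entry.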
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
arch/x86/include/asm/tlbflush.h | 1 +
arch/x86/mm/tlb.c | 23 +++++++++++++++++++---
mm/vmalloc.c | 35 ++++++++++++++++++++++++++++-----
3 files changed, 51 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index a653b5f47f0e6..d89345c85fa9c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -249,6 +249,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned int stride_shift,
bool freed_tables);
extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
{
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 973a4ab3f53b3..bf6ff16a1a523 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
#include <linux/task_work.h>
#include <linux/mmu_notifier.h>
#include <linux/mmu_context.h>
+#include <linux/context_tracking.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -1041,6 +1042,11 @@ static void do_flush_tlb_all(void *info)
__flush_tlb_all();
}
+static bool do_kernel_flush_defer_cond(int cpu, void *info)
+{
+ return !ct_set_cpu_work(cpu, CONTEXT_WORK_TLBI);
+}
+
void flush_tlb_all(void)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1057,12 +1063,13 @@ static void do_kernel_range_flush(void *info)
flush_tlb_one_kernel(addr);
}
-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond_func, unsigned long start, unsigned long end)
{
/* Balance as user space task's flush, a bit conservative */
if (end == TLB_FLUSH_ALL ||
(end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
- on_each_cpu(do_flush_tlb_all, NULL, 1);
+ on_each_cpu_cond(cond_func, do_flush_tlb_all, NULL, 1);
} else {
struct flush_tlb_info *info;
@@ -1070,13 +1077,23 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
info = get_flush_tlb_info(NULL, start, end, 0, false,
TLB_GENERATION_INVALID);
- on_each_cpu(do_kernel_range_flush, info, 1);
+ on_each_cpu_cond(cond_func, do_kernel_range_flush, info, 1);
put_flush_tlb_info();
preempt_enable();
}
}
+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+ __flush_tlb_kernel_range(NULL, start, end);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+ __flush_tlb_kernel_range(do_kernel_flush_defer_cond, start, end);
+}
+
/*
* This can be used from process context to figure out what the value of
* CR3 is without needing to do a (slow) __read_cr3().
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 634162271c004..02838c515ce2c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -467,6 +467,31 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
__vunmap_range_noflush(start, end);
}
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+/*
+ * !!! BIG FAT WARNING !!!
+ *
+ * The CPU is free to cache any part of the paging hierarchy it wants at any
+ * time. It's also free to set accessed and dirty bits at any time, even for
+ * instructions that may never execute architecturally.
+ *
+ * This means that deferring a TLB flush affecting freed page-table-pages (IOW,
+ * keeping them in a CPU's paging hierarchy cache) is akin to dancing in a
+ * minefield.
+ *
+ * This isn't a problem for deferral of TLB flushes in vmalloc, because
+ * page-table-pages used for vmap() mappings are never freed - see how
+ * __vunmap_range_noflush() walks the whole mapping but only clears the leaf PTEs.
+ * If this ever changes, TLB flush deferral will cause misery.
+ */
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+ flush_tlb_kernel_range(start, end);
+}
+#else
+#define flush_tlb_kernel_range_deferrable(start, end) flush_tlb_kernel_range(start, end)
+#endif
+
/**
* vunmap_range - unmap kernel virtual addresses
* @addr: start of the VM area to unmap
@@ -480,7 +505,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
{
flush_cache_vunmap(addr, end);
vunmap_range_noflush(addr, end);
- flush_tlb_kernel_range(addr, end);
+ flush_tlb_kernel_range_deferrable(addr, end);
}
static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -2265,7 +2290,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
nr_purge_nodes = cpumask_weight(&purge_nodes);
if (nr_purge_nodes > 0) {
- flush_tlb_kernel_range(start, end);
+ flush_tlb_kernel_range_deferrable(start, end);
/* One extra worker is per a lazy_max_pages() full set minus one. */
nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
@@ -2368,7 +2393,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
flush_cache_vunmap(va->va_start, va->va_end);
vunmap_range_noflush(va->va_start, va->va_end);
if (debug_pagealloc_enabled_static())
- flush_tlb_kernel_range(va->va_start, va->va_end);
+ flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);
free_vmap_area_noflush(va);
}
@@ -2816,7 +2841,7 @@ static void vb_free(unsigned long addr, unsigned long size)
vunmap_range_noflush(addr, addr + size);
if (debug_pagealloc_enabled_static())
- flush_tlb_kernel_range(addr, addr + size);
+ flush_tlb_kernel_range_deferrable(addr, addr + size);
spin_lock(&vb->lock);
@@ -2881,7 +2906,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
free_purged_blocks(&purge_list);
if (!__purge_vmap_area_lazy(start, end, false) && flush)
- flush_tlb_kernel_range(start, end);
+ flush_tlb_kernel_range_deferrable(start, end);
mutex_unlock(&vmap_purge_lock);
}
--
2.43.0
* [RFC PATCH v3 15/15] context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (13 preceding siblings ...)
2024-11-19 15:35 ` [RFC PATCH v3 14/15] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
@ 2024-11-19 15:35 ` Valentin Schneider
2024-11-19 16:45 ` [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Steven Rostedt
15 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 15:35 UTC (permalink / raw)
To: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest
Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these
transitions:
ct_idle_enter()
ct_kernel_exit()
ct_state_inc_clear_work()
ct_idle_exit()
ct_kernel_enter()
ct_work_flush()
With just CONTEXT_TRACKING_IDLE, ct_state_inc_clear_work() is just
ct_state_inc() and ct_work_flush() is a no-op. However, making them
functional as if under CONTEXT_TRACKING_WORK would allow NO_HZ_IDLE to
leverage IPI deferral to keep idle CPUs idle longer.
The argument for enabling this under NO_HZ_IDLE differs from the NO_HZ_FULL
one (power savings vs latency/performance), but the backing mechanism is
identical.
Add a default-no option to enable IPI deferral with NO_HZ_IDLE.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
kernel/time/Kconfig | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 04efc2b605823..a3cb903a4e022 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -188,9 +188,23 @@ config CONTEXT_TRACKING_USER_FORCE
config CONTEXT_TRACKING_WORK
bool
- depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+ depends on HAVE_CONTEXT_TRACKING_WORK && (CONTEXT_TRACKING_USER || CONTEXT_TRACKING_WORK_IDLE)
default y
+config CONTEXT_TRACKING_WORK_IDLE
+ bool
+ depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_IDLE && !CONTEXT_TRACKING_USER
+ default n
+ help
+ This option enables deferral of some IPIs when they are targeted at CPUs
+ that are idle. This can help keep CPUs idle longer, but induces some
+ extra overhead to idle <-> kernel transitions and to IPI sending.
+
+ Say Y if the power improvements are worth more to you than the added
+ overheads.
+
+ Say N otherwise.
+
config NO_HZ
bool "Old Idle dynticks config"
help
--
2.43.0
* Re: [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
` (14 preceding siblings ...)
2024-11-19 15:35 ` [RFC PATCH v3 15/15] context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE Valentin Schneider
@ 2024-11-19 16:45 ` Steven Rostedt
2024-11-19 22:51 ` Valentin Schneider
15 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2024-11-19 16:45 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
On Tue, 19 Nov 2024 16:34:47 +0100
Valentin Schneider <vschneid@redhat.com> wrote:
> Context
> =======
>
> We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
> pure-userspace application get regularly interrupted by IPIs sent from
> housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
> leading to various on_each_cpu() calls, e.g.:
FYI,
Sending a patch series at the start of the merge window is likely going to
get ignored.
Care to resend after rc1 is released? Or at least ping about it.
-- Steve
* Re: [RFC PATCH v3 01/15] objtool: Make validate_call() recognize indirect calls to pv_ops[]
2024-11-19 15:34 ` [RFC PATCH v3 01/15] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
@ 2024-11-19 20:38 ` Josh Poimboeuf
0 siblings, 0 replies; 24+ messages in thread
From: Josh Poimboeuf @ 2024-11-19 20:38 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest, Steven Rostedt, Masami Hiramatsu,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
On Tue, Nov 19, 2024 at 04:34:48PM +0100, Valentin Schneider wrote:
> call_dest_name() does not get passed the file pointer of validate_call(),
> which means its invocation of insn_reloc() will always return NULL. Make it
> take a file pointer.
>
> While at it, make sure call_dest_name() uses arch_dest_reloc_offset(),
> otherwise it gets the pv_ops[] offset wrong.
>
> Fabricating an intentional warning shows the change; previously:
>
> vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
>
> now:
>
> vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[1]() leaves .noinstr.text section
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
--
Josh
* Re: [RFC PATCH v3 02/15] objtool: Flesh out warning related to pv_ops[] calls
2024-11-19 15:34 ` [RFC PATCH v3 02/15] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
@ 2024-11-19 20:38 ` Josh Poimboeuf
0 siblings, 0 replies; 24+ messages in thread
From: Josh Poimboeuf @ 2024-11-19 20:38 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest, Steven Rostedt, Masami Hiramatsu,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
On Tue, Nov 19, 2024 at 04:34:49PM +0100, Valentin Schneider wrote:
> I had to look into objtool itself to understand what this warning was
> about; make it more explicit.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
--
Josh
* Re: [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition
2024-11-19 16:45 ` [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Steven Rostedt
@ 2024-11-19 22:51 ` Valentin Schneider
0 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2024-11-19 22:51 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
On 19/11/24 11:45, Steven Rostedt wrote:
> On Tue, 19 Nov 2024 16:34:47 +0100
> Valentin Schneider <vschneid@redhat.com> wrote:
>
>> Context
>> =======
>>
>> We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
>> pure-userspace application get regularly interrupted by IPIs sent from
>> housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
>> leading to various on_each_cpu() calls, e.g.:
>
> FYI,
>
> Sending a patch series at the start of the merge window is likely going to
> get ignored.
>
> Care to resend after rc1 is released? Or at least ping about it.
>
Heh, I was so far down "stop rewriting changelogs for the Nth time, get it
in a decent shape and send it out" that I forgot about this "small little
detail". Thanks for the reminder.
* Re: [RFC PATCH v3 06/15] jump_label: Add forceful jump label type
2024-11-19 15:34 ` [RFC PATCH v3 06/15] jump_label: Add forceful jump label type Valentin Schneider
@ 2024-11-19 23:39 ` Josh Poimboeuf
2024-11-20 0:05 ` Josh Poimboeuf
1 sibling, 0 replies; 24+ messages in thread
From: Josh Poimboeuf @ 2024-11-19 23:39 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest, Steven Rostedt, Masami Hiramatsu,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
On Tue, Nov 19, 2024 at 04:34:53PM +0100, Valentin Schneider wrote:
> Later commits will cause objtool to warn about non __ro_after_init static
> keys being used in .noinstr sections in order to safely defer instruction
> patching IPIs targeted at NOHZ_FULL CPUs.
Don't we need similar checking for static calls?
> Two such keys currently exist: mds_idle_clear and __sched_clock_stable,
> which can both be modified at runtime.
Not sure if feasible, but it sure would be a lot simpler to just make
"no noinstr patching" a hard rule and then convert the above keys (or at
least their noinstr-specific usage) to regular branches.
Then "no noinstr patching" could be unilaterally enforced in
text_poke_bp().
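Such a rule would boil down to a range check against the .noinstr section bounds. A minimal user-space sketch of the idea follows; the section bounds and the `text_poke_allowed()` helper are illustrative stand-ins, not actual kernel symbols (the kernel would compare against `__noinstr_text_start`/`__noinstr_text_end`):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the kernel's __noinstr_text_start/__noinstr_text_end
 * linker-script section symbols. */
static uintptr_t noinstr_text_start = 0x1000;
static uintptr_t noinstr_text_end   = 0x2000;

static bool in_noinstr_text(uintptr_t addr)
{
	return addr >= noinstr_text_start && addr < noinstr_text_end;
}

/* A text_poke_bp()-style entry point could then unconditionally refuse
 * to patch any address landing in .noinstr: */
static bool text_poke_allowed(uintptr_t addr)
{
	return !in_noinstr_text(addr);
}
```

With such a check in place, the objtool warnings become a build-time complement to a runtime guarantee rather than the only line of defense.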
> diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
> index f5a2727ca4a9a..93e729545b941 100644
> --- a/include/linux/jump_label.h
> +++ b/include/linux/jump_label.h
> @@ -200,7 +200,8 @@ struct module;
> #define JUMP_TYPE_FALSE 0UL
> #define JUMP_TYPE_TRUE 1UL
> #define JUMP_TYPE_LINKED 2UL
> -#define JUMP_TYPE_MASK 3UL
> +#define JUMP_TYPE_FORCEFUL 4UL
JUMP_TYPE_NOINSTR_ALLOWED ?
--
Josh
* Re: [RFC PATCH v3 06/15] jump_label: Add forceful jump label type
2024-11-19 15:34 ` [RFC PATCH v3 06/15] jump_label: Add forceful jump label type Valentin Schneider
2024-11-19 23:39 ` Josh Poimboeuf
@ 2024-11-20 0:05 ` Josh Poimboeuf
2024-11-20 10:22 ` Peter Zijlstra
1 sibling, 1 reply; 24+ messages in thread
From: Josh Poimboeuf @ 2024-11-20 0:05 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest, Steven Rostedt, Masami Hiramatsu,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
On Tue, Nov 19, 2024 at 04:34:53PM +0100, Valentin Schneider wrote:
> +++ b/include/linux/jump_label.h
> @@ -200,7 +200,8 @@ struct module;
> #define JUMP_TYPE_FALSE 0UL
> #define JUMP_TYPE_TRUE 1UL
> #define JUMP_TYPE_LINKED 2UL
> -#define JUMP_TYPE_MASK 3UL
> +#define JUMP_TYPE_FORCEFUL 4UL
> +#define JUMP_TYPE_MASK 7UL
Hm, I don't think we can (ab)use this pointer bit on 32-bit arches, as
the address could be only 4-byte aligned?
--
Josh
* Re: [RFC PATCH v3 06/15] jump_label: Add forceful jump label type
2024-11-20 0:05 ` Josh Poimboeuf
@ 2024-11-20 10:22 ` Peter Zijlstra
0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2024-11-20 10:22 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Valentin Schneider, linux-kernel, linux-doc, kvm, linux-mm, bpf,
x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Jason Baron, Kees Cook, Sami Tolvanen,
Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
On Tue, Nov 19, 2024 at 04:05:32PM -0800, Josh Poimboeuf wrote:
> On Tue, Nov 19, 2024 at 04:34:53PM +0100, Valentin Schneider wrote:
> > +++ b/include/linux/jump_label.h
> > @@ -200,7 +200,8 @@ struct module;
> > #define JUMP_TYPE_FALSE 0UL
> > #define JUMP_TYPE_TRUE 1UL
> > #define JUMP_TYPE_LINKED 2UL
> > -#define JUMP_TYPE_MASK 3UL
> > +#define JUMP_TYPE_FORCEFUL 4UL
> > +#define JUMP_TYPE_MASK 7UL
>
> Hm, I don't think we can (ab)use this pointer bit on 32-bit arches, as
> the address could be 4 byte aligned?
Right, you can force the alignment of the thing; workqueues do similar
hacks to get more bits.
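The trick being referred to: forcing the object's alignment guarantees enough zero low bits in its address to stash tag bits, even on 32-bit where the natural alignment may only be 4 bytes. A stand-alone sketch under that assumption (the struct and helper names are illustrative, not kernel code):

```c
#include <stdalign.h>
#include <stdint.h>

/* Force 8-byte alignment so the low 3 bits of any pointer to this
 * struct are always zero and can carry type flags, mirroring the
 * JUMP_TYPE_MASK 7UL scheme discussed above. */
struct key_entry {
	alignas(8) int enabled;
};

#define KEY_TYPE_MASK 7UL	/* three tag bits */

/* Stash tag bits in the (guaranteed-zero) low bits of the pointer. */
static uintptr_t key_pack(struct key_entry *k, unsigned long type)
{
	return (uintptr_t)k | (type & KEY_TYPE_MASK);
}

/* Mask the tag bits back out to recover the real pointer. */
static struct key_entry *key_unpack(uintptr_t v)
{
	return (struct key_entry *)(v & ~KEY_TYPE_MASK);
}

static unsigned long key_type(uintptr_t v)
{
	return v & KEY_TYPE_MASK;
}
```

The cost is at most a few bytes of padding per object, in exchange for an extra usable bit regardless of the architecture's pointer alignment.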
* Re: [RFC PATCH v3 11/15] context-tracking: Introduce work deferral infrastructure
2024-11-19 15:34 ` [RFC PATCH v3 11/15] context-tracking: Introduce work deferral infrastructure Valentin Schneider
@ 2024-11-20 10:54 ` Frederic Weisbecker
0 siblings, 0 replies; 24+ messages in thread
From: Frederic Weisbecker @ 2024-11-20 10:54 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, linux-doc, kvm, linux-mm, bpf, x86, rcu,
linux-kselftest, Nicolas Saenz Julienne, Steven Rostedt,
Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
Yair Podemsky, Daniel Wagner, Petr Tesarik
On Tue, Nov 19, 2024 at 04:34:58PM +0100, Valentin Schneider wrote:
> +bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
> +{
> +	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
> +	unsigned int old;
> +	bool ret = false;
> +
> +	preempt_disable();
> +
> +	old = atomic_read(&ct->state);
> +	/*
> +	 * Try setting the work until either
> +	 * - the target CPU has entered kernelspace
> +	 * - the work has been set
> +	 */
> +	do {
> +		ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START));
> +	} while (!ret && ((old & CT_STATE_MASK) != CT_STATE_KERNEL));
> +
> +	preempt_enable();
> +	return ret;
Does it ignore the IPI even if:

	(ret && (old & CT_STATE_MASK) == CT_STATE_KERNEL)

?
And what about CT_STATE_IDLE?
Is the work ignored in those two cases?
But wouldn't it be cleaner to never set the work if the target is in any
state other than CT_STATE_USER? Then the work wouldn't need to be cleared
on kernel exit but rather on kernel entry.
That is:
bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
{
	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
	unsigned int old;
	bool ret = false;

	preempt_disable();

	old = atomic_read(&ct->state);

	/* Start with our best wishes */
	old &= ~CT_STATE_MASK;
	old |= CT_STATE_USER;

	/*
	 * Try setting the work until either
	 * - the target CPU has exited userspace
	 * - the work has been set
	 */
	do {
		ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START));
	} while (!ret && ((old & CT_STATE_MASK) == CT_STATE_USER));

	preempt_enable();
	return ret;
}
Thanks.
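The protocol sketched above can be exercised outside the kernel with C11 atomics; in this hedged stand-alone model, the constants are illustrative stand-ins for the kernel's CT_* definitions, a single global replaces the per-CPU state, and `atomic_compare_exchange_strong()` plays the role of `atomic_try_cmpxchg()`:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's context-tracking constants. */
#define CT_STATE_KERNEL	0u
#define CT_STATE_USER	1u
#define CT_STATE_MASK	3u
#define CT_WORK_START	2

static atomic_uint ct_state;

/*
 * Set deferred-work bits only while the target CPU stays in
 * CT_STATE_USER; returning false tells the caller it must send a
 * real IPI instead.
 */
static bool ct_set_cpu_work(unsigned int work)
{
	unsigned int old = atomic_load(&ct_state);
	bool ret;

	/* Start with our best wishes: assume the CPU is in userspace. */
	old = (old & ~CT_STATE_MASK) | CT_STATE_USER;

	do {
		ret = atomic_compare_exchange_strong(&ct_state, &old,
				old | (work << CT_WORK_START));
	} while (!ret && (old & CT_STATE_MASK) == CT_STATE_USER);

	return ret;
}
```

On failure the compare-exchange refreshes `old` with the actual state, so the loop only retries while the target genuinely remains in userspace; any other state breaks out immediately with `ret == false`.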
Thread overview: 24+ messages
2024-11-19 15:34 [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 01/15] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
2024-11-19 20:38 ` Josh Poimboeuf
2024-11-19 15:34 ` [RFC PATCH v3 02/15] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
2024-11-19 20:38 ` Josh Poimboeuf
2024-11-19 15:34 ` [RFC PATCH v3 03/15] sched/clock: Make sched_clock_running __ro_after_init Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 04/15] rcu: Add a small-width RCU watching counter debug option Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 05/15] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 06/15] jump_label: Add forceful jump label type Valentin Schneider
2024-11-19 23:39 ` Josh Poimboeuf
2024-11-20 0:05 ` Josh Poimboeuf
2024-11-20 10:22 ` Peter Zijlstra
2024-11-19 15:34 ` [RFC PATCH v3 07/15] x86/speculation/mds: Make mds_idle_clear forceful Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 08/15] sched/clock, x86: Make __sched_clock_stable forceful Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 09/15] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 10/15] x86/alternatives: Record text_poke's of JUMP_TYPE_FORCEFUL labels Valentin Schneider
2024-11-19 15:34 ` [RFC PATCH v3 11/15] context-tracking: Introduce work deferral infrastructure Valentin Schneider
2024-11-20 10:54 ` Frederic Weisbecker
2024-11-19 15:34 ` [RFC PATCH v3 12/15] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
2024-11-19 15:35 ` [RFC PATCH v3 13/15] context_tracking,x86: Add infrastructure to defer kernel TLBI Valentin Schneider
2024-11-19 15:35 ` [RFC PATCH v3 14/15] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
2024-11-19 15:35 ` [RFC PATCH v3 15/15] context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE Valentin Schneider
2024-11-19 16:45 ` [RFC PATCH v3 00/15] context_tracking,x86: Defer some IPIs until a user->kernel transition Steven Rostedt
2024-11-19 22:51 ` Valentin Schneider