* [PATCH v6 1/3] tracing: perf: Have perf tracepoint callbacks always disable preemption
2026-01-26 23:11 [PATCH v6 0/3] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast Steven Rostedt
@ 2026-01-26 23:11 ` Steven Rostedt
2026-01-26 23:11 ` [PATCH v6 2/3] bpf: Have __bpf_trace_run() use rcu_read_lock_dont_migrate() Steven Rostedt
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: Steven Rostedt @ 2026-01-26 23:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Paul E. McKenney, Sebastian Andrzej Siewior, Alexei Starovoitov
From: Steven Rostedt <rostedt@goodmis.org>
In preparation for converting tracepoints from being protected by a
preempt-disabled section to being protected by SRCU, have all the perf
callbacks disable preemption themselves, as perf expects preemption to
be disabled when processing tracepoints.

While at it, convert the perf system call callback's preempt_disable()
to a guard(preempt).
Link: https://lore.kernel.org/all/20250613152218.1924093-1-bigeasy@linutronix.de/
Link: https://patch.msgid.link/20260108220550.2f6638f3@fedora
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/trace/perf.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/trace/perf.h b/include/trace/perf.h
index a1754b73a8f5..348ad1d9b556 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -71,6 +71,7 @@ perf_trace_##call(void *__data, proto) \
u64 __count __attribute__((unused)); \
struct task_struct *__task __attribute__((unused)); \
\
+ guard(preempt_notrace)(); \
do_perf_trace_##call(__data, args); \
}
@@ -85,9 +86,8 @@ perf_trace_##call(void *__data, proto) \
struct task_struct *__task __attribute__((unused)); \
\
might_fault(); \
- preempt_disable_notrace(); \
+ guard(preempt_notrace)(); \
do_perf_trace_##call(__data, args); \
- preempt_enable_notrace(); \
}
/*
--
2.51.0
^ permalink raw reply related [flat|nested] 9+ messages in thread

* [PATCH v6 2/3] bpf: Have __bpf_trace_run() use rcu_read_lock_dont_migrate()
2026-01-26 23:11 [PATCH v6 0/3] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast Steven Rostedt
2026-01-26 23:11 ` [PATCH v6 1/3] tracing: perf: Have perf tracepoint callbacks always disable preemption Steven Rostedt
@ 2026-01-26 23:11 ` Steven Rostedt
2026-01-26 23:11 ` [PATCH v6 3/3] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast Steven Rostedt
2026-01-27 2:39 ` [PATCH v6 0/3] " Steven Rostedt
3 siblings, 0 replies; 9+ messages in thread
From: Steven Rostedt @ 2026-01-26 23:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Paul E. McKenney, Sebastian Andrzej Siewior, Alexei Starovoitov,
Alexei Starovoitov
From: Steven Rostedt <rostedt@goodmis.org>
In order to switch the protection of tracepoint callbacks from
preempt_disable() to srcu_read_lock_fast(), the BPF callback for
tracepoints needs migration prevention, as BPF programs expect to stay
on the same CPU while they execute. Combine the RCU protection with
migration prevention by using rcu_read_lock_dont_migrate() in
__bpf_trace_run(). This will allow tracepoint callbacks to be
preemptible.
Link: https://lore.kernel.org/all/CAADnVQKvY026HSFGOsavJppm3-Ajm-VsLzY-OeFUe+BaKMRnDg@mail.gmail.com/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/bpf_trace.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index fe28d86f7c35..abbf0177ad20 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2062,7 +2062,7 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
struct bpf_run_ctx *old_run_ctx;
struct bpf_trace_run_ctx run_ctx;
- cant_sleep();
+ rcu_read_lock_dont_migrate();
if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
bpf_prog_inc_misses_counter(prog);
goto out;
@@ -2071,13 +2071,12 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
run_ctx.bpf_cookie = link->cookie;
old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
- rcu_read_lock();
(void) bpf_prog_run(prog, args);
- rcu_read_unlock();
bpf_reset_run_ctx(old_run_ctx);
out:
this_cpu_dec(*(prog->active));
+ rcu_read_unlock_migrate();
}
#define UNPACK(...) __VA_ARGS__
--
2.51.0
^ permalink raw reply related [flat|nested] 9+ messages in thread

* [PATCH v6 3/3] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
2026-01-26 23:11 [PATCH v6 0/3] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast Steven Rostedt
2026-01-26 23:11 ` [PATCH v6 1/3] tracing: perf: Have perf tracepoint callbacks always disable preemption Steven Rostedt
2026-01-26 23:11 ` [PATCH v6 2/3] bpf: Have __bpf_trace_run() use rcu_read_lock_dont_migrate() Steven Rostedt
@ 2026-01-26 23:11 ` Steven Rostedt
2026-01-27 2:39 ` [PATCH v6 0/3] " Steven Rostedt
3 siblings, 0 replies; 9+ messages in thread
From: Steven Rostedt @ 2026-01-26 23:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Paul E. McKenney, Sebastian Andrzej Siewior, Alexei Starovoitov
From: Steven Rostedt <rostedt@goodmis.org>
The current use of guard(preempt_notrace)() within __DECLARE_TRACE()
to protect invocation of __DO_TRACE_CALL() means that BPF programs
attached to tracepoints are non-preemptible. This is unhelpful in
real-time systems, whose users apparently wish to use BPF while also
achieving low latencies. (Who knew?)
One option would be to use preemptible RCU, but this introduces
many opportunities for infinite recursion, which many consider to
be counterproductive, especially given the relatively small stacks
provided by the Linux kernel. These opportunities could be shut down
by sufficiently energetic duplication of code, but this sort of thing
is considered impolite in some circles.
Therefore, use the shiny new SRCU-fast API, which provides somewhat faster
readers than those of preemptible RCU, at least on Paul E. McKenney's
laptop, where task_struct access is more expensive than access to per-CPU
variables. And SRCU-fast provides way faster readers than does SRCU,
courtesy of being able to avoid the read-side use of smp_mb(). Also,
it is quite straightforward to create srcu_read_{,un}lock_fast_notrace()
functions.
Link: https://lore.kernel.org/all/20250613152218.1924093-1-bigeasy@linutronix.de/
Co-developed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v5: https://patch.msgid.link/20260108220550.2f6638f3@fedora

- Just change from preempt_disable() to srcu_fast() always.
  Do not do anything different for PREEMPT_RT.

- Now that BPF disables migration directly, do not have tracepoints
  disable migration in its code.
include/linux/tracepoint.h | 9 +++++----
include/trace/trace_events.h | 4 ++--
kernel/tracepoint.c | 18 ++++++++++++++----
3 files changed, 21 insertions(+), 10 deletions(-)
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 8a56f3278b1b..22ca1c8b54f3 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -108,14 +108,15 @@ void for_each_tracepoint_in_module(struct module *mod,
* An alternative is to use the following for batch reclaim associated
* with a given tracepoint:
*
- * - tracepoint_is_faultable() == false: call_rcu()
+ * - tracepoint_is_faultable() == false: call_srcu()
* - tracepoint_is_faultable() == true: call_rcu_tasks_trace()
*/
#ifdef CONFIG_TRACEPOINTS
+extern struct srcu_struct tracepoint_srcu;
static inline void tracepoint_synchronize_unregister(void)
{
synchronize_rcu_tasks_trace();
- synchronize_rcu();
+ synchronize_srcu(&tracepoint_srcu);
}
static inline bool tracepoint_is_faultable(struct tracepoint *tp)
{
@@ -275,13 +276,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
return static_branch_unlikely(&__tracepoint_##name.key);\
}
-#define __DECLARE_TRACE(name, proto, args, cond, data_proto) \
+#define __DECLARE_TRACE(name, proto, args, cond, data_proto) \
__DECLARE_TRACE_COMMON(name, PARAMS(proto), PARAMS(args), PARAMS(data_proto)) \
static inline void __do_trace_##name(proto) \
{ \
TRACEPOINT_CHECK(name) \
if (cond) { \
- guard(preempt_notrace)(); \
+ guard(srcu_fast_notrace)(&tracepoint_srcu); \
__DO_TRACE_CALL(name, TP_ARGS(args)); \
} \
} \
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 4f22136fd465..fbc07d353be6 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -436,6 +436,7 @@ __DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
static notrace void \
trace_event_raw_event_##call(void *__data, proto) \
{ \
+ guard(preempt_notrace)(); \
do_trace_event_raw_event_##call(__data, args); \
}
@@ -447,9 +448,8 @@ static notrace void \
trace_event_raw_event_##call(void *__data, proto) \
{ \
might_fault(); \
- preempt_disable_notrace(); \
+ guard(preempt_notrace)(); \
do_trace_event_raw_event_##call(__data, args); \
- preempt_enable_notrace(); \
}
/*
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 62719d2941c9..fd2ee879815c 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -34,9 +34,13 @@ enum tp_transition_sync {
struct tp_transition_snapshot {
unsigned long rcu;
+ unsigned long srcu_gp;
bool ongoing;
};
+DEFINE_SRCU_FAST(tracepoint_srcu);
+EXPORT_SYMBOL_GPL(tracepoint_srcu);
+
/* Protected by tracepoints_mutex */
static struct tp_transition_snapshot tp_transition_snapshot[_NR_TP_TRANSITION_SYNC];
@@ -46,6 +50,7 @@ static void tp_rcu_get_state(enum tp_transition_sync sync)
/* Keep the latest get_state snapshot. */
snapshot->rcu = get_state_synchronize_rcu();
+ snapshot->srcu_gp = start_poll_synchronize_srcu(&tracepoint_srcu);
snapshot->ongoing = true;
}
@@ -56,6 +61,8 @@ static void tp_rcu_cond_sync(enum tp_transition_sync sync)
if (!snapshot->ongoing)
return;
cond_synchronize_rcu(snapshot->rcu);
+ if (!poll_state_synchronize_srcu(&tracepoint_srcu, snapshot->srcu_gp))
+ synchronize_srcu(&tracepoint_srcu);
snapshot->ongoing = false;
}
@@ -112,10 +119,13 @@ static inline void release_probes(struct tracepoint *tp, struct tracepoint_func
struct tp_probes *tp_probes = container_of(old,
struct tp_probes, probes[0]);
- if (tracepoint_is_faultable(tp))
- call_rcu_tasks_trace(&tp_probes->rcu, rcu_free_old_probes);
- else
- call_rcu(&tp_probes->rcu, rcu_free_old_probes);
+ if (tracepoint_is_faultable(tp)) {
+ call_rcu_tasks_trace(&tp_probes->rcu,
+ rcu_free_old_probes);
+ } else {
+ call_srcu(&tracepoint_srcu, &tp_probes->rcu,
+ rcu_free_old_probes);
+ }
}
}
--
2.51.0
^ permalink raw reply related [flat|nested] 9+ messages in thread

* Re: [PATCH v6 0/3] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast
2026-01-26 23:11 [PATCH v6 0/3] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast Steven Rostedt
` (2 preceding siblings ...)
2026-01-26 23:11 ` [PATCH v6 3/3] tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast Steven Rostedt
@ 2026-01-27 2:39 ` Steven Rostedt
2026-01-27 23:18 ` Paul E. McKenney
3 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2026-01-27 2:39 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Paul E. McKenney, Sebastian Andrzej Siewior, Alexei Starovoitov
On Mon, 26 Jan 2026 18:11:45 -0500
Steven Rostedt <rostedt@kernel.org> wrote:
> The current use of guard(preempt_notrace)() within __DECLARE_TRACE()
> to protect invocation of __DO_TRACE_CALL() means that BPF programs
> attached to tracepoints are non-preemptible. This is unhelpful in
> real-time systems, whose users apparently wish to use BPF while also
> achieving low latencies.
>
> Change the protection of tracepoints to use srcu_read_lock_fast()
> instead. This will allow the callbacks to be preempted, which also
> means that the callbacks themselves need to be able to handle this
> newfound preemptibility.
>
> For perf, add a guard(preempt) inside its handler to keep the old
> behavior of perf events being called with preemption disabled.
>
> For BPF, add a migrate_disable() to its handler. Actually, just replace
> the rcu_read_lock() with rcu_read_lock_dont_migrate() and make it
> cover more of the BPF callback handler.
My tests just triggered this, so I'm removing them from my queue for now.
-- Steve
[ 204.194772] ------------[ cut here ]------------
[ 204.194789] WARNING: kernel/rcu/srcutree.c:792 at __srcu_check_read_flavor+0x5c/0xb0, CPU#1: swapper/1/0
[ 204.194800] Modules linked in:
[ 204.194817] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.19.0-rc7-test-00018-g2c774d6ad074-dirty #32 PREEMPT(voluntary)
[ 204.194821] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
[ 204.194824] RIP: 0010:__srcu_check_read_flavor+0x5c/0xb0
[ 204.194829] Code: 84 c9 74 19 39 f1 74 45 0f 0b 85 c0 74 2e 39 c1 74 45 0f 0b 39 f0 75 3f c3 cc cc cc cc 85 c0 74 16 83 fe 04 75 ee 0f 0b eb ea <0f> 0b 8d 46 ff 85 f0 74 ba 0f 0b eb b6 83 fe 04 74 3a 31 c0 f0 0f
[ 204.194832] RSP: 0018:fffffe4c48325b50 EFLAGS: 00010002
[ 204.194835] RAX: 0000000000000001 RBX: ffffffff8791e5a0 RCX: 0000000000000000
[ 204.194836] RDX: 00000000ffffffff RSI: 0000000000000004 RDI: ffffffff879f1180
[ 204.194838] RBP: ffff8e6453fd2000 R08: 0000000000000001 R09: 0000000000000000
[ 204.194839] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
[ 204.194840] R13: fffffe4c48325ef8 R14: ffffffff85eeae93 R15: ffff8e6453906900
[ 204.194842] FS: 0000000000000000(0000) GS:ffff8e6533593000(0000) knlGS:0000000000000000
[ 204.194844] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 204.194845] CR2: 000055d1e7cf8cc0 CR3: 000000010b0cc004 CR4: 0000000000172ef0
[ 204.194850] Call Trace:
[ 204.194866] <NMI>
[ 204.194868] lock_release+0x215/0x320
[ 204.194886] ? arch_perf_update_userpage+0x6c/0xf0
[ 204.195214] perf_event_update_userpage+0x158/0x2e0
[ 204.195538] x86_perf_event_set_period+0xc1/0x180
[ 204.195811] handle_pmi_common+0x1ac/0x450
[ 204.198605] ? __get_next_timer_interrupt+0x185/0x370
[ 204.198914] intel_pmu_handle_irq+0x10e/0x510
[ 204.199032] ? nmi_handle.part.0+0x30/0x270
[ 204.199197] ? __get_next_timer_interrupt+0x185/0x370
[ 204.199404] perf_event_nmi_handler+0x34/0x60
[ 204.199523] nmi_handle.part.0+0xc9/0x270
^ permalink raw reply [flat|nested] 9+ messages in thread