* [RFC PATCH 0/2] Switch tracing from sched-RCU to preempt-RCU
@ 2025-06-13 15:22 Sebastian Andrzej Siewior
2025-06-13 15:22 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Sebastian Andrzej Siewior
` (2 more replies)
0 siblings, 3 replies; 49+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-13 15:22 UTC (permalink / raw)
To: linux-rt-devel, rcu, linux-trace-kernel
Cc: Paul E. McKenney, Boqun Feng, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Mathieu Desnoyers,
Neeraj Upadhyay, Steven Rostedt, Thomas Gleixner,
Uladzislau Rezki, Zqiang, Sebastian Andrzej Siewior
This is a follow-up on
https://lore.kernel.org/all/20241206120709.736f943e@gandalf.local.home/T/#u
The problem is that tracepoints disable preemption due to sched-RCU.
The arguments passed to tracepoints may use locks (yes, they should
not), and eBPF programs can be attached to tracepoints and may use
locks which become sleeping locks on PREEMPT_RT.
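For illustration only (a minimal sketch, not part of the patches): with
the current guard(preempt_notrace)() in __do_trace_##name(), a callback
such as the hypothetical one below ends up taking a sleeping lock
(spinlock_t on PREEMPT_RT) with preemption disabled:
#include <linux/spinlock.h>
static DEFINE_SPINLOCK(my_stats_lock);	/* spinlock_t: a sleeping lock on PREEMPT_RT */
/* Hypothetical probe attached to a tracepoint, e.g. via a module or BPF. */
static void my_probe(void *data, unsigned long arg)
{
	/* Today this runs with preemption disabled by the tracepoint. */
	spin_lock(&my_stats_lock);
	/* ... update per-event statistics ... */
	spin_unlock(&my_stats_lock);
}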
Mathieu said regarding the need for preempt_notrace:
| There are a few things to consider here about the constraints of the
| callsites where the tracepoints are inserted. In general, those need to
| be:
|
| - NMI-safe
| - notrace
| - usable from the scheduler (with rq lock held)
| - usable to trace the RCU implementation
This is covered by the rcu_read_lock_notrace() suggested here.
Mathieu's suggested steps were:
| Well the first step would be to introduce a rcu_read_lock/unlock_notrace.
Patch #1
| This solves the problem at the tracepoint level, but requires that we
| initially move the preempt disable to the tracer callbacks. Then we
| can figure out within each tracer what needs to be done to further
| reduce the preempt off critical section.
Here I am stuck: Patch #2 is probably not correct due to "move the
preempt disable to the tracer callbacks". I didn't figure out where
those callbacks are…
Sebastian Andrzej Siewior (2):
rcu: Add rcu_read_lock_notrace()
trace: Use rcu_read_lock() instead preempt_disable()
include/linux/rcupdate.h | 41 +++++++++++++++++++++++++++++++++++++
include/linux/tracepoint.h | 2 +-
kernel/rcu/tree_plugin.h | 14 +++++++++++++
kernel/trace/trace_events.c | 8 +-------
4 files changed, 57 insertions(+), 8 deletions(-)
--
2.49.0
* [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-06-13 15:22 [RFC PATCH 0/2] Switch tracing from sched-RCU to preempt-RCU Sebastian Andrzej Siewior
@ 2025-06-13 15:22 ` Sebastian Andrzej Siewior
2025-06-18 17:21 ` Boqun Feng
2025-06-13 15:22 ` [RFC PATCH 2/2] trace: Use rcu_read_lock() instead preempt_disable() Sebastian Andrzej Siewior
2025-06-13 15:38 ` [RFC PATCH 0/2] Switch tracing from sched-RCU to preempt-RCU Mathieu Desnoyers
2 siblings, 1 reply; 49+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-13 15:22 UTC (permalink / raw)
To: linux-rt-devel, rcu, linux-trace-kernel
Cc: Paul E. McKenney, Boqun Feng, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Mathieu Desnoyers,
Neeraj Upadhyay, Steven Rostedt, Thomas Gleixner,
Uladzislau Rezki, Zqiang, Sebastian Andrzej Siewior
rcu_read_lock_notrace() follows preempt_disable_notrace(). The idea is
to provide a rcu_read_lock() version so that preempt_disable_notrace()
users (which use RCU-sched under the hood) can migrate to preemptible
RCU.
The first user should be tracing (tracepoints), which currently uses the
_notrace variant.
rcu_read_lock_notrace() is a slim version of rcu_read_lock(). It simply
increments/decrements the nesting counter. It does not emit any warnings
if RCU_NEST_PMAX is exceeded, nor does it participate in
STRICT_GRACE_PERIOD. It also does not invoke rcu_read_unlock_special(),
just like rcu_read_unlock() when invoked from NMI.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/rcupdate.h | 41 ++++++++++++++++++++++++++++++++++++++++
kernel/rcu/tree_plugin.h | 14 ++++++++++++++
2 files changed, 55 insertions(+)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 120536f4c6eb1..0de7e68a2411a 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -71,6 +71,8 @@ static inline bool same_state_synchronize_rcu(unsigned long oldstate1, unsigned
void __rcu_read_lock(void);
void __rcu_read_unlock(void);
+void __rcu_read_lock_notrace(void);
+void __rcu_read_unlock_notrace(void);
/*
* Defined as a macro as it is a very low level header included from
@@ -93,6 +95,11 @@ static inline void __rcu_read_lock(void)
preempt_disable();
}
+static inline void __rcu_read_lock_notrace(void)
+{
+ preempt_disable_notrace();
+}
+
static inline void __rcu_read_unlock(void)
{
if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
@@ -100,6 +107,11 @@ static inline void __rcu_read_unlock(void)
preempt_enable();
}
+static inline void __rcu_read_unlock_notrace(void)
+{
+ preempt_enable_notrace();
+}
+
static inline int rcu_preempt_depth(void)
{
return 0;
@@ -843,6 +855,16 @@ static __always_inline void rcu_read_lock(void)
"rcu_read_lock() used illegally while idle");
}
+/*
+ * Used by tracing: cannot be traced, NMI safe, usable from the scheduler,
+ * usable to trace the RCU implementation.
+ */
+static __always_inline void rcu_read_lock_notrace(void)
+{
+ __rcu_read_lock_notrace();
+ __acquire(RCU);
+}
+
/*
* So where is rcu_write_lock()? It does not exist, as there is no
* way for writers to lock out RCU readers. This is a feature, not
@@ -873,6 +895,11 @@ static inline void rcu_read_unlock(void)
__rcu_read_unlock();
}
+static __always_inline void rcu_read_unlock_notrace(void)
+{
+ __release(RCU);
+ __rcu_read_unlock_notrace();
+}
/**
* rcu_read_lock_bh() - mark the beginning of an RCU-bh critical section
*
@@ -1166,4 +1193,18 @@ DEFINE_LOCK_GUARD_0(rcu,
} while (0),
rcu_read_unlock())
+DEFINE_LOCK_GUARD_0(rcu_notrace,
+ do {
+ rcu_read_lock_notrace();
+ /*
+ * sparse doesn't call the cleanup function,
+ * so just release immediately and don't track
+ * the context. We don't need to anyway, since
+ * the whole point of the guard is to not need
+ * the explicit unlock.
+ */
+ __release(RCU);
+ } while (0),
+ rcu_read_unlock_notrace())
+
#endif /* __LINUX_RCUPDATE_H */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0b0f56f6abc85..02cccca917a22 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -420,6 +420,13 @@ void __rcu_read_lock(void)
}
EXPORT_SYMBOL_GPL(__rcu_read_lock);
+notrace void __rcu_read_lock_notrace(void)
+{
+ rcu_preempt_read_enter();
+ barrier(); /* critical section after entry code. */
+}
+EXPORT_SYMBOL_GPL(__rcu_read_lock_notrace);
+
/*
* Preemptible RCU implementation for rcu_read_unlock().
* Decrement ->rcu_read_lock_nesting. If the result is zero (outermost
@@ -445,6 +452,13 @@ void __rcu_read_unlock(void)
}
EXPORT_SYMBOL_GPL(__rcu_read_unlock);
+notrace void __rcu_read_unlock_notrace(void)
+{
+ barrier(); // critical section before exit code.
+ rcu_preempt_read_exit();
+}
+EXPORT_SYMBOL_GPL(__rcu_read_unlock_notrace);
+
/*
* Advance a ->blkd_tasks-list pointer to the next entry, instead
* returning NULL if at the end of the list.
--
2.49.0
* [RFC PATCH 2/2] trace: Use rcu_read_lock() instead preempt_disable()
2025-06-13 15:22 [RFC PATCH 0/2] Switch tracing from sched-RCU to preempt-RCU Sebastian Andrzej Siewior
2025-06-13 15:22 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Sebastian Andrzej Siewior
@ 2025-06-13 15:22 ` Sebastian Andrzej Siewior
2025-06-13 15:38 ` [RFC PATCH 0/2] Switch tracing from sched-RCU to preempt-RCU Mathieu Desnoyers
2 siblings, 0 replies; 49+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-13 15:22 UTC (permalink / raw)
To: linux-rt-devel, rcu, linux-trace-kernel
Cc: Paul E. McKenney, Boqun Feng, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Mathieu Desnoyers,
Neeraj Upadhyay, Steven Rostedt, Thomas Gleixner,
Uladzislau Rezki, Zqiang, Sebastian Andrzej Siewior
Switch from classic RCU to preemptible RCU.
This requires changing the guard in the __do_trace_ macro and accounting
for it in trace_event_buffer_reserve(), where the preemption counter is
currently subtracted. That adjustment is then no longer needed.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/tracepoint.h | 2 +-
kernel/trace/trace_events.c | 8 +-------
2 files changed, 2 insertions(+), 8 deletions(-)
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 826ce3f8e1f85..15f851a1de55d 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -271,7 +271,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
static inline void __do_trace_##name(proto) \
{ \
if (cond) { \
- guard(preempt_notrace)(); \
+ guard(rcu_notrace)(); \
__DO_TRACE_CALL(name, TP_ARGS(args)); \
} \
} \
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 120531268abff..76ea1cf9f4d65 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -659,13 +659,7 @@ void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
trace_event_ignore_this_pid(trace_file))
return NULL;
- /*
- * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
- * preemption (adding one to the preempt_count). Since we are
- * interested in the preempt_count at the time the tracepoint was
- * hit, we need to subtract one to offset the increment.
- */
- fbuffer->trace_ctx = tracing_gen_ctx_dec();
+ fbuffer->trace_ctx = tracing_gen_ctx();
fbuffer->trace_file = trace_file;
fbuffer->event =
--
2.49.0
* Re: [RFC PATCH 0/2] Switch tracing from sched-RCU to preempt-RCU
2025-06-13 15:22 [RFC PATCH 0/2] Switch tracing from sched-RCU to preempt-RCU Sebastian Andrzej Siewior
2025-06-13 15:22 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Sebastian Andrzej Siewior
2025-06-13 15:22 ` [RFC PATCH 2/2] trace: Use rcu_read_lock() instead preempt_disable() Sebastian Andrzej Siewior
@ 2025-06-13 15:38 ` Mathieu Desnoyers
2 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2025-06-13 15:38 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-rt-devel, rcu,
linux-trace-kernel
Cc: Paul E. McKenney, Boqun Feng, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On 2025-06-13 11:22, Sebastian Andrzej Siewior wrote:
> This is a follow-up on
> https://lore.kernel.org/all/20241206120709.736f943e@gandalf.local.home/T/#u
>
> The problem is that trace points disable preemption due to sched-RCU.
> The arguments passed to tracepoints may used locks (yes, they should
> not) and eBPF programs can be attached to tracepoints which may use
> locks which become sleeping locks on PREEMPT_RT.
>
> Mathieu said regarding the need for preempt_notrace:
> | There are a few things to consider here about the constraints of the
> | callsites where the tracepoints are inserted. In general, those need to
> | be:
> |
> | - NMI-safe
> | - notrace
> | - usable from the scheduler (with rq lock held)
> | - usable to trace the RCU implementation
>
> This is covered by the here suggested rcu_read_lock_notrace().
>
> Mathieu's suggested steps were:
>
> | Well the first step would be to introduce a rcu_read_lock/unlock_notrace.
>
> Patch #1
>
> | This solves the problem at the tracepoint level, but requires that we
> | initially move the preempt disable to the tracer callbacks. Then we
> | can figure out within each tracer what needs to be done to further
> | reduce the preempt off critical section.
>
> Here I am stuck. Patch #2 is probably not correct due to "move the
> preempt disable to the tracer callbacks". I didn't figure out where they
> are…
As part of the faultable syscall tracepoint series, I've identified
those sites for perf and ftrace. See the relevant commits:
commit 65e7462a16cea593025ca3b34c5d74e69b027ee0
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date: Tue Oct 8 21:07:13 2024 -0400
tracing/perf: disable preemption in syscall probe
commit 13d750c2c03e9861e15268574ed2c239cca9c9d5
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date: Tue Oct 8 21:07:12 2024 -0400
tracing/ftrace: disable preemption in syscall probe
You'll want to make sure the DECLARE_EVENT_CLASS also
does the preempt_{disable,enable}_notrace similarly
to the DECLARE_EVENT_SYSCALL_CLASS.
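(For illustration, a rough sketch of what "move the preempt disable into
the tracer callback" could look like; the probe name is hypothetical and
this is not the code from the commits above:)
/* Hypothetical trace-event probe taking over the preemption disabling. */
static notrace void my_event_probe(void *data, unsigned long arg)
{
	preempt_disable_notrace();	/* formerly done by the tracepoint */
	/* ... reserve ring-buffer space and record the event ... */
	preempt_enable_notrace();
}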
AFAIR something also had to be done to make sure migration was disabled
for eBPF syscall probes. You will want to disable migration for normal
tracepoints as well.
Hoping this helps,
Mathieu
>
> Sebastian Andrzej Siewior (2):
> rcu: Add rcu_read_lock_notrace()
> trace: Use rcu_read_lock() instead preempt_disable()
>
> include/linux/rcupdate.h | 41 +++++++++++++++++++++++++++++++++++++
> include/linux/tracepoint.h | 2 +-
> kernel/rcu/tree_plugin.h | 14 +++++++++++++
> kernel/trace/trace_events.c | 8 +-------
> 4 files changed, 57 insertions(+), 8 deletions(-)
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-06-13 15:22 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Sebastian Andrzej Siewior
@ 2025-06-18 17:21 ` Boqun Feng
2025-06-20 8:43 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 49+ messages in thread
From: Boqun Feng @ 2025-06-18 17:21 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-rt-devel, rcu, linux-trace-kernel, Paul E. McKenney,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Mathieu Desnoyers, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On Fri, Jun 13, 2025 at 05:22:17PM +0200, Sebastian Andrzej Siewior wrote:
> rcu_read_lock_notrace() follows preempt_disable_notrace(). The idea is
> provide a rcu_read_lock() version so that the preempt_disable_notrace()
> user (which use RCU-sched under the hood) can migrate to preemptible
> RCU.
> The first user should be tracing (tracepoint) which is using the
> _notrace variant.
>
> rcu_read_lock_notrace() is a slim version of rcu_read_lock(). It simply
> increments/ decrements the counter. It does not emit any warnings if
> RCU_NEST_PMAX is exceeded or patricipates in STRICT_GRACE_PERIOD. It
> also does not participate in rcu_read_unlock_special() as it would if
> invoke from NMI.
>
If it doesn't participate in rcu_read_unlock_special() then it may not
report a QS in time, right? For example:
CPU 2                              CPU 3
=====                              =====
rcu_read_lock_notrace();
                                   synchronize_rcu();
                                   // need all CPUs report a QS.
rcu_read_unlock_notrace();
// no rcu_read_unlock_special() so QS will only be reported the next
// time another RCU read-side critical section exits.
Wouldn't this be a problem?
Regards,
Boqun
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[...]
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 0b0f56f6abc85..02cccca917a22 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -420,6 +420,13 @@ void __rcu_read_lock(void)
> }
> EXPORT_SYMBOL_GPL(__rcu_read_lock);
>
> +notrace void __rcu_read_lock_notrace(void)
> +{
> + rcu_preempt_read_enter();
> + barrier(); /* critical section after entry code. */
> +}
> +EXPORT_SYMBOL_GPL(__rcu_read_lock_notrace);
> +
> /*
> * Preemptible RCU implementation for rcu_read_unlock().
> * Decrement ->rcu_read_lock_nesting. If the result is zero (outermost
> @@ -445,6 +452,13 @@ void __rcu_read_unlock(void)
> }
> EXPORT_SYMBOL_GPL(__rcu_read_unlock);
>
> +notrace void __rcu_read_unlock_notrace(void)
> +{
> + barrier(); // critical section before exit code.
> + rcu_preempt_read_exit();
> +}
> +EXPORT_SYMBOL_GPL(__rcu_read_unlock_notrace);
> +
> /*
> * Advance a ->blkd_tasks-list pointer to the next entry, instead
> * returning NULL if at the end of the list.
> --
> 2.49.0
>
>
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-06-18 17:21 ` Boqun Feng
@ 2025-06-20 8:43 ` Sebastian Andrzej Siewior
2025-06-20 11:23 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-20 8:43 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-rt-devel, rcu, linux-trace-kernel, Paul E. McKenney,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Mathieu Desnoyers, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On 2025-06-18 10:21:40 [-0700], Boqun Feng wrote:
>
> If it doesn't participate in rcu_read_unlock_special() then it may not
> report a QS in time, right? For example:
>
> CPU 2 CPU 3
> =====
> rcu_read_lock_notrace();
> synchronize_rcu();
> // need all CPUs report a QS.
> rcu_read_unlock_notrace();
> // no rcu_read_unlock_special() so QS will only be reported next
> // time we if another RCU read-side critical section exits.
>
> Wouldn't this be a problem?
I hope not because it is not any different from
CPU 2                              CPU 3
=====                              =====
NMI
  rcu_read_lock();
                                   synchronize_rcu();
                                   // need all CPUs report a QS.
  rcu_read_unlock();
// no rcu_read_unlock_special() due to in_nmi().
If the NMI happens while the CPU is in userland (say a perf event) then
the NMI returns directly to userland.
After the tracing event completes (in this case) the CPU should run into
another RCU section on its way out via context switch or the tick
interrupt.
I assume the tick interrupt is what makes the NMI case work.
> Regards,
> Boqun
Sebastian
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-06-20 8:43 ` Sebastian Andrzej Siewior
@ 2025-06-20 11:23 ` Paul E. McKenney
2025-06-23 10:49 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2025-06-20 11:23 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Mathieu Desnoyers, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On Fri, Jun 20, 2025 at 10:43:34AM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-18 10:21:40 [-0700], Boqun Feng wrote:
> >
> > If it doesn't participate in rcu_read_unlock_special() then it may not
> > report a QS in time, right? For example:
> >
> > CPU 2 CPU 3
> > =====
> > rcu_read_lock_notrace();
> > synchronize_rcu();
> > // need all CPUs report a QS.
> > rcu_read_unlock_notrace();
> > // no rcu_read_unlock_special() so QS will only be reported next
> > // time we if another RCU read-side critical section exits.
> >
> > Wouldn't this be a problem?
>
> I hope not because it is not any different from
>
> CPU 2 CPU 3
> ===== =====
> NMI
> rcu_read_lock();
> synchronize_rcu();
> // need all CPUs report a QS.
> rcu_read_unlock();
> // no rcu_read_unlock_special() due to in_nmi().
>
> If the NMI happens while the CPU is in userland (say a perf event) then
> the NMI returns directly to userland.
> After the tracing event completes (in this case) the CPU should run into
> another RCU section on its way out via context switch or the tick
> interrupt.
> I assume the tick interrupt is what makes the NMI case work.
Are you promising that interrupts will always be disabled across
the whole rcu_read_lock_notrace() read-side critical section? If so,
could we please have a lockdep_assert_irqs_disabled() call to check that?
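(A minimal sketch of the check being asked for, assuming it would sit in
the helper added by patch #1:)
notrace void __rcu_read_lock_notrace(void)
{
	lockdep_assert_irqs_disabled();	/* only valid if irqs-off is promised */
	rcu_preempt_read_enter();
	barrier();	/* critical section after entry code. */
}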
Thanx, Paul
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-06-20 11:23 ` Paul E. McKenney
@ 2025-06-23 10:49 ` Sebastian Andrzej Siewior
2025-06-23 18:13 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-23 10:49 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Mathieu Desnoyers, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On 2025-06-20 04:23:49 [-0700], Paul E. McKenney wrote:
> > I hope not because it is not any different from
> >
> > CPU 2 CPU 3
> > ===== =====
> > NMI
> > rcu_read_lock();
> > synchronize_rcu();
> > // need all CPUs report a QS.
> > rcu_read_unlock();
> > // no rcu_read_unlock_special() due to in_nmi().
> >
> > If the NMI happens while the CPU is in userland (say a perf event) then
> > the NMI returns directly to userland.
> > After the tracing event completes (in this case) the CPU should run into
> > another RCU section on its way out via context switch or the tick
> > interrupt.
> > I assume the tick interrupt is what makes the NMI case work.
>
> Are you promising that interrupts will be always be disabled across
> the whole rcu_read_lock_notrace() read-side critical section? If so,
> could we please have a lockdep_assert_irqs_disabled() call to check that?
No, that should stay preemptible because bpf can attach itself to
tracepoints, and this is the root cause of the exercise. Now if you say
it has to run with disabled interrupts to match the NMI case then it
makes sense (since NMIs have interrupts off), but I do not understand
why it matters here (since the CPU returns to userland without passing
through the kernel).
I'm not sure how much can be done here due to the notrace part. Assuming
rcu_read_unlock_special() is not doable, would forcing a context switch
(via setting need-resched and irq_work, as in the IRQ-off case) do the
trick?
Looking through rcu_preempt_deferred_qs_irqrestore(), it does not look
to be "usable from the scheduler (with rq lock held)" due to RCU
boosting or the wake of expedited_wq (which is one of the requirements).
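(A hedged sketch of the "need-resched plus irq_work" idea; all names are
hypothetical and this is not from the posted patches:)
#include <linux/irq_work.h>
#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/sched.h>
static void rcu_notrace_qs_handler(struct irq_work *iwp)
{
	/* Report the deferred quiescent state here, outside the notrace path. */
}
static DEFINE_PER_CPU(struct irq_work, rcu_notrace_qs_iw) =
	IRQ_WORK_INIT_HARD(rcu_notrace_qs_handler);
static notrace void rcu_notrace_defer_qs(void)
{
	/* Mirror the IRQ-off deferral: request a reschedule and let a
	 * self-IPI (irq_work) finish the quiescent-state reporting. */
	set_tsk_need_resched(current);
	set_preempt_need_resched();
	irq_work_queue(this_cpu_ptr(&rcu_notrace_qs_iw));
}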
> Thanx, Paul
Sebastian
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-06-23 10:49 ` Sebastian Andrzej Siewior
@ 2025-06-23 18:13 ` Paul E. McKenney
2025-07-07 21:56 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2025-06-23 18:13 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Mathieu Desnoyers, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On Mon, Jun 23, 2025 at 12:49:41PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-20 04:23:49 [-0700], Paul E. McKenney wrote:
> > > I hope not because it is not any different from
> > >
> > > CPU 2 CPU 3
> > > ===== =====
> > > NMI
> > > rcu_read_lock();
> > > synchronize_rcu();
> > > // need all CPUs report a QS.
> > > rcu_read_unlock();
> > > // no rcu_read_unlock_special() due to in_nmi().
> > >
> > > If the NMI happens while the CPU is in userland (say a perf event) then
> > > the NMI returns directly to userland.
> > > After the tracing event completes (in this case) the CPU should run into
> > > another RCU section on its way out via context switch or the tick
> > > interrupt.
> > > I assume the tick interrupt is what makes the NMI case work.
> >
> > Are you promising that interrupts will be always be disabled across
> > the whole rcu_read_lock_notrace() read-side critical section? If so,
> > could we please have a lockdep_assert_irqs_disabled() call to check that?
>
> No, that should stay preemptible because bpf can attach itself to
> tracepoints and this is the root cause of the exercise. Now if you say
> it has to be run with disabled interrupts to match the NMI case then it
> makes sense (since NMIs have interrupts off) but I do not understand why
> it matters here (since the CPU returns to userland without passing the
> kernel).
Given your patch, if you don't disable interrupts in a preemptible kernel
across your rcu_read_lock_notrace()/rcu_read_unlock_notrace() pair, then a
concurrent expedited grace period might send its IPI in the middle of that
critical section. That IPI handler would set up state so that the next
rcu_preempt_deferred_qs_irqrestore() would report the quiescent state.
Except that without the call to rcu_read_unlock_special(), there might
not be any subsequent call to rcu_preempt_deferred_qs_irqrestore().
This is even more painful if this is a CONFIG_PREEMPT_RT kernel.
Then if that critical section was preempted and then priority-boosted,
the unboosting also won't happen until the next call to that same
rcu_preempt_deferred_qs_irqrestore() function, which again might not
happen. Or might be significantly delayed.
Or am I missing some trick that fixes all of this?
> I'm not sure how much can be done here due to the notrace part. Assuming
> rcu_read_unlock_special() is not doable, would forcing a context switch
> (via setting need-resched and irq_work, as the IRQ-off case) do the
> trick?
> Looking through rcu_preempt_deferred_qs_irqrestore() it does not look to
> be "usable from the scheduler (with rq lock held)" due to RCU-boosting
> or the wake of expedited_wq (which is one of the requirement).
But if rq_lock is held, then interrupts are disabled, which will
cause the unboosting to be deferred.
Or are the various deferral mechanisms also unusable in this context?
Thanx, Paul
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-06-23 18:13 ` Paul E. McKenney
@ 2025-07-07 21:56 ` Paul E. McKenney
2025-07-08 19:40 ` Mathieu Desnoyers
0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-07 21:56 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Mathieu Desnoyers, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On Mon, Jun 23, 2025 at 11:13:03AM -0700, Paul E. McKenney wrote:
> On Mon, Jun 23, 2025 at 12:49:41PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2025-06-20 04:23:49 [-0700], Paul E. McKenney wrote:
> > > > I hope not because it is not any different from
> > > >
> > > > CPU 2 CPU 3
> > > > ===== =====
> > > > NMI
> > > > rcu_read_lock();
> > > > synchronize_rcu();
> > > > // need all CPUs report a QS.
> > > > rcu_read_unlock();
> > > > // no rcu_read_unlock_special() due to in_nmi().
> > > >
> > > > If the NMI happens while the CPU is in userland (say a perf event) then
> > > > the NMI returns directly to userland.
> > > > After the tracing event completes (in this case) the CPU should run into
> > > > another RCU section on its way out via context switch or the tick
> > > > interrupt.
> > > > I assume the tick interrupt is what makes the NMI case work.
> > >
> > > Are you promising that interrupts will be always be disabled across
> > > the whole rcu_read_lock_notrace() read-side critical section? If so,
> > > could we please have a lockdep_assert_irqs_disabled() call to check that?
> >
> > No, that should stay preemptible because bpf can attach itself to
> > tracepoints and this is the root cause of the exercise. Now if you say
> > it has to be run with disabled interrupts to match the NMI case then it
> > makes sense (since NMIs have interrupts off) but I do not understand why
> > it matters here (since the CPU returns to userland without passing the
> > kernel).
>
> Given your patch, if you don't disable interrupts in a preemptible kernel
> across your rcu_read_lock_notrace()/rcu_read_unlock_notrace() pair, then a
> concurrent expedited grace period might send its IPI in the middle of that
> critical section. That IPI handler would set up state so that the next
> rcu_preempt_deferred_qs_irqrestore() would report the quiescent state.
> Except that without the call to rcu_read_unlock_special(), there might
> not be any subsequent call to rcu_preempt_deferred_qs_irqrestore().
>
> This is even more painful if this is a CONFIG_PREEMPT_RT kernel.
> Then if that critical section was preempted and then priority-boosted,
> the unboosting also won't happen until the next call to that same
> rcu_preempt_deferred_qs_irqrestore() function, which again might not
> happen. Or might be significantly delayed.
>
> Or am I missing some trick that fixes all of this?
>
> > I'm not sure how much can be done here due to the notrace part. Assuming
> > rcu_read_unlock_special() is not doable, would forcing a context switch
> > (via setting need-resched and irq_work, as the IRQ-off case) do the
> > trick?
> > Looking through rcu_preempt_deferred_qs_irqrestore() it does not look to
> > be "usable from the scheduler (with rq lock held)" due to RCU-boosting
> > or the wake of expedited_wq (which is one of the requirement).
>
> But if rq_lock is held, then interrupts are disabled, which will
> cause the unboosting to be deferred.
>
> Or are the various deferral mechanisms also unusable in this context?
OK, looking back through this thread, it appears that you need both
an rcu_read_lock_notrace() and an rcu_read_unlock_notrace() that are
covered by Mathieu's list of requirements [1]:
| - NMI-safe
This is covered by the existing rcu_read_lock() and
rcu_read_unlock().
| - notrace
I am guessing that by "notrace", you mean the "notrace" CPP
macro attribute defined in include/linux/compiler_types.h.
This has no fewer than four different definitions, so I will
need some help understanding what the restrictions are.
| - usable from the scheduler (with rq lock held)
This is covered by the existing rcu_read_lock() and
rcu_read_unlock().
| - usable to trace the RCU implementation
This one I don't understand. Can I put tracepoints on
rcu_read_lock_notrace() and rcu_read_unlock_notrace() or can't I?
I was assuming that tracepoints would be forbidden. Until I
reached this requirement, that is.
One possible path forward is to ensure that rcu_read_unlock_special()
calls only functions that are compatible with the notrace/trace
requirements. The ones that look like they might need some help are
raise_softirq_irqoff() and irq_work_queue_on(). Note that although
rcu_preempt_deferred_qs_irqrestore() would also need help, it is easy to
avoid its being invoked, for example, by disabling interrupts across the
call to rcu_read_unlock_notrace(). Or by making rcu_read_unlock_notrace()
do the disabling.
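(Purely illustrative sketch of that last option, glossing over the
remaining raise_softirq_irqoff()/irq_work_queue_on() question:)
static __always_inline void rcu_read_unlock_notrace(void)
{
	unsigned long flags;

	/*
	 * With irqs off across the decrement, any special work noted by
	 * __rcu_read_unlock() is deferred rather than reported here.
	 */
	local_irq_save(flags);
	__release(RCU);
	__rcu_read_unlock();
	local_irq_restore(flags);
}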
However, I could easily be missing something, especially given my being
confused by the juxtaposition of "notrace" and "usable to trace the
RCU implementation". These appear to me to be contradicting each other.
Help?
Thanx, Paul
[1] https://lore.kernel.org/all/20250613152218.1924093-1-bigeasy@linutronix.de/
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-07 21:56 ` Paul E. McKenney
@ 2025-07-08 19:40 ` Mathieu Desnoyers
2025-07-08 20:49 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2025-07-08 19:40 UTC (permalink / raw)
To: paulmck, Sebastian Andrzej Siewior
Cc: Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Steven Rostedt,
Thomas Gleixner, Uladzislau Rezki, Zqiang
On 2025-07-07 17:56, Paul E. McKenney wrote:
> On Mon, Jun 23, 2025 at 11:13:03AM -0700, Paul E. McKenney wrote:
>> On Mon, Jun 23, 2025 at 12:49:41PM +0200, Sebastian Andrzej Siewior wrote:
>>> On 2025-06-20 04:23:49 [-0700], Paul E. McKenney wrote:
>>>>> I hope not because it is not any different from
>>>>>
>>>>> CPU 2 CPU 3
>>>>> ===== =====
>>>>> NMI
>>>>> rcu_read_lock();
>>>>> synchronize_rcu();
>>>>> // need all CPUs report a QS.
>>>>> rcu_read_unlock();
>>>>> // no rcu_read_unlock_special() due to in_nmi().
>>>>>
>>>>> If the NMI happens while the CPU is in userland (say a perf event) then
>>>>> the NMI returns directly to userland.
>>>>> After the tracing event completes (in this case) the CPU should run into
>>>>> another RCU section on its way out via context switch or the tick
>>>>> interrupt.
>>>>> I assume the tick interrupt is what makes the NMI case work.
>>>>
>>>> Are you promising that interrupts will be always be disabled across
>>>> the whole rcu_read_lock_notrace() read-side critical section? If so,
>>>> could we please have a lockdep_assert_irqs_disabled() call to check that?
>>>
>>> No, that should stay preemptible because bpf can attach itself to
>>> tracepoints and this is the root cause of the exercise. Now if you say
>>> it has to be run with disabled interrupts to match the NMI case then it
>>> makes sense (since NMIs have interrupts off) but I do not understand why
>>> it matters here (since the CPU returns to userland without passing the
>>> kernel).
>>
>> Given your patch, if you don't disable interrupts in a preemptible kernel
>> across your rcu_read_lock_notrace()/rcu_read_unlock_notrace() pair, then a
>> concurrent expedited grace period might send its IPI in the middle of that
>> critical section. That IPI handler would set up state so that the next
>> rcu_preempt_deferred_qs_irqrestore() would report the quiescent state.
>> Except that without the call to rcu_read_unlock_special(), there might
>> not be any subsequent call to rcu_preempt_deferred_qs_irqrestore().
>>
>> This is even more painful if this is a CONFIG_PREEMPT_RT kernel.
>> Then if that critical section was preempted and then priority-boosted,
>> the unboosting also won't happen until the next call to that same
>> rcu_preempt_deferred_qs_irqrestore() function, which again might not
>> happen. Or might be significantly delayed.
>>
>> Or am I missing some trick that fixes all of this?
>>
>>> I'm not sure how much can be done here due to the notrace part. Assuming
>>> rcu_read_unlock_special() is not doable, would forcing a context switch
>>> (via setting need-resched and irq_work, as the IRQ-off case) do the
>>> trick?
>>> Looking through rcu_preempt_deferred_qs_irqrestore() it does not look to
>>> be "usable from the scheduler (with rq lock held)" due to RCU-boosting
>>> or the wake of expedited_wq (which is one of the requirement).
>>
>> But if rq_lock is held, then interrupts are disabled, which will
>> cause the unboosting to be deferred.
>>
>> Or are the various deferral mechanisms also unusable in this context?
>
> OK, looking back through this thread, it appears that you need both
> an rcu_read_lock_notrace() and an rcu_read_unlock_notrace() that are
> covered by Mathieu's list of requirements [1]:
>
I'm just jumping into this email thread, so I'll try to clarify
what I may.
> | - NMI-safe
> This is covered by the existing rcu_read_lock() and
> rcu_read_unlock().
OK
> | - notrace
> I am guessing that by "notrace", you mean the "notrace" CPP
> macro attribute defined in include/linux/compiler_types.h.
> This has no fewer than four different definitions, so I will
> need some help understanding what the restrictions are.
When I listed notrace in my desiderata, I had in mind the
preempt_{disable,enable}_notrace macros, which are guaranteed not to
call into instrumentation themselves and can therefore be used from the
tracepoint implementation or from a tracer callback.
> | - usable from the scheduler (with rq lock held)
> This is covered by the existing rcu_read_lock() and
> rcu_read_unlock().
OK
> | - usable to trace the RCU implementation
> This one I don't understand. Can I put tracepoints on
> rcu_read_lock_notrace() and rcu_read_unlock_notrace() or can't I?
> I was assuming that tracepoints would be forbidden. Until I
> reached this requirement, that is.
At a high level, a rcu_read_{lock,unlock}_notrace should be something
that is safe to call from the tracepoint implementation or from a
tracer callback.
My expectation is that the "normal" (not notrace) RCU APIs would
be instrumented with tracepoints, and tracepoints and tracer callbacks
are allowed to use the _notrace RCU APIs.
This provides instrumentation coverage of RCU with the exception of the
_notrace users, which is pretty much the best we can hope for without
having a circular dependency.
>
> One possible path forward is to ensure that rcu_read_unlock_special()
> calls only functions that are compatible with the notrace/trace
> requirements. The ones that look like they might need some help are
> raise_softirq_irqoff() and irq_work_queue_on(). Note that although
> rcu_preempt_deferred_qs_irqrestore() would also need help, it is easy to
> avoid its being invoked, for example, by disabing interrupts across the
> call to rcu_read_unlock_notrace(). Or by making rcu_read_unlock_notrace()
> do the disabling.
>
> However, I could easily be missing something, especially given my being
> confused by the juxtaposition of "notrace" and "usable to trace the
> RCU implementation". These appear to me to be contradicting each other.
>
> Help?
You indeed need to ensure that everything called from
rcu_read_{lock,unlock}_notrace doesn't end up executing instrumentation,
to prevent a circular dependency. You hinted at a few ways to achieve
this. Other possible approaches:
- Add a "trace" bool parameter to rcu_read_unlock_special(),
- Duplicate rcu_read_unlock_special() and introduce a notrace symbol,
- Keep a nesting count in the task struct to prevent calling the
  instrumentation when nested in notrace (a rough sketch follows below).
There are probably other possible approaches I am missing, each with
their respective trade-offs.
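A rough illustration of the nesting-count idea (the field and helper
names are hypothetical, not existing kernel API and not in the patches):
#include <linux/sched.h>
/* Sketch only: assumes a new task_struct field ->rcu_notrace_nesting. */
static inline void rcu_notrace_enter(void)
{
	current->rcu_notrace_nesting++;
}
static inline void rcu_notrace_exit(void)
{
	current->rcu_notrace_nesting--;
}
static inline bool rcu_in_notrace(void)
{
	return current->rcu_notrace_nesting != 0;
}
/*
 * rcu_read_unlock_special() and friends would then guard any
 * instrumentation with "if (!rcu_in_notrace())".
 */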
Thanks,
Mathieu
>
> Thanx, Paul
>
> [1] https://lore.kernel.org/all/20250613152218.1924093-1-bigeasy@linutronix.de/
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-08 19:40 ` Mathieu Desnoyers
@ 2025-07-08 20:49 ` Paul E. McKenney
2025-07-09 14:31 ` Mathieu Desnoyers
0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-08 20:49 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On Tue, Jul 08, 2025 at 03:40:05PM -0400, Mathieu Desnoyers wrote:
> On 2025-07-07 17:56, Paul E. McKenney wrote:
> > On Mon, Jun 23, 2025 at 11:13:03AM -0700, Paul E. McKenney wrote:
> > > On Mon, Jun 23, 2025 at 12:49:41PM +0200, Sebastian Andrzej Siewior wrote:
> > > > On 2025-06-20 04:23:49 [-0700], Paul E. McKenney wrote:
> > > > > > I hope not because it is not any different from
> > > > > >
> > > > > > CPU 2 CPU 3
> > > > > > ===== =====
> > > > > > NMI
> > > > > > rcu_read_lock();
> > > > > > synchronize_rcu();
> > > > > > // need all CPUs report a QS.
> > > > > > rcu_read_unlock();
> > > > > > // no rcu_read_unlock_special() due to in_nmi().
> > > > > >
> > > > > > If the NMI happens while the CPU is in userland (say a perf event) then
> > > > > > the NMI returns directly to userland.
> > > > > > After the tracing event completes (in this case) the CPU should run into
> > > > > > another RCU section on its way out via context switch or the tick
> > > > > > interrupt.
> > > > > > I assume the tick interrupt is what makes the NMI case work.
> > > > >
> > > > > Are you promising that interrupts will be always be disabled across
> > > > > the whole rcu_read_lock_notrace() read-side critical section? If so,
> > > > > could we please have a lockdep_assert_irqs_disabled() call to check that?
> > > >
> > > > No, that should stay preemptible because bpf can attach itself to
> > > > tracepoints and this is the root cause of the exercise. Now if you say
> > > > it has to be run with disabled interrupts to match the NMI case then it
> > > > makes sense (since NMIs have interrupts off) but I do not understand why
> > > > it matters here (since the CPU returns to userland without passing the
> > > > kernel).
> > >
> > > Given your patch, if you don't disable interrupts in a preemptible kernel
> > > across your rcu_read_lock_notrace()/rcu_read_unlock_notrace() pair, then a
> > > concurrent expedited grace period might send its IPI in the middle of that
> > > critical section. That IPI handler would set up state so that the next
> > > rcu_preempt_deferred_qs_irqrestore() would report the quiescent state.
> > > Except that without the call to rcu_read_unlock_special(), there might
> > > not be any subsequent call to rcu_preempt_deferred_qs_irqrestore().
> > >
> > > This is even more painful if this is a CONFIG_PREEMPT_RT kernel.
> > > Then if that critical section was preempted and then priority-boosted,
> > > the unboosting also won't happen until the next call to that same
> > > rcu_preempt_deferred_qs_irqrestore() function, which again might not
> > > happen. Or might be significantly delayed.
> > >
> > > Or am I missing some trick that fixes all of this?
> > >
> > > > I'm not sure how much can be done here due to the notrace part. Assuming
> > > > rcu_read_unlock_special() is not doable, would forcing a context switch
> > > > (via setting need-resched and irq_work, as the IRQ-off case) do the
> > > > trick?
> > > > Looking through rcu_preempt_deferred_qs_irqrestore() it does not look to
> > > > be "usable from the scheduler (with rq lock held)" due to RCU-boosting
> > > > or the wake of expedited_wq (which is one of the requirement).
> > >
> > > But if rq_lock is held, then interrupts are disabled, which will
> > > cause the unboosting to be deferred.
> > >
> > > Or are the various deferral mechanisms also unusable in this context?
> >
> > OK, looking back through this thread, it appears that you need both
> > an rcu_read_lock_notrace() and an rcu_read_unlock_notrace() that are
> > covered by Mathieu's list of requirements [1]:
>
> I'm just jumping into this email thread, so I'll try to clarify
> what I may.
Thank you for jumping in!
> > | - NMI-safe
> > This is covered by the existing rcu_read_lock() and
> > rcu_read_unlock().
>
> OK
>
> > | - notrace
> > I am guessing that by "notrace", you mean the "notrace" CPP
> > macro attribute defined in include/linux/compiler_types.h.
> > This has no fewer than four different definitions, so I will
> > need some help understanding what the restrictions are.
>
> When I listed notrace in my desiderata, I had in mind the
> preempt_{disable,enable}_notrace macros, which ensure that
> those macros don't call instrumentation, and therefore can be
> used from the implementation of tracepoint instrumentation or
> from a tracer callback.
OK, that makes sense. I have some questions:
Are interrupts disabled at the calls to rcu_read_unlock_notrace()?
If interrupts are instead enabled, is it OK to use an IRQ-work handler
that does an immediate self-interrupt and a bunch of non-notrace work?
If interrupts are to be enabled, but an immediate IRQ-work handler is
ruled out, what mechanism should I be using to defer the work, given
that it cannot be deferred for very long without messing up real-time
latencies? As in a delay of nanoseconds or maybe a microsecond, but
not multiple microseconds let alone milliseconds?
(I could imagine short-timeout hrtimer, but I am not seeing notrace
variants of these guys, either.)
> > | - usable from the scheduler (with rq lock held)
> > This is covered by the existing rcu_read_lock() and
> > rcu_read_unlock().
>
> OK
>
> > | - usable to trace the RCU implementation
> > This one I don't understand. Can I put tracepoints on
> > rcu_read_lock_notrace() and rcu_read_unlock_notrace() or can't I?
> > I was assuming that tracepoints would be forbidden. Until I
> > reached this requirement, that is.
>
> At a high level, a rcu_read_{lock,unlock}_notrace should be something
> that is safe to call from the tracepoint implementation or from a
> tracer callback.
>
> My expectation is that the "normal" (not notrace) RCU APIs would
> be instrumented with tracepoints, and tracepoints and tracer callbacks
> are allowed to use the _notrace RCU APIs.
>
> This provides instrumentation coverage of RCU with the exception of the
> _notrace users, which is pretty much the best we can hope for without
> having a circular dependency.
OK, got it, and that does make sense. I am not sure how I would have
intuited that from the description, but what is life without a challenge?
> > One possible path forward is to ensure that rcu_read_unlock_special()
> > calls only functions that are compatible with the notrace/trace
> > requirements. The ones that look like they might need some help are
> > raise_softirq_irqoff() and irq_work_queue_on(). Note that although
> > rcu_preempt_deferred_qs_irqrestore() would also need help, it is easy to
> > avoid its being invoked, for example, by disabing interrupts across the
> > call to rcu_read_unlock_notrace(). Or by making rcu_read_unlock_notrace()
> > do the disabling.
> >
> > However, I could easily be missing something, especially given my being
> > confused by the juxtaposition of "notrace" and "usable to trace the
> > RCU implementation". These appear to me to be contradicting each other.
> >
> > Help?
>
> You indeed need to ensure that everything that is called from
> rcu_{lock,unlock]_notrace don't end up executing instrumentation
> to prevent a circular dependency. You hinted at a few ways to achieve
> this. Other possible approaches:
>
> - Add a "trace" bool parameter to rcu_read_unlock_special(),
> - Duplicate rcu_read_unlock_special() and introduce a notrace symbol.
OK, both of these are reasonable alternatives for the API, but it will
still be necessary to figure out how to make the notrace-incompatible
work happen.
> - Keep some nesting count in the task struct to prevent calling the
> instrumentation when nested in notrace,
OK, for this one, is the idea to invoke some TBD RCU API when the tracing
exits the notrace region? I could see that working. But there would
need to be a guarantee that if the notrace API was invoked, a call to
this TBD RCU API would follow in short order. And I suspect that
preemption (and probably also interrupts) would need to be disabled
across this region.
> There are probably other possible approaches I am missing, each with
> their respective trade offs.
I am pretty sure that we also have some ways to go before we have the
requirements fully laid out, for that matter. ;-)
Could you please tell me where in the current tracing code these
rcu_read_lock_notrace()/rcu_read_unlock_notrace() calls would be placed?
Thanx, Paul
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-08 20:49 ` Paul E. McKenney
@ 2025-07-09 14:31 ` Mathieu Desnoyers
2025-07-09 18:33 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2025-07-09 14:31 UTC (permalink / raw)
To: paulmck
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On 2025-07-08 16:49, Paul E. McKenney wrote:
> On Tue, Jul 08, 2025 at 03:40:05PM -0400, Mathieu Desnoyers wrote:
>> On 2025-07-07 17:56, Paul E. McKenney wrote:
>>> On Mon, Jun 23, 2025 at 11:13:03AM -0700, Paul E. McKenney wrote:
>>>> On Mon, Jun 23, 2025 at 12:49:41PM +0200, Sebastian Andrzej Siewior wrote:
>>>>> On 2025-06-20 04:23:49 [-0700], Paul E. McKenney wrote:
>>>>>>> I hope not because it is not any different from
>>>>>>>
>>>>>>> CPU 2 CPU 3
>>>>>>> ===== =====
>>>>>>> NMI
>>>>>>> rcu_read_lock();
>>>>>>> synchronize_rcu();
>>>>>>> // need all CPUs report a QS.
>>>>>>> rcu_read_unlock();
>>>>>>> // no rcu_read_unlock_special() due to in_nmi().
>>>>>>>
>>>>>>> If the NMI happens while the CPU is in userland (say a perf event) then
>>>>>>> the NMI returns directly to userland.
>>>>>>> After the tracing event completes (in this case) the CPU should run into
>>>>>>> another RCU section on its way out via context switch or the tick
>>>>>>> interrupt.
>>>>>>> I assume the tick interrupt is what makes the NMI case work.
>>>>>>
>>>>>> Are you promising that interrupts will be always be disabled across
>>>>>> the whole rcu_read_lock_notrace() read-side critical section? If so,
>>>>>> could we please have a lockdep_assert_irqs_disabled() call to check that?
>>>>>
>>>>> No, that should stay preemptible because bpf can attach itself to
>>>>> tracepoints and this is the root cause of the exercise. Now if you say
>>>>> it has to be run with disabled interrupts to match the NMI case then it
>>>>> makes sense (since NMIs have interrupts off) but I do not understand why
>>>>> it matters here (since the CPU returns to userland without passing the
>>>>> kernel).
>>>>
>>>> Given your patch, if you don't disable interrupts in a preemptible kernel
>>>> across your rcu_read_lock_notrace()/rcu_read_unlock_notrace() pair, then a
>>>> concurrent expedited grace period might send its IPI in the middle of that
>>>> critical section. That IPI handler would set up state so that the next
>>>> rcu_preempt_deferred_qs_irqrestore() would report the quiescent state.
>>>> Except that without the call to rcu_read_unlock_special(), there might
>>>> not be any subsequent call to rcu_preempt_deferred_qs_irqrestore().
>>>>
>>>> This is even more painful if this is a CONFIG_PREEMPT_RT kernel.
>>>> Then if that critical section was preempted and then priority-boosted,
>>>> the unboosting also won't happen until the next call to that same
>>>> rcu_preempt_deferred_qs_irqrestore() function, which again might not
>>>> happen. Or might be significantly delayed.
>>>>
>>>> Or am I missing some trick that fixes all of this?
>>>>
>>>>> I'm not sure how much can be done here due to the notrace part. Assuming
>>>>> rcu_read_unlock_special() is not doable, would forcing a context switch
>>>>> (via setting need-resched and irq_work, as the IRQ-off case) do the
>>>>> trick?
>>>>> Looking through rcu_preempt_deferred_qs_irqrestore() it does not look to
>>>>> be "usable from the scheduler (with rq lock held)" due to RCU-boosting
>>>>> or the wake of expedited_wq (which is one of the requirement).
>>>>
>>>> But if rq_lock is held, then interrupts are disabled, which will
>>>> cause the unboosting to be deferred.
>>>>
>>>> Or are the various deferral mechanisms also unusable in this context?
>>>
>>> OK, looking back through this thread, it appears that you need both
>>> an rcu_read_lock_notrace() and an rcu_read_unlock_notrace() that are
>>> covered by Mathieu's list of requirements [1]:
>>
>> I'm just jumping into this email thread, so I'll try to clarify
>> what I may.
>
> Thank you for jumping in!
>
>>> | - NMI-safe
>>> This is covered by the existing rcu_read_lock() and
>>> rcu_read_unlock().
>>
>> OK
>>
>>> | - notrace
>>> I am guessing that by "notrace", you mean the "notrace" CPP
>>> macro attribute defined in include/linux/compiler_types.h.
>>> This has no fewer than four different definitions, so I will
>>> need some help understanding what the restrictions are.
>>
>> When I listed notrace in my desiderata, I had in mind the
>> preempt_{disable,enable}_notrace macros, which ensure that
>> those macros don't call instrumentation, and therefore can be
>> used from the implementation of tracepoint instrumentation or
>> from a tracer callback.
>
> OK, that makes sense. I have some questions:
>
> Are interrupts disabled at the calls to rcu_read_unlock_notrace()?
No.
>
> If interrupts are instead enabled, is it OK for an IRQ-work handler that
> did an immediate self-interrupt and a bunch of non-notrace work?
>
> If interrupts are to be enabled, but an immediate IRQ-work handler is
> ruled out, what mechanism should I be using to defer the work, given
> that it cannot be deferred for very long without messing up real-time
> latencies? As in a delay of nanoseconds or maybe a microsecond, but
> not multiple microseconds let alone milliseconds?
>
> (I could imagine short-timeout hrtimer, but I am not seeing notrace
> variants of these guys, either.)
I have nothing against an immediate IRQ-work, but I'm worried about
_nested_ immediate IRQ-work, where we end up triggering an endless
recursion of IRQ-work through instrumentation.
Ideally we would want to figure out a way to prevent endless recursion,
while keeping the immediate IRQ-work within rcu_read_unlock_notrace()
to keep RT latency within bounded ranges, but without adding unwanted
overhead through too many conditional branches in the fast paths.
Is there a way to issue a given IRQ-work only when not nested in that
IRQ-work handler? Any way we can detect and prevent that recursion
should work fine.
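One possible shape, purely as a hedged sketch (names are made up and
this is not from the posted patches): a per-CPU flag that suppresses
re-queueing while the handler itself runs.
#include <linux/irq_work.h>
#include <linux/percpu.h>
static DEFINE_PER_CPU(bool, rcu_defer_qs_iw_active);
static void rcu_defer_qs_iw_handler(struct irq_work *iwp)
{
	__this_cpu_write(rcu_defer_qs_iw_active, true);
	/* ... report the deferred quiescent state; may itself be traced ... */
	__this_cpu_write(rcu_defer_qs_iw_active, false);
}
static void rcu_defer_qs_iw_queue(struct irq_work *iwp)
{
	/* Skip re-queueing when instrumentation fires from inside the handler. */
	if (!__this_cpu_read(rcu_defer_qs_iw_active))
		irq_work_queue(iwp);
}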
>
>>> | - usable from the scheduler (with rq lock held)
>>> This is covered by the existing rcu_read_lock() and
>>> rcu_read_unlock().
>>
>> OK
>>
>>> | - usable to trace the RCU implementation
>>> This one I don't understand. Can I put tracepoints on
>>> rcu_read_lock_notrace() and rcu_read_unlock_notrace() or can't I?
>>> I was assuming that tracepoints would be forbidden. Until I
>>> reached this requirement, that is.
>>
>> At a high level, a rcu_read_{lock,unlock}_notrace should be something
>> that is safe to call from the tracepoint implementation or from a
>> tracer callback.
>>
>> My expectation is that the "normal" (not notrace) RCU APIs would
>> be instrumented with tracepoints, and tracepoints and tracer callbacks
>> are allowed to use the _notrace RCU APIs.
>>
>> This provides instrumentation coverage of RCU with the exception of the
>> _notrace users, which is pretty much the best we can hope for without
>> having a circular dependency.
>
> OK, got it, and that does make sense. I am not sure how I would have
> intuited that from the description, but what is life without a challenge?
That certainly makes life interesting!
As a side-note, I think the "_notrace" suffix I've described comes from
the "notrace" function attribute background, which is indeed also used
to prevent function tracing of the annotated functions, for similar
purposes.
Kprobes also has annotation mechanisms to prevent inserting breakpoints
in specific functions, and in other cases we rely on compiler flags to
prevent instrumentation of entire objects.
But mostly the goal of all of those mechanisms is the same: allow some
kernel code to be used from instrumentation and tracer callbacks without
triggering endless recursion.
>
>>> One possible path forward is to ensure that rcu_read_unlock_special()
>>> calls only functions that are compatible with the notrace/trace
>>> requirements. The ones that look like they might need some help are
>>> raise_softirq_irqoff() and irq_work_queue_on(). Note that although
>>> rcu_preempt_deferred_qs_irqrestore() would also need help, it is easy to
>>> avoid its being invoked, for example, by disabing interrupts across the
>>> call to rcu_read_unlock_notrace(). Or by making rcu_read_unlock_notrace()
>>> do the disabling.
>>>
>>> However, I could easily be missing something, especially given my being
>>> confused by the juxtaposition of "notrace" and "usable to trace the
>>> RCU implementation". These appear to me to be contradicting each other.
>>>
>>> Help?
>>
>> You indeed need to ensure that everything that is called from
>> rcu_{lock,unlock]_notrace don't end up executing instrumentation
>> to prevent a circular dependency. You hinted at a few ways to achieve
>> this. Other possible approaches:
>>
>> - Add a "trace" bool parameter to rcu_read_unlock_special(),
>> - Duplicate rcu_read_unlock_special() and introduce a notrace symbol.
>
> OK, both of these are reasonable alternatives for the API, but it will
> still be necessary to figure out how to make the notrace-incompatible
> work happen.
One downside of those two approaches is that they require somewhat
duplicating the code (trace vs notrace). This makes it tricky in the
case of irq work, because the irq work is just some interrupt, so we're
limited in how we can pass around parameters that would use the
notrace code.
>
>> - Keep some nesting count in the task struct to prevent calling the
>> instrumentation when nested in notrace,
>
> OK, for this one, is the idea to invoke some TBD RCU API the tracing
> exits the notrace region? I could see that working. But there would
> need to be a guarantee that if the notrace API was invoked, a call to
> this TBD RCU API would follow in short order. And I suspect that
> preemption (and probably also interrupts) would need to be disabled
> across this region.
Not quite.
What I have in mind is to try to find the most elegant way to prevent
endless recursion of the irq work issued immediately on
rcu_read_unlock_notrace without slowing down most fast paths, and
ideally without too much code duplication.
I'm not entirely sure what would be the best approach though.
>
>> There are probably other possible approaches I am missing, each with
>> their respective trade offs.
>
> I am pretty sure that we also have some ways to go before we have the
> requirements fully laid out, for that matter. ;-)
>
> Could you please tell me where in the current tracing code these
> rcu_read_lock_notrace()/rcu_read_unlock_notrace() calls would be placed?
AFAIU here:
include/linux/tracepoint.h:
#define __DECLARE_TRACE(name, proto, args, cond, data_proto) [...]
static inline void __do_trace_##name(proto) \
{ \
if (cond) { \
guard(preempt_notrace)(); \
__DO_TRACE_CALL(name, TP_ARGS(args)); \
} \
} \
static inline void trace_##name(proto) \
{ \
if (static_branch_unlikely(&__tracepoint_##name.key)) \
__do_trace_##name(args); \
if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
WARN_ONCE(!rcu_is_watching(), \
"RCU not watching for tracepoint"); \
} \
}
and
#define __DECLARE_TRACE_SYSCALL(name, proto, args, data_proto) [...]
static inline void __do_trace_##name(proto) \
{ \
guard(rcu_tasks_trace)(); \
__DO_TRACE_CALL(name, TP_ARGS(args)); \
} \
static inline void trace_##name(proto) \
{ \
might_fault(); \
if (static_branch_unlikely(&__tracepoint_##name.key)) \
__do_trace_##name(args); \
if (IS_ENABLED(CONFIG_LOCKDEP)) { \
WARN_ONCE(!rcu_is_watching(), \
"RCU not watching for tracepoint"); \
} \
}
Thanks,
Mathieu
>
> Thanx, Paul
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-09 14:31 ` Mathieu Desnoyers
@ 2025-07-09 18:33 ` Paul E. McKenney
2025-07-11 13:46 ` Mathieu Desnoyers
0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-09 18:33 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On Wed, Jul 09, 2025 at 10:31:14AM -0400, Mathieu Desnoyers wrote:
> On 2025-07-08 16:49, Paul E. McKenney wrote:
> > On Tue, Jul 08, 2025 at 03:40:05PM -0400, Mathieu Desnoyers wrote:
> > > On 2025-07-07 17:56, Paul E. McKenney wrote:
> > > > On Mon, Jun 23, 2025 at 11:13:03AM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Jun 23, 2025 at 12:49:41PM +0200, Sebastian Andrzej Siewior wrote:
> > > > > > On 2025-06-20 04:23:49 [-0700], Paul E. McKenney wrote:
> > > > > > > > I hope not because it is not any different from
> > > > > > > >
> > > > > > > > CPU 2 CPU 3
> > > > > > > > ===== =====
> > > > > > > > NMI
> > > > > > > > rcu_read_lock();
> > > > > > > > synchronize_rcu();
> > > > > > > > // need all CPUs report a QS.
> > > > > > > > rcu_read_unlock();
> > > > > > > > // no rcu_read_unlock_special() due to in_nmi().
> > > > > > > >
> > > > > > > > If the NMI happens while the CPU is in userland (say a perf event) then
> > > > > > > > the NMI returns directly to userland.
> > > > > > > > After the tracing event completes (in this case) the CPU should run into
> > > > > > > > another RCU section on its way out via context switch or the tick
> > > > > > > > interrupt.
> > > > > > > > I assume the tick interrupt is what makes the NMI case work.
> > > > > > >
> > > > > > > Are you promising that interrupts will always be disabled across
> > > > > > > the whole rcu_read_lock_notrace() read-side critical section? If so,
> > > > > > > could we please have a lockdep_assert_irqs_disabled() call to check that?
> > > > > >
> > > > > > No, that should stay preemptible because bpf can attach itself to
> > > > > > tracepoints and this is the root cause of the exercise. Now if you say
> > > > > > it has to be run with disabled interrupts to match the NMI case then it
> > > > > > makes sense (since NMIs have interrupts off) but I do not understand why
> > > > > > it matters here (since the CPU returns to userland without passing through the
> > > > > > kernel).
> > > > >
> > > > > Given your patch, if you don't disable interrupts in a preemptible kernel
> > > > > across your rcu_read_lock_notrace()/rcu_read_unlock_notrace() pair, then a
> > > > > concurrent expedited grace period might send its IPI in the middle of that
> > > > > critical section. That IPI handler would set up state so that the next
> > > > > rcu_preempt_deferred_qs_irqrestore() would report the quiescent state.
> > > > > Except that without the call to rcu_read_unlock_special(), there might
> > > > > not be any subsequent call to rcu_preempt_deferred_qs_irqrestore().
> > > > >
> > > > > This is even more painful if this is a CONFIG_PREEMPT_RT kernel.
> > > > > Then if that critical section was preempted and then priority-boosted,
> > > > > the unboosting also won't happen until the next call to that same
> > > > > rcu_preempt_deferred_qs_irqrestore() function, which again might not
> > > > > happen. Or might be significantly delayed.
> > > > >
> > > > > Or am I missing some trick that fixes all of this?
> > > > >
> > > > > > I'm not sure how much can be done here due to the notrace part. Assuming
> > > > > > rcu_read_unlock_special() is not doable, would forcing a context switch
> > > > > > (via setting need-resched and irq_work, as in the IRQ-off case) do the
> > > > > > trick?
> > > > > > Looking through rcu_preempt_deferred_qs_irqrestore() it does not look to
> > > > > > be "usable from the scheduler (with rq lock held)" due to RCU-boosting
> > > > > > or the wake of expedited_wq (which is one of the requirements).
> > > > >
> > > > > But if rq_lock is held, then interrupts are disabled, which will
> > > > > cause the unboosting to be deferred.
> > > > >
> > > > > Or are the various deferral mechanisms also unusable in this context?
> > > >
> > > > OK, looking back through this thread, it appears that you need both
> > > > an rcu_read_lock_notrace() and an rcu_read_unlock_notrace() that are
> > > > covered by Mathieu's list of requirements [1]:
> > >
> > > I'm just jumping into this email thread, so I'll try to clarify
> > > what I may.
> >
> > Thank you for jumping in!
> >
> > > > | - NMI-safe
> > > > This is covered by the existing rcu_read_lock() and
> > > > rcu_read_unlock().
> > >
> > > OK
> > >
> > > > | - notrace
> > > > I am guessing that by "notrace", you mean the "notrace" CPP
> > > > macro attribute defined in include/linux/compiler_types.h.
> > > > This has no fewer than four different definitions, so I will
> > > > need some help understanding what the restrictions are.
> > >
> > > When I listed notrace in my desiderata, I had in mind the
> > > preempt_{disable,enable}_notrace macros, which ensure that
> > > those macros don't call instrumentation, and therefore can be
> > > used from the implementation of tracepoint instrumentation or
> > > from a tracer callback.
> >
> > OK, that makes sense. I have some questions:
> >
> > Are interrupts disabled at the calls to rcu_read_unlock_notrace()?
>
> No.
OK, so much for that family of approaches! ;-)
> > If interrupts are instead enabled, is it OK to use an IRQ-work handler that
> > does an immediate self-interrupt and a bunch of non-notrace work?
> >
> > If interrupts are to be enabled, but an immediate IRQ-work handler is
> > ruled out, what mechanism should I be using to defer the work, given
> > that it cannot be deferred for very long without messing up real-time
> > latencies? As in a delay of nanoseconds or maybe a microsecond, but
> > not multiple microseconds let alone milliseconds?
> >
> > (I could imagine short-timeout hrtimer, but I am not seeing notrace
> > variants of these guys, either.)
>
> I have nothing against an immediate IRQ-work, but I'm worried about
> _nested_ immediate IRQ-work, where we end up triggering an endless
> recursion of IRQ-work through instrumentation.
>
> Ideally we would want to figure out a way to prevent endless recursion,
> while keeping the immediate IRQ-work within rcu_read_unlock_notrace()
> to keep RT latency within bounded ranges, but without adding unwanted
> overhead by adding too many conditional branches to the fast-paths.
>
> Is there a way to issue a given IRQ-work only when not nested in that
> IRQ-work handler ? Any way we can detect and prevent that recursion
> should work fine.
You are looking for this commit from Joel Fernandes:
3284e4adca9b ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
This is in the shiny-ish new-ish shared RCU tree at:
git@gitolite.kernel.org:pub/scm/linux/kernel/git/rcu/linux
Or Message-Id <20250709104118.15532-6-neeraj.upadhyay@kernel.org>.
> > > > | - usable from the scheduler (with rq lock held)
> > > > This is covered by the existing rcu_read_lock() and
> > > > rcu_read_unlock().
> > >
> > > OK
> > >
> > > > | - usable to trace the RCU implementation
> > > > This one I don't understand. Can I put tracepoints on
> > > > rcu_read_lock_notrace() and rcu_read_unlock_notrace() or can't I?
> > > > I was assuming that tracepoints would be forbidden. Until I
> > > > reached this requirement, that is.
> > >
> > > At a high level, a rcu_read_{lock,unlock}_notrace should be something
> > > that is safe to call from the tracepoint implementation or from a
> > > tracer callback.
> > >
> > > My expectation is that the "normal" (not notrace) RCU APIs would
> > > be instrumented with tracepoints, and tracepoints and tracer callbacks
> > > are allowed to use the _notrace RCU APIs.
> > >
> > > This provides instrumentation coverage of RCU with the exception of the
> > > _notrace users, which is pretty much the best we can hope for without
> > > having a circular dependency.
> >
> > OK, got it, and that does make sense. I am not sure how I would have
> > intuited that from the description, but what is life without a challenge?
>
> That certainly makes life interesting !
>
> As a side-note, I think the "_notrace" suffix I've described comes from
> the "notrace" function attribute background, which is indeed also used
> to prevent function tracing of the annotated functions, for similar
> purposes.
>
> Kprobes also has annotation mechanisms to prevent inserting breakpoints
> in specific functions, and in other cases we rely on compiler flags to
> prevent instrumentation of entire objects.
>
> But mostly the goal of all of those mechanisms is the same: allow some
> kernel code to be used from instrumentation and tracer callbacks without
> triggering endless recursion.
OK, so is there some tracing anti-recursion flag in effect?
Or is there something else that I must do before invoking either
raise_softirq_irqoff() or irq_work_queue_on()?
Plus RCU does need *some* hint that it is not supposed to invoke
rcu_preempt_deferred_qs_irqrestore() in this case. Which might be
why you are suggesting an rcu_read_unlock_notrace().
> > > > One possible path forward is to ensure that rcu_read_unlock_special()
> > > > calls only functions that are compatible with the notrace/trace
> > > > requirements. The ones that look like they might need some help are
> > > > raise_softirq_irqoff() and irq_work_queue_on(). Note that although
> > > > rcu_preempt_deferred_qs_irqrestore() would also need help, it is easy to
> > > > > > avoid its being invoked, for example, by disabling interrupts across the
> > > > call to rcu_read_unlock_notrace(). Or by making rcu_read_unlock_notrace()
> > > > do the disabling.
> > > >
> > > > However, I could easily be missing something, especially given my being
> > > > confused by the juxtaposition of "notrace" and "usable to trace the
> > > > RCU implementation". These appear to me to be contradicting each other.
> > > >
> > > > Help?
> > >
> > > You indeed need to ensure that everything that is called from
> > > rcu_read_{lock,unlock}_notrace doesn't end up executing instrumentation
> > > to prevent a circular dependency. You hinted at a few ways to achieve
> > > this. Other possible approaches:
> > >
> > > - Add a "trace" bool parameter to rcu_read_unlock_special(),
> > > - Duplicate rcu_read_unlock_special() and introduce a notrace symbol.
> >
> > OK, both of these are reasonable alternatives for the API, but it will
> > still be necessary to figure out how to make the notrace-incompatible
> > work happen.
>
> One downside of those two approaches is that they require that we somewhat
> duplicate the code (trace vs notrace). This makes it tricky in the
> case of irq work, because the irq work is just some interrupt, so we're
> limited in how we can pass around parameters that would use the
> notrace code.
Joel's patch, which is currently slated for the upcoming merge window,
should take care of the endless-IRQ-work recursion problem. So the
main remaining issue is how rcu_read_unlock_special() should go about
safely invoking raise_softirq_irqoff() and irq_work_queue_on() when in
notrace mode.
> > > - Keep some nesting count in the task struct to prevent calling the
> > > instrumentation when nested in notrace,
> >
> > > OK, for this one, is the idea to invoke some TBD RCU API when the tracing
> > exits the notrace region? I could see that working. But there would
> > need to be a guarantee that if the notrace API was invoked, a call to
> > this TBD RCU API would follow in short order. And I suspect that
> > preemption (and probably also interrupts) would need to be disabled
> > across this region.
>
> Not quite.
>
> What I have in mind is to try to find the most elegant way to prevent
> endless recursion of the irq work issued immediately on
> rcu_read_unlock_notrace without slowing down most fast paths, and
> ideally without too much code duplication.
>
> I'm not entirely sure what would be the best approach though.
Joel's patch adjusts use of the rcu_data structure's ->defer_qs_iw_pending
flag, so that it is cleared not in the IRQ-work handler, but
instead in rcu_preempt_deferred_qs_irqrestore(). That prevents
rcu_read_unlock_special() from requeueing the IRQ-work handler until
after the previous request for a quiescent state has been satisfied.
So my main concern is again safely invoking raise_softirq_irqoff()
and irq_work_queue_on() when in notrace mode.
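In case it helps, here is a rough sketch of the resulting requeue-prevention
flow; the helper name is invented, the flag is shown as a plain bool, and
this is a paraphrase of the idea rather than the literal patch:
/* Invented helper, just to illustrate the flow after Joel's patch. */
static void rcu_defer_qs_via_irq_work(struct rcu_data *rdp)
{
	if (rdp->defer_qs_iw_pending)
		return;				/* still waiting on the last request */
	rdp->defer_qs_iw_pending = true;	/* set when queueing... */
	irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
	/* ...and cleared only from rcu_preempt_deferred_qs_irqrestore(),
	 * once the deferred quiescent state has actually been reported,
	 * so a nested rcu_read_unlock() cannot requeue it in between. */
}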
> > > There are probably other possible approaches I am missing, each with
> > > their respective trade offs.
> >
> > I am pretty sure that we also have some ways to go before we have the
> > requirements fully laid out, for that matter. ;-)
> >
> > Could you please tell me where in the current tracing code these
> > rcu_read_lock_notrace()/rcu_read_unlock_notrace() calls would be placed?
>
> AFAIU here:
>
> include/linux/tracepoint.h:
>
> #define __DECLARE_TRACE(name, proto, args, cond, data_proto) [...]
>
> static inline void __do_trace_##name(proto) \
> { \
> if (cond) { \
> guard(preempt_notrace)(); \
> __DO_TRACE_CALL(name, TP_ARGS(args)); \
> } \
> } \
> static inline void trace_##name(proto) \
> { \
> if (static_branch_unlikely(&__tracepoint_##name.key)) \
> __do_trace_##name(args); \
> if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
> WARN_ONCE(!rcu_is_watching(), \
> "RCU not watching for tracepoint"); \
> } \
> }
>
> and
>
> #define __DECLARE_TRACE_SYSCALL(name, proto, args, data_proto) [...]
>
> static inline void __do_trace_##name(proto) \
> { \
> guard(rcu_tasks_trace)(); \
> __DO_TRACE_CALL(name, TP_ARGS(args)); \
> } \
> static inline void trace_##name(proto) \
> { \
> might_fault(); \
> if (static_branch_unlikely(&__tracepoint_##name.key)) \
> __do_trace_##name(args); \
> if (IS_ENABLED(CONFIG_LOCKDEP)) { \
> WARN_ONCE(!rcu_is_watching(), \
> "RCU not watching for tracepoint"); \
> } \
> }
I am not seeing a guard(rcu)() in here, only guard(preempt_notrace)()
and guard(rcu_tasks_trace)(). Or is the idea to move the first to
guard(rcu_notrace)() in order to improve PREEMPT_RT latency?
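If that is the plan, I am picturing something like the following, though the
guard definition is only my guess at the plumbing, assuming the
rcu_read_{lock,unlock}_notrace() from patch #1, and is not taken from the
posted patches:
DEFINE_LOCK_GUARD_0(rcu_notrace,
		    rcu_read_lock_notrace(),
		    rcu_read_unlock_notrace())
That would let __do_trace_##name() use guard(rcu_notrace)() in place of
guard(preempt_notrace)().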
Thanx, Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-09 18:33 ` Paul E. McKenney
@ 2025-07-11 13:46 ` Mathieu Desnoyers
2025-07-11 17:05 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2025-07-11 13:46 UTC (permalink / raw)
To: paulmck
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On 2025-07-09 14:33, Paul E. McKenney wrote:
> On Wed, Jul 09, 2025 at 10:31:14AM -0400, Mathieu Desnoyers wrote:
>> On 2025-07-08 16:49, Paul E. McKenney wrote:
>>> On Tue, Jul 08, 2025 at 03:40:05PM -0400, Mathieu Desnoyers wrote:
>>>> On 2025-07-07 17:56, Paul E. McKenney wrote:
>>>>> On Mon, Jun 23, 2025 at 11:13:03AM -0700, Paul E. McKenney wrote:
>>>>>> On Mon, Jun 23, 2025 at 12:49:41PM +0200, Sebastian Andrzej Siewior wrote:
>>>>>>> On 2025-06-20 04:23:49 [-0700], Paul E. McKenney wrote:
[...]
>
>> I have nothing against an immediate IRQ-work, but I'm worried about
>> _nested_ immediate IRQ-work, where we end up triggering an endless
>> recursion of IRQ-work through instrumentation.
>>
>> Ideally we would want to figure out a way to prevent endless recursion,
>> while keeping the immediate IRQ-work within rcu_read_unlock_notrace()
>> to keep RT latency within bounded ranges, but without adding unwanted
>> overhead by adding too many conditional branches to the fast-paths.
>>
>> Is there a way to issue a given IRQ-work only when not nested in that
>> IRQ-work handler ? Any way we can detect and prevent that recursion
>> should work fine.
>
> You are looking for this commit from Joel Fernandes:
>
> 3284e4adca9b ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
>
> This is in the shiny-ish new-ish shared RCU tree at:
>
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/rcu/linux
>
> Or Message-Id <20250709104118.15532-6-neeraj.upadhyay@kernel.org>.
Whatever means of tracking whether we are already in an irq work
context and preventing the emission of a recursive irq work should do.
Do I understand correctly that this commit takes care of preventing
recursive irq_work_queue_on() calls, but does not solve the issue of
recursive raise_softirq_irqoff() caused by tracepoint instrumentation ?
[...]
>>
>> As a side-note, I think the "_notrace" suffix I've described comes from
>> the "notrace" function attribute background, which is indeed also used
>> to prevent function tracing of the annotated functions, for similar
>> purposes.
>>
>> Kprobes also has annotation mechanisms to prevent inserting breakpoints
>> in specific functions, and in other cases we rely on compiler flags to
>> prevent instrumentation of entire objects.
>>
>> But mostly the goal of all of those mechanisms is the same: allow some
>> kernel code to be used from instrumentation and tracer callbacks without
>> triggering endless recursion.
>
> OK, so is there some tracing anti-recursion flag in effect?
> Or is there something else that I must do before invoking either
> raise_softirq_irqoff() or irq_work_queue_on()?
Note that we have two scenarios:
- A nested scenario with a causality relationship, IOW circular
dependency, which leads to endless recursion and a crash, and
- A nested scenario which is just the result of softirq, irq, nmi
nesting,
In all of those scenarios, what we really care about here is to make
sure the outermost execution context emits the irq work to prevent
long latency, but we don't care if the nested contexts skip it.
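Something along the following lines is what I have in mind; the per-task
field and the helper name are invented just for the sake of discussion:
/* current->in_deferred_qs_irq_work is a hypothetical per-task counter,
 * incremented on entry to the deferred-QS IRQ-work handler and
 * decremented on exit. */
static void rcu_emit_deferred_qs_irq_work(struct rcu_data *rdp)
{
	if (READ_ONCE(current->in_deferred_qs_irq_work))
		return;		/* nested: the outermost context already queued it */
	irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
}
The outermost context still gets its irq work promptly, and the nested
contexts (softirq/irq/nmi or instrumentation recursion) simply skip it.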
>
> Plus RCU does need *some* hint that it is not supposed to invoke
> rcu_preempt_deferred_qs_irqrestore() in this case. Which might be
> why you are suggesting an rcu_read_unlock_notrace().
The rcu_read_unlock_notrace() approach removes RCU instrumentation, even
for the outermost nesting level. We'd be losing some RCU instrumentation
coverage there, but on the upside we would prevent emitting an RCU read
unlock tracepoint for every other tracepoint hit, which I think is good.
If instead we take an approach where we track in which instrumentation
nesting level we are within the task_struct, then we can skip calls to
rcu_preempt_deferred_qs_irqrestore() in all nested contexts, but keep
calling it in the outermost context.
But I would tend to favor the notrace approach so we don't emit
semi-useless RCU read lock/unlock events for every other tracepoint
hit. It would clutter the trace.
>
>>>>> One possible path forward is to ensure that rcu_read_unlock_special()
>>>>> calls only functions that are compatible with the notrace/trace
>>>>> requirements. The ones that look like they might need some help are
>>>>> raise_softirq_irqoff() and irq_work_queue_on(). Note that although
>>>>> rcu_preempt_deferred_qs_irqrestore() would also need help, it is easy to
>>>>> avoid its being invoked, for example, by disabling interrupts across the
>>>>> call to rcu_read_unlock_notrace(). Or by making rcu_read_unlock_notrace()
>>>>> do the disabling.
>>>>>
>>>>> However, I could easily be missing something, especially given my being
>>>>> confused by the juxtaposition of "notrace" and "usable to trace the
>>>>> RCU implementation". These appear to me to be contradicting each other.
>>>>>
>>>>> Help?
>>>>
>>>> You indeed need to ensure that everything that is called from
>>>> rcu_read_{lock,unlock}_notrace doesn't end up executing instrumentation
>>>> to prevent a circular dependency. You hinted at a few ways to achieve
>>>> this. Other possible approaches:
>>>>
>>>> - Add a "trace" bool parameter to rcu_read_unlock_special(),
>>>> - Duplicate rcu_read_unlock_special() and introduce a notrace symbol.
>>>
>>> OK, both of these are reasonable alternatives for the API, but it will
>>> still be necessary to figure out how to make the notrace-incompatible
>>> work happen.
>>
>> One downside of those two approaches is that they require that we somewhat
>> duplicate the code (trace vs notrace). This makes it tricky in the
>> case of irq work, because the irq work is just some interrupt, so we're
>> limited in how we can pass around parameters that would use the
>> notrace code.
>
> Joel's patch, which is currently slated for the upcoming merge window,
> should take care of the endless-IRQ-work recursion problem. So the
> main remaining issue is how rcu_read_unlock_special() should go about
> safely invoking raise_softirq_irqoff() and irq_work_queue_on() when in
> notrace mode.
Are those two functions only needed from the outermost rcu_read_unlock
within a thread ? If yes, then keeping track of nesting level and
preventing those calls in nested contexts (per thread) should work.
>
>>>> - Keep some nesting count in the task struct to prevent calling the
>>>> instrumentation when nested in notrace,
>>>
>>> OK, for this one, is the idea to invoke some TBD RCU API when the tracing
>>> exits the notrace region? I could see that working. But there would
>>> need to be a guarantee that if the notrace API was invoked, a call to
>>> this TBD RCU API would follow in short order. And I suspect that
>>> preemption (and probably also interrupts) would need to be disabled
>>> across this region.
>>
>> No quite.
>>
>> What I have in mind is to try to find the most elegant way to prevent
>> endless recursion of the irq work issued immediately on
>> rcu_read_unlock_notrace without slowing down most fast paths, and
>> ideally without too much code duplication.
>>
>> I'm not entirely sure what would be the best approach though.
>
> Joel's patch adjusts use of the rcu_data structure's ->defer_qs_iw_pending
> flag, so that it is cleared not in the IRQ-work handler, but
> instead in rcu_preempt_deferred_qs_irqrestore(). That prevents
> rcu_read_unlock_special() from requeueing the IRQ-work handler until
> after the previous request for a quiescent state has been satisfied.
>
> So my main concern is again safely invoking raise_softirq_irqoff()
> and irq_work_queue_on() when in notrace mode.
Would the nesting counter (per thread) approach suffice for your use
case ?
>
>>>> There are probably other possible approaches I am missing, each with
>>>> their respective trade offs.
>>>
>>> I am pretty sure that we also have some ways to go before we have the
>>> requirements fully laid out, for that matter. ;-)
>>>
>>> Could you please tell me where in the current tracing code these
>>> rcu_read_lock_notrace()/rcu_read_unlock_notrace() calls would be placed?
>>
>> AFAIU here:
>>
>> include/linux/tracepoint.h:
>>
>> #define __DECLARE_TRACE(name, proto, args, cond, data_proto) [...]
>>
>> static inline void __do_trace_##name(proto) \
>> { \
>> if (cond) { \
>> guard(preempt_notrace)(); \
>> __DO_TRACE_CALL(name, TP_ARGS(args)); \
>> } \
>> } \
>> static inline void trace_##name(proto) \
>> { \
>> if (static_branch_unlikely(&__tracepoint_##name.key)) \
>> __do_trace_##name(args); \
>> if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
>> WARN_ONCE(!rcu_is_watching(), \
>> "RCU not watching for tracepoint"); \
>> } \
>> }
>>
>> and
>>
>> #define __DECLARE_TRACE_SYSCALL(name, proto, args, data_proto) [...]
>>
>> static inline void __do_trace_##name(proto) \
>> { \
>> guard(rcu_tasks_trace)(); \
>> __DO_TRACE_CALL(name, TP_ARGS(args)); \
>> } \
>> static inline void trace_##name(proto) \
>> { \
>> might_fault(); \
>> if (static_branch_unlikely(&__tracepoint_##name.key)) \
>> __do_trace_##name(args); \
>> if (IS_ENABLED(CONFIG_LOCKDEP)) { \
>> WARN_ONCE(!rcu_is_watching(), \
>> "RCU not watching for tracepoint"); \
>> } \
>> }
>
> I am not seeing a guard(rcu)() in here, only guard(preempt_notrace)()
> and guard(rcu_tasks_trace)(). Or is the idea to move the first to
> guard(rcu_notrace)() in order to improve PREEMPT_RT latency?
AFAIU the goal here is to turn the guard(preempt_notrace)() into a
guard(rcu_notrace)() because the preempt-off critical sections don't
agree with BPF.
Thanks,
Mathieu
>
> Thanx, Paul
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-11 13:46 ` Mathieu Desnoyers
@ 2025-07-11 17:05 ` Paul E. McKenney
2025-07-14 16:34 ` Paul E. McKenney
` (2 more replies)
0 siblings, 3 replies; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-11 17:05 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On Fri, Jul 11, 2025 at 09:46:25AM -0400, Mathieu Desnoyers wrote:
> On 2025-07-09 14:33, Paul E. McKenney wrote:
> > On Wed, Jul 09, 2025 at 10:31:14AM -0400, Mathieu Desnoyers wrote:
> > > On 2025-07-08 16:49, Paul E. McKenney wrote:
> > > > On Tue, Jul 08, 2025 at 03:40:05PM -0400, Mathieu Desnoyers wrote:
> > > > > On 2025-07-07 17:56, Paul E. McKenney wrote:
> > > > > > On Mon, Jun 23, 2025 at 11:13:03AM -0700, Paul E. McKenney wrote:
> > > > > > > On Mon, Jun 23, 2025 at 12:49:41PM +0200, Sebastian Andrzej Siewior wrote:
> > > > > > > > On 2025-06-20 04:23:49 [-0700], Paul E. McKenney wrote:
> [...]
> >
> > > I have nothing against an immediate IRQ-work, but I'm worried about
> > > _nested_ immediate IRQ-work, where we end up triggering an endless
> > > recursion of IRQ-work through instrumentation.
> > >
> > > Ideally we would want to figure out a way to prevent endless recursion,
> > > while keeping the immediate IRQ-work within rcu_read_unlock_notrace()
> > > to keep RT latency within bounded ranges, but without adding unwanted
> > > overhead by adding too many conditional branches to the fast-paths.
> > >
> > > Is there a way to issue a given IRQ-work only when not nested in that
> > > IRQ-work handler ? Any way we can detect and prevent that recursion
> > > should work fine.
> >
> > You are looking for this commit from Joel Fernandes:
> >
> > 3284e4adca9b ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
> >
> > This is in the shiny-ish new-ish shared RCU tree at:
> >
> > git@gitolite.kernel.org:pub/scm/linux/kernel/git/rcu/linux
> >
> > Or Message-Id <20250709104118.15532-6-neeraj.upadhyay@kernel.org>.
>
> Whatever means of tracking whether we are already in an irq work
> context and preventing the emission of a recursive irq work should do.
>
> Do I understand correctly that this commit takes care of preventing
> recursive irq_work_queue_on() calls, but does not solve the issue of
> recursive raise_softirq_irqoff() caused by tracepoint instrumentation ?
Exactly!
In case you are testing this, we do have a new version of the above
commit with a bug fix:
2e154d164418 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
> [...]
>
> > >
> > > As a side-note, I think the "_notrace" suffix I've described comes from
> > > the "notrace" function attribute background, which is indeed also used
> > > to prevent function tracing of the annotated functions, for similar
> > > purposes.
> > >
> > > Kprobes also has annotation mechanisms to prevent inserting breakpoints
> > > in specific functions, and in other cases we rely on compiler flags to
> > > prevent instrumentation of entire objects.
> > >
> > > But mostly the goal of all of those mechanisms is the same: allow some
> > > kernel code to be used from instrumentation and tracer callbacks without
> > > triggering endless recursion.
> >
> > OK, so is there some tracing anti-recursion flag in effect?
> > Or is there something else that I must do before invoking either
> > raise_softirq_irqoff() or irq_work_queue_on()?
>
> Note that we have two scenarios:
>
> - A nested scenario with a causality relationship, IOW circular
> dependency, which leads to endless recursion and a crash, and
>
> - A nested scenario which is just the result of softirq, irq, nmi
> nesting,
>
> In all of those scenarios, what we really care about here is to make
> sure the outermost execution context emits the irq work to prevent
> long latency, but we don't care if the nested contexts skip it.
From what I can see, the above commit deals with the endless recursion,
but not with the calls to the non-notrace functions. (If the calls to
non-notrace functions are in fact a problem?)
> > Plus RCU does need *some* hint that it is not supposed to invoke
> > rcu_preempt_deferred_qs_irqrestore() in this case. Which might be
> > why you are suggesting an rcu_read_unlock_notrace().
>
> The rcu_read_unlock_notrace() approach removes RCU instrumentation, even
> for the outermost nesting level. We'd be losing some RCU instrumentation
> coverage there, but on the upside we would prevent emitting an RCU read
> unlock tracepoint for every other tracepoint hit, which I think is good.
>
> If instead we take an approach where we track in which instrumentation
> nesting level we are within the task_struct, then we can skip calls to
> rcu_preempt_deferred_qs_irqrestore() in all nested contexts, but keep
> calling it in the outermost context.
>
> But I would tend to favor the notrace approach so we don't emit
> semi-useless RCU read lock/unlock events for every other tracepoint
> hit. It would clutter the trace.
But we still need to avoid grace-period hangs, needlessly high-latency
RCU expedited grace periods, and delayed RCU priority deboosts.
> > > > > > One possible path forward is to ensure that rcu_read_unlock_special()
> > > > > > calls only functions that are compatible with the notrace/trace
> > > > > > requirements. The ones that look like they might need some help are
> > > > > > raise_softirq_irqoff() and irq_work_queue_on(). Note that although
> > > > > > rcu_preempt_deferred_qs_irqrestore() would also need help, it is easy to
> > > > > > avoid its being invoked, for example, by disabling interrupts across the
> > > > > > call to rcu_read_unlock_notrace(). Or by making rcu_read_unlock_notrace()
> > > > > > do the disabling.
> > > > > >
> > > > > > However, I could easily be missing something, especially given my being
> > > > > > confused by the juxtaposition of "notrace" and "usable to trace the
> > > > > > RCU implementation". These appear to me to be contradicting each other.
> > > > > >
> > > > > > Help?
> > > > >
> > > > > You indeed need to ensure that everything that is called from
> > > > > rcu_read_{lock,unlock}_notrace doesn't end up executing instrumentation
> > > > > to prevent a circular dependency. You hinted at a few ways to achieve
> > > > > this. Other possible approaches:
> > > > >
> > > > > - Add a "trace" bool parameter to rcu_read_unlock_special(),
> > > > > - Duplicate rcu_read_unlock_special() and introduce a notrace symbol.
> > > >
> > > > OK, both of these are reasonable alternatives for the API, but it will
> > > > still be necessary to figure out how to make the notrace-incompatible
> > > > work happen.
> > >
> > > One downside of those two approaches is that they require that we somewhat
> > > duplicate the code (trace vs notrace). This makes it tricky in the
> > > case of irq work, because the irq work is just some interrupt, so we're
> > > limited in how we can pass around parameters that would use the
> > > notrace code.
> >
> > Joel's patch, which is currently slated for the upcoming merge window,
> > should take care of the endless-IRQ-work recursion problem. So the
> > main remaining issue is how rcu_read_unlock_special() should go about
> > safely invoking raise_softirq_irqoff() and irq_work_queue_on() when in
> > notrace mode.
>
> Are those two functions only needed from the outermost rcu_read_unlock
> within a thread ? If yes, then keeping track of nesting level and
> preventing those calls in nested contexts (per thread) should work.
You lost me on this one.
Yes, these are invoked only by the outermost rcu_read_unlock{,_notrace}().
And we already have a nesting counter, current->rcu_read_lock_nesting.
However...
If the tracing invokes an outermost rcu_read_unlock{,_notrace}(), then in
some contexts we absolutely need to invoke the raise_softirq_irqoff()
and irq_work_queue_on() functions, both of which are notrace functions.
Or are you telling me that it is OK for a rcu_read_unlock_notrace()
to directly call these non-notrace functions?
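Just so we are looking at the same thing, the unlock path is roughly shaped
like this; a heavily simplified sketch of kernel/rcu/tree_plugin.h, not the
literal implementation:
void __rcu_read_unlock(void)
{
	struct task_struct *t = current;

	if (rcu_preempt_read_exit() == 0) {	/* decrements ->rcu_read_lock_nesting */
		barrier();
		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
			rcu_read_unlock_special(t);	/* may reach non-notrace code */
	}
}
So by the time rcu_read_unlock_special() runs, ->rcu_read_lock_nesting is
already back to zero.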
> > > > > - Keep some nesting count in the task struct to prevent calling the
> > > > > instrumentation when nested in notrace,
> > > >
> > > > OK, for this one, is the idea to invoke some TBD RCU API when the tracing
> > > > exits the notrace region? I could see that working. But there would
> > > > need to be a guarantee that if the notrace API was invoked, a call to
> > > > this TBD RCU API would follow in short order. And I suspect that
> > > > preemption (and probably also interrupts) would need to be disabled
> > > > across this region.
> > >
> > > Not quite.
> > >
> > > What I have in mind is to try to find the most elegant way to prevent
> > > endless recursion of the irq work issued immediately on
> > > rcu_read_unlock_notrace without slowing down most fast paths, and
> > > ideally without too much code duplication.
> > >
> > > I'm not entirely sure what would be the best approach though.
> >
> > Joel's patch adjusts use of the rcu_data structure's ->defer_qs_iw_pending
> > flag, so that it is cleared not in the IRQ-work handler, but
> > instead in rcu_preempt_deferred_qs_irqrestore(). That prevents
> > rcu_read_unlock_special() from requeueing the IRQ-work handler until
> > after the previous request for a quiescent state has been satisfied.
> >
> > So my main concern is again safely invoking raise_softirq_irqoff()
> > and irq_work_queue_on() when in notrace mode.
>
> Would the nesting counter (per thread) approach suffice for your use
> case ?
Over and above the t->rcu_read_lock_nesting that we already use?
As in only the outermost rcu_read_unlock{,_notrace}() will invoke
rcu_read_unlock_special().
OK, let's look at a couple of scenarios.
First, suppose that we apply Joel's patch above, and someone sets a trace
point in task context outside of any RCU read-side critical section.
Suppose further that this task is preempted in the tracepoint's RCU
read-side critical section, and that RCU priority boosting is applied.
This trace point will invoke rcu_read_unlock{,_notrace}(), which
will in turn invoke rcu_read_unlock_special(), which will in turn
note that preemption, interrupts, and softirqs are all enabled.
It will therefore directly invoke rcu_preempt_deferred_qs_irqrestore(),
a non-notrace function, which can in turn invoke all sorts of interesting
functions involving locking, the scheduler, ...
Is this OK, or should I set some sort of tracepoint recursion flag?
Second, suppose that we apply Joel's patch above, and someone sets a trace
point in task context outside of an RCU read-side critical section, but in
a preemption-disabled region of code. Suppose further that this code is
delayed, perhaps due to a flurry of interrupts, so that a scheduling-clock
interrupt sets t->rcu_read_unlock_special.b.need_qs to true.
This trace point will invoke rcu_read_unlock{,_notrace}(), which will
note that preemption is disabled. If rcutree.use_softirq is set and
this task is blocking an expedited RCU grace period, it will directly
invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
it will directly invoke the non-notrace function irq_work_queue_on().
Is this OK, or should I set some sort of tracepoint recursion flag?
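If a flag is the answer, I am imagining something about this simple inside
rcu_read_unlock_special(), where the task_struct field name is made up and
the logic choosing between softirq and IRQ work is elided:
	/* Sketch only; field name invented, selection logic elided. */
	bool want_softirq = false;	/* stand-in for the real selection logic */

	if (!current->rcu_unlock_notrace_recursion) {	/* made-up field */
		current->rcu_unlock_notrace_recursion = 1;
		if (want_softirq)
			raise_softirq_irqoff(RCU_SOFTIRQ);
		else
			irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
		current->rcu_unlock_notrace_recursion = 0;
	}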
There are other scenarios, but they require interrupts to be disabled
across the rcu_read_unlock{,_notrace}(), but to have been enabled somewhere
in the just-ended RCU read-side critical section. It does not look to
me like tracing does this. But I might be missing something. If so,
we have more scenarios to think through. ;-)
> > > > > There are probably other possible approaches I am missing, each with
> > > > > their respective trade offs.
> > > >
> > > > I am pretty sure that we also have some ways to go before we have the
> > > > requirements fully laid out, for that matter. ;-)
> > > >
> > > > Could you please tell me where in the current tracing code these
> > > > rcu_read_lock_notrace()/rcu_read_unlock_notrace() calls would be placed?
> > >
> > > AFAIU here:
> > >
> > > include/linux/tracepoint.h:
> > >
> > > #define __DECLARE_TRACE(name, proto, args, cond, data_proto) [...]
> > >
> > > static inline void __do_trace_##name(proto) \
> > > { \
> > > if (cond) { \
> > > guard(preempt_notrace)(); \
> > > __DO_TRACE_CALL(name, TP_ARGS(args)); \
> > > } \
> > > } \
> > > static inline void trace_##name(proto) \
> > > { \
> > > if (static_branch_unlikely(&__tracepoint_##name.key)) \
> > > __do_trace_##name(args); \
> > > if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
> > > WARN_ONCE(!rcu_is_watching(), \
> > > "RCU not watching for tracepoint"); \
> > > } \
> > > }
> > >
> > > and
> > >
> > > #define __DECLARE_TRACE_SYSCALL(name, proto, args, data_proto) [...]
> > >
> > > static inline void __do_trace_##name(proto) \
> > > { \
> > > guard(rcu_tasks_trace)(); \
> > > __DO_TRACE_CALL(name, TP_ARGS(args)); \
> > > } \
> > > static inline void trace_##name(proto) \
> > > { \
> > > might_fault(); \
> > > if (static_branch_unlikely(&__tracepoint_##name.key)) \
> > > __do_trace_##name(args); \
> > > if (IS_ENABLED(CONFIG_LOCKDEP)) { \
> > > WARN_ONCE(!rcu_is_watching(), \
> > > "RCU not watching for tracepoint"); \
> > > } \
> > > }
> >
> > I am not seeing a guard(rcu)() in here, only guard(preempt_notrace)()
> > and guard(rcu_tasks_trace)(). Or is the idea to move the first to
> > guard(rcu_notrace)() in order to improve PREEMPT_RT latency?
>
> AFAIU the goal here is to turn the guard(preempt_notrace)() into a
> guard(rcu_notrace)() because the preempt-off critical sections don't
> agree with BPF.
OK, got it, thank you!
The combination of BPF and CONFIG_PREEMPT_RT certainly has provided at
least its share of entertainment, that is for sure. ;-)
Thanx, Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-11 17:05 ` Paul E. McKenney
@ 2025-07-14 16:34 ` Paul E. McKenney
2025-07-15 19:56 ` Mathieu Desnoyers
2025-07-15 19:54 ` Mathieu Desnoyers
2025-07-16 15:09 ` Steven Rostedt
2 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-14 16:34 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On Fri, Jul 11, 2025 at 10:05:26AM -0700, Paul E. McKenney wrote:
> On Fri, Jul 11, 2025 at 09:46:25AM -0400, Mathieu Desnoyers wrote:
[ . . . ]
> > AFAIU the goal here is to turn the guard(preempt_notrace)() into a
> > guard(rcu_notrace)() because the preempt-off critical sections don't
> > agree with BPF.
>
> OK, got it, thank you!
>
> The combination of BPF and CONFIG_PREEMPT_RT certainly has provided at
> least its share of entertainment, that is for sure. ;-)
Is there a similar issue with CONFIG_PREEMPT_LAZY, given that in that
case rcu_read_unlock() maps to preempt_enable(), which also can invoke
the scheduler?
Thanx, Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-11 17:05 ` Paul E. McKenney
2025-07-14 16:34 ` Paul E. McKenney
@ 2025-07-15 19:54 ` Mathieu Desnoyers
2025-07-15 23:18 ` Paul E. McKenney
2025-07-16 15:09 ` Steven Rostedt
2 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2025-07-15 19:54 UTC (permalink / raw)
To: paulmck
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On 2025-07-11 13:05, Paul E. McKenney wrote:
> On Fri, Jul 11, 2025 at 09:46:25AM -0400, Mathieu Desnoyers wrote:
>> On 2025-07-09 14:33, Paul E. McKenney wrote:
>>> On Wed, Jul 09, 2025 at 10:31:14AM -0400, Mathieu Desnoyers wrote:
[...]
>>>
>>> Joel's patch, which is currently slated for the upcoming merge window,
>>> should take care of the endless-IRQ-work recursion problem. So the
>>> main remaining issue is how rcu_read_unlock_special() should go about
>>> safely invoking raise_softirq_irqoff() and irq_work_queue_on() when in
>>> notrace mode.
>>
>> Are those two functions only needed from the outermost rcu_read_unlock
>> within a thread ? If yes, then keeping track of nesting level and
>> preventing those calls in nested contexts (per thread) should work.
>
> You lost me on this one.
>
> Yes, these are invoked only by the outermost rcu_read_unlock{,_notrace}().
> And we already have a nexting counter, current->rcu_read_lock_nesting.
But AFAIU those are invoked after decrementing the nesting counter,
right ? So any instrumentation call done within those functions may
end up doing a read-side critical section again.
>
> However...
>
> If the tracing invokes an outermost rcu_read_unlock{,_notrace}(), then in
> some contexts we absolutely need to invoke the raise_softirq_irqoff()
> and irq_work_queue_on() functions, both of which are notrace functions.
I guess you mean "both of which are non-notrace functions", otherwise
we would not be having this discussion.
>
> Or are you telling me that it is OK for a rcu_read_unlock_notrace()
> to directly call these non-notrace functions?
What I am getting at is that it may be OK for the outermost nesting
level of rcu_read_unlock_notrace() to call those non-notrace functions,
but only if we manage to keep track of that nesting level while those
non-notrace functions are called.
So AFAIU one issue here is that when the non-notrace functions are
called, the nesting level is back to 0 already.
>
>>>>>> - Keep some nesting count in the task struct to prevent calling the
>>>>>> instrumentation when nested in notrace,
>>>>>
>>>>> OK, for this one, is the idea to invoke some TBD RCU API when the tracing
>>>>> exits the notrace region? I could see that working. But there would
>>>>> need to be a guarantee that if the notrace API was invoked, a call to
>>>>> this TBD RCU API would follow in short order. And I suspect that
>>>>> preemption (and probably also interrupts) would need to be disabled
>>>>> across this region.
>>>>
>>>> Not quite.
>>>>
>>>> What I have in mind is to try to find the most elegant way to prevent
>>>> endless recursion of the irq work issued immediately on
>>>> rcu_read_unlock_notrace without slowing down most fast paths, and
>>>> ideally without too much code duplication.
>>>>
>>>> I'm not entirely sure what would be the best approach though.
>>>
>>> Joel's patch adjusts use of the rcu_data structure's ->defer_qs_iw_pending
>>> flag, so that it is cleared not in the IRQ-work handler, but
>>> instead in rcu_preempt_deferred_qs_irqrestore(). That prevents
>>> rcu_read_unlock_special() from requeueing the IRQ-work handler until
>>> after the previous request for a quiescent state has been satisfied.
>>>
>>> So my main concern is again safely invoking raise_softirq_irqoff()
>>> and irq_work_queue_on() when in notrace mode.
>>
>> Would the nesting counter (per thread) approach suffice for your use
>> case ?
>
> Over and above the t->rcu_read_lock_nesting that we already use?
> As in only the outermost rcu_read_unlock{,_notrace}() will invoke
> rcu_read_unlock_special().
>
> OK, let's look at a couple of scenarios.
>
> First, suppose that we apply Joel's patch above, and someone sets a trace
> point in task context outside of any RCU read-side critical section.
> Suppose further that this task is preempted in the tracepoint's RCU
> read-side critical section, and that RCU priority boosting is applied.
>
> This trace point will invoke rcu_read_unlock{,_notrace}(), which
> will in turn invoke rcu_read_unlock_special(), which will in turn
> > note that preemption, interrupts, and softirqs are all enabled.
> It will therefore directly invoke rcu_preempt_deferred_qs_irqrestore(),
> a non-notrace function, which can in turn invoke all sorts of interesting
> functions involving locking, the scheduler, ...
>
> Is this OK, or should I set some sort of tracepoint recursion flag?
Or somehow modify the semantics of t->rcu_read_lock_nesting if at all
possible. Rather than decrementing it first and then, if it is 0,
invoking rcu_read_unlock_special(), it could perhaps invoke
rcu_read_unlock_special() if the nesting counter is _about to be
decremented from 1 to 0_, and then decrement it to 0. This would
hopefully prevent recursion.
But I may be entirely misunderstanding the whole problem. If so,
please let me know!
And if for some reason it really needs to be decremented before
calling rcu_read_unlock_special, then we can have the following:
when exiting the outermost critical section, it could be decremented
from 2 to 1, then rcu_read_unlock_special() could be called, after which it is
decremented to 0. The outermost read lock increment would have to
be adapted accordingly. But this would add overhead on the fast-paths,
which may be frowned upon.
The idea here is to keep tracking the fact that we are within the
execution of rcu_read_unlock_special(), so instrumentation does not call it again
recursively, even though we are technically not nested within a
read-side critical section anymore.
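In code, the first variant would look something like the sketch below;
everything here is simplified, only the ordering is the point, and whether
rcu_read_unlock_special() copes with being invoked while the nesting count
is still 1 is exactly the kind of detail that would need auditing:
/* Sketch of the proposed ordering change only; not a working implementation. */
void __rcu_read_unlock(void)
{
	struct task_struct *t = current;

	if (READ_ONCE(t->rcu_read_lock_nesting) == 1 &&
	    unlikely(READ_ONCE(t->rcu_read_unlock_special.s))) {
		/* Still formally inside the critical section, so any
		 * tracepoint reached from here merely nests. */
		rcu_read_unlock_special(t);
	}
	WRITE_ONCE(t->rcu_read_lock_nesting,
		   READ_ONCE(t->rcu_read_lock_nesting) - 1);
}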
>
> Second, suppose that we apply Joel's patch above, and someone sets a trace
> point in task context outside of an RCU read-side critical section, but in
> a preemption-disabled region of code. Suppose further that this code is
> delayed, perhaps due to a flurry of interrupts, so that a scheduling-clock
> interrupt sets t->rcu_read_unlock_special.b.need_qs to true.
>
> This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> note that preemption is disabled. If rcutree.use_softirq is set and
> this task is blocking an expedited RCU grace period, it will directly
> invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> it will directly invoke the non-notrace function irq_work_queue_on().
>
> Is this OK, or should I set some sort of tracepoint recursion flag?
Invoking instrumentation from the implementation of instrumentation
is a good recipe for endless recursion, so we'd need to check for
recursion somehow there as well AFAIU.
>
> There are other scenarios, but they require interrupts to be disabled
> across the rcu_read_unlock{,_notrace}(), but to have been enabled somewhere
> in the just-ended RCU read-side critical section. It does not look to
> me like tracing does this. But I might be missing something. If so,
> we have more scenarios to think through. ;-)
I don't see a good use-case for that kind of scenario though. But I may
simply be lacking imagination.
>
>>>>>> There are probably other possible approaches I am missing, each with
>>>>>> their respective trade offs.
>>>>>
>>>>> I am pretty sure that we also have some ways to go before we have the
>>>>> requirements fully laid out, for that matter. ;-)
>>>>>
>>>>> Could you please tell me where in the current tracing code these
>>>>> rcu_read_lock_notrace()/rcu_read_unlock_notrace() calls would be placed?
>>>>
>>>> AFAIU here:
>>>>
>>>> include/linux/tracepoint.h:
>>>>
>>>> #define __DECLARE_TRACE(name, proto, args, cond, data_proto) [...]
>>>>
>>>> static inline void __do_trace_##name(proto) \
>>>> { \
>>>> if (cond) { \
>>>> guard(preempt_notrace)(); \
>>>> __DO_TRACE_CALL(name, TP_ARGS(args)); \
>>>> } \
>>>> } \
>>>> static inline void trace_##name(proto) \
>>>> { \
>>>> if (static_branch_unlikely(&__tracepoint_##name.key)) \
>>>> __do_trace_##name(args); \
>>>> if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
>>>> WARN_ONCE(!rcu_is_watching(), \
>>>> "RCU not watching for tracepoint"); \
>>>> } \
>>>> }
>>>>
>>>> and
>>>>
>>>> #define __DECLARE_TRACE_SYSCALL(name, proto, args, data_proto) [...]
>>>>
>>>> static inline void __do_trace_##name(proto) \
>>>> { \
>>>> guard(rcu_tasks_trace)(); \
>>>> __DO_TRACE_CALL(name, TP_ARGS(args)); \
>>>> } \
>>>> static inline void trace_##name(proto) \
>>>> { \
>>>> might_fault(); \
>>>> if (static_branch_unlikely(&__tracepoint_##name.key)) \
>>>> __do_trace_##name(args); \
>>>> if (IS_ENABLED(CONFIG_LOCKDEP)) { \
>>>> WARN_ONCE(!rcu_is_watching(), \
>>>> "RCU not watching for tracepoint"); \
>>>> } \
>>>> }
>>>
>>> I am not seeing a guard(rcu)() in here, only guard(preempt_notrace)()
>>> and guard(rcu_tasks_trace)(). Or is the idea to move the first to
>>> guard(rcu_notrace)() in order to improve PREEMPT_RT latency?
>>
>> AFAIU the goal here is to turn the guard(preempt_notrace)() into a
>> guard(rcu_notrace)() because the preempt-off critical sections don't
>> agree with BPF.
>
> OK, got it, thank you!
>
> The combination of BPF and CONFIG_PREEMPT_RT certainly has provided at
> least its share of entertainment, that is for sure. ;-)
There is indeed no shortage of entertainment when combining those rather
distinct sets of requirements. :)
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-14 16:34 ` Paul E. McKenney
@ 2025-07-15 19:56 ` Mathieu Desnoyers
2025-07-15 23:23 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2025-07-15 19:56 UTC (permalink / raw)
To: paulmck
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On 2025-07-14 12:34, Paul E. McKenney wrote:
> On Fri, Jul 11, 2025 at 10:05:26AM -0700, Paul E. McKenney wrote:
>> On Fri, Jul 11, 2025 at 09:46:25AM -0400, Mathieu Desnoyers wrote:
>
> [ . . . ]
>
>>> AFAIU the goal here is to turn the guard(preempt_notrace)() into a
>>> guard(rcu_notrace)() because the preempt-off critical sections don't
>>> agree with BPF.
>>
>> OK, got it, thank you!
>>
>> The combination of BPF and CONFIG_PREEMPT_RT certainly has provided at
>> least its share of entertainment, that is for sure. ;-)
>
> Is there a similar issue with CONFIG_PREEMPT_LAZY, given that in that
> case rcu_read_unlock() maps to preempt_enable(), which also can invoke
> the scheduler?
I'll have to defer that question to someone who is more familiar than me
with the internals of CONFIG_PREEMPT_LAZY.
Thanks for your vote of confidence though! ;-)
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-15 19:54 ` Mathieu Desnoyers
@ 2025-07-15 23:18 ` Paul E. McKenney
2025-07-16 0:42 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-15 23:18 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On Tue, Jul 15, 2025 at 03:54:02PM -0400, Mathieu Desnoyers wrote:
> On 2025-07-11 13:05, Paul E. McKenney wrote:
> > On Fri, Jul 11, 2025 at 09:46:25AM -0400, Mathieu Desnoyers wrote:
> > > On 2025-07-09 14:33, Paul E. McKenney wrote:
> > > > On Wed, Jul 09, 2025 at 10:31:14AM -0400, Mathieu Desnoyers wrote:
First, do we have a well-defined repeat-by to reproduce this issue?
Failing that, do we have a well-defined problem statement? This is
all I have thus far:
o Disabling preemption at tracepoints causes problems for BPF.
(I have not yet checked this with the BPF folks, but last I knew
it was OK to attach BPF programs to preempt-disabled regions
of code, but perhaps this issue is specific to PREEMPT_RT.
Perhaps involving memory allocation.)
o Thus, there is a desire to move tracepoints from using
preempt_disable_notrace() to a new rcu_read_lock_notrace()
whose required semantics I do not yet understand. However,
from a first-principles RCU perspective, it must not unduly
delay expedited grace periods and RCU priority deboosting.
It also must not starve normal grace periods.
o We clearly need to avoid destructive recursion in both tracepoints
and in BPF.
o Apparently, the interval across which tracepoint/BPF recursion is
destructive extends beyond the proposed rcu_read_lock_notrace()
critical section. (If so, how far beyond, and can the RCU reader
be extended to cover the full extent? If not, why do we have
a problem?)
The definition of __DECLARE_TRACE looks to me like the RCU
reader does in fact cover the full extent of the region in
which (finite) recursion is destructive, at least given
Joel's aforementioned IRQ-work patch:
2e154d164418 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
Without that patch, yes, life is recursively hard. So recursively
hard, in fact, that the recursion itself kills you before you
have a chance to die.
Except that, assuming that I understand this (ha!), we also need
to prevent rcu_read_unlock_special() from directly invoking
rcu_preempt_deferred_qs_irqrestore(). The usual PREEMPT_RT
configuration won't ever invoke raise_softirq_irqoff(), but
maybe other configurations will. But there are similar issues
with irq_work_queue_on().
We have some non-problems:
o If rcu_read_unlock() or one of its friends is invoked with a
scheduler lock held, then interrupts will be disabled, which
will cause rcu_read_unlock_special() to defer its calls into
the scheduler, for example, via IRQ work.
o NMI safety is already provided.
Have you guys been working with anyone on the BPF team? If so, I should
reach out to that person, if not, I should find someone in BPF-land to
reach out to. They might have some useful feedback.
> [...]
> > > >
> > > > Joel's patch, which is currently slated for the upcoming merge window,
> > > > should take care of the endless-IRQ-work recursion problem. So the
> > > > main remaining issue is how rcu_read_unlock_special() should go about
> > > > safely invoking raise_softirq_irqoff() and irq_work_queue_on() when in
> > > > notrace mode.
> > >
> > > Are those two functions only needed from the outermost rcu_read_unlock
> > > within a thread ? If yes, then keeping track of nesting level and
> > > preventing those calls in nested contexts (per thread) should work.
> >
> > You lost me on this one.
> >
> > Yes, these are invoked only by the outermost rcu_read_unlock{,_notrace}().
> > And we already have a nesting counter, current->rcu_read_lock_nesting.
>
> But AFAIU those are invoked after decrementing the nesting counter,
> right ? So any instrumentation call done within those functions may
> end up doing a read-side critical section again.
They are indeed invoked after decrementing ->rcu_read_lock_nesting.
> > However...
> >
> > If the tracing invokes an outermost rcu_read_unlock{,_notrace}(), then in
> > some contexts we absolutely need to invoke the raise_softirq_irqoff()
> > and irq_work_queue_on() functions, both of which are notrace functions.
>
> I guess you mean "both of which are non-notrace functions", otherwise
> we would not be having this discussion.
>
> >
> > Or are you telling me that it is OK for a rcu_read_unlock_notrace()
> > to directly call these non-notrace functions?
>
> What I am getting at is that it may be OK for the outermost nesting
> level of rcu_read_unlock_notrace() to call those non-notrace functions,
> but only if we manage to keep track of that nesting level while those
> non-notrace functions are called.
>
> So AFAIU one issue here is that when the non-notrace functions are
> called, the nesting level is back to 0 already.
So you are worried about something like this?
rcu_read_unlock() -> rcu_read_unlock_special() ->
rcu_preempt_deferred_qs_irqrestore() -> *tracepoint* ->
rcu_read_unlock() -> rcu_read_unlock_special() ->
rcu_preempt_deferred_qs_irqrestore() -> *tracepoint* ->
rcu_read_unlock() -> rcu_read_unlock_special() ->
rcu_preempt_deferred_qs_irqrestore() -> *tracepoint* ->
rcu_read_unlock() -> rcu_read_unlock_special() ->
rcu_preempt_deferred_qs_irqrestore() -> *tracepoint* ->
And so on forever?
Ditto for irq_work_queue_on()?
> > > > > > > - Keep some nesting count in the task struct to prevent calling the
> > > > > > > instrumentation when nested in notrace,
> > > > > >
> > > > > > OK, for this one, is the idea to invoke some TBD RCU API the tracing
> > > > > > exits the notrace region? I could see that working. But there would
> > > > > > need to be a guarantee that if the notrace API was invoked, a call to
> > > > > > this TBD RCU API would follow in short order. And I suspect that
> > > > > > preemption (and probably also interrupts) would need to be disabled
> > > > > > across this region.
> > > > >
> > > > > No quite.
> > > > >
> > > > > What I have in mind is to try to find the most elegant way to prevent
> > > > > endless recursion of the irq work issued immediately on
> > > > > rcu_read_unlock_notrace without slowing down most fast paths, and
> > > > > ideally without too much code duplication.
> > > > >
> > > > > I'm not entirely sure what would be the best approach though.
> > > >
> > > > Joel's patch adjusts use of the rcu_data structure's ->defer_qs_iw_pending
> > > > flag, so that it is cleared not in the IRQ-work handler, but
> > > > instead in rcu_preempt_deferred_qs_irqrestore(). That prevents
> > > > rcu_read_unlock_special() from requeueing the IRQ-work handler until
> > > > after the previous request for a quiescent state has been satisfied.
> > > >
> > > > So my main concern is again safely invoking raise_softirq_irqoff()
> > > > and irq_work_queue_on() when in notrace mode.
> > >
> > > Would the nesting counter (per thread) approach suffice for your use
> > > case ?
> >
> > Over and above the t->rcu_read_lock_nesting that we already use?
> > As in only the outermost rcu_read_unlock{,_notrace}() will invoke
> > rcu_read_unlock_special().
> >
> > OK, let's look at a couple of scenarios.
> >
> > First, suppose that we apply Joel's patch above, and someone sets a trace
> > point in task context outside of any RCU read-side critical section.
> > Suppose further that this task is preempted in the tracepoint's RCU
> > read-side critical section, and that RCU priority boosting is applied.
> >
> > This trace point will invoke rcu_read_unlock{,_notrace}(), which
> > will in turn invoke rcu_read_unlock_special(), which will in turn
> > will note that preemption, interrupts, and softirqs are all enabled.
> > It will therefore directly invoke rcu_preempt_deferred_qs_irqrestore(),
> > a non-notrace function, which can in turn invoke all sorts of interesting
> > functions involving locking, the scheduler, ...
> >
> > Is this OK, or should I set some sort of tracepoint recursion flag?
>
> Or somehow modify the semantic of t->rcu_read_lock_nesting if at all
> possible. Rather than decrementing it first and then if 0 invoke
> a rcu_read_unlock_special, it could perhaps invoke
> rcu_read_unlock_special if the nesting counter is _about to be
> decremented from 1 to 0_, and then decrement to 0. This would
> hopefully prevent recursion.
>
> But I may be entirely misunderstanding the whole problem. If so,
> please let me know!
>
> And if for some reason is really needs to be decremented before
> calling rcu_read_unlock_special, then we can have the following:
> when exiting the outermost critical section, it could be decremented
> from 2 to 1, then call rcu_read_unlock_special, after which it's
> decremented to 0. The outermost read lock increment would have to
> be adapted accordingly. But this would add overhead on the fast-paths,
> which may be frowned upon.
>
> The idea here is to keep tracking the fact that we are within the
> execution of rcu_read_unlock_special, so it does not call it again
> recursively, even though we are technically not nested within a
> read-side critical section anymore.
Heh! Some years back, rcu_read_unlock() used to do exactly that.
This changed in 2020 with this commit:
5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
Quoting the commit log: "it is no longer necessary for __rcu_read_unlock()
to set the nesting level negative."
No longer necessary until now, perhaps?
How to revert this puppy? Let's see...
The addition of "depth" to rcu_exp_handler() can stay, but that last hunk
needs to be restored. And the rcu_data structure's ->exp_deferred_qs
might need to come back. Ah, but it is now ->cpu_no_qs.b.exp in that
same structure. So should be fine. (Famous last words.)
And rcu_dynticks_curr_cpu_in_eqs() is no longer with us. I believe that
it is now named !rcu_is_watching_curr_cpu(), but Frederic would know best.
On to kernel/rcu/tree_plugin.h...
We need RCU_NEST_BIAS and RCU_NEST_NMAX back. Do we want to revert
the change to __rcu_read_unlock() or just leave it alone (for the
performance benefit, minuscule though it might be) and create an
__rcu_read_unlock_notrace()? The former is simpler, especially
from an rcutorture testing viewpoint. So the former it is, unless
and until someone like Lai Jiangshan (already CCed) can show a
strong need. This would require an rcu_read_unlock_notrace() and an
__rcu_read_unlock_notrace(), with near-duplicate code. Not a disaster,
but let's do it only if we really need it.
So put rcu_preempt_read_exit() back the way it was, and ditto for
__rcu_read_unlock(), rcu_preempt_need_deferred_qs(), and
rcu_flavor_sched_clock_irq().
Note that RCU_NEST_PMAX stayed and is still checked in __rcu_read_lock(),
so that part remained.
Give or take the inevitable bugs. Initial testing is in progress.
The initial patch may be found at the end of this email, but it should
be used for entertainment purposes only.
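In condensed form, the core of that revert is the negative-bias trick in
__rcu_read_unlock(): while the outermost unlock runs its special
processing, the nesting depth is parked at a large negative value, so any
rcu_read_unlock() reached from instrumentation during that window sees a
non-zero depth and skips rcu_read_unlock_special(). A simplified sketch,
mirroring the patch at the end of this email rather than current mainline
(compiler barriers omitted for brevity):

void __rcu_read_unlock(void)
{
	struct task_struct *t = current;

	if (rcu_preempt_depth() != 1) {
		rcu_preempt_read_exit();	/* nested: just decrement */
	} else {
		/* Park the depth at a negative value across the special work. */
		rcu_preempt_depth_set(-RCU_NEST_BIAS);
		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
			rcu_read_unlock_special(t);	/* may reach tracepoints */
		/* Only now does the depth drop to zero. */
		rcu_preempt_depth_set(0);
	}
}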
> > Second, suppose that we apply Joel's patch above, and someone sets a trace
> > point in task context outside of an RCU read-side critical section, but in
> > an preemption-disabled region of code. Suppose further that this code is
> > delayed, perhaps due to a flurry of interrupts, so that a scheduling-clock
> > interrupt sets t->rcu_read_unlock_special.b.need_qs to true.
> >
> > This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> > note that preemption is disabled. If rcutree.use_softirq is set and
> > this task is blocking an expedited RCU grace period, it will directly
> > invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> > it will directly invoke the non-notrace function irq_work_queue_on().
> >
> > Is this OK, or should I set some sort of tracepoint recursion flag?
>
> Invoking instrumentation from the implementation of instrumentation
> is a good recipe for endless recursion, so we'd need to check for
> recursion somehow there as well AFAIU.
Agreed, it could get ugly.
> > There are other scenarios, but they require interrupts to be disabled
> > across the rcu_read_unlock{,_notrace}(), but to have been enabled somewhere
> > in the just-ended RCU read-side critical section. It does not look to
> > me like tracing does this. But I might be missing something. If so,
> > we have more scenarios to think through. ;-)
>
> I don't see a good use-case for that kind of scenario though. But I may
> simply be lacking imagination.
There was a time when there was an explicit rule against it, so yes,
they have existed in the past. If they exist now, that is OK.
> > > > > > > There are probably other possible approaches I am missing, each with
> > > > > > > their respective trade offs.
> > > > > >
> > > > > > I am pretty sure that we also have some ways to go before we have the
> > > > > > requirements fully laid out, for that matter. ;-)
> > > > > >
> > > > > > Could you please tell me where in the current tracing code these
> > > > > > rcu_read_lock_notrace()/rcu_read_unlock_notrace() calls would be placed?
> > > > >
> > > > > AFAIU here:
> > > > >
> > > > > include/linux/tracepoint.h:
> > > > >
> > > > > #define __DECLARE_TRACE(name, proto, args, cond, data_proto) [...]
> > > > >
> > > > > static inline void __do_trace_##name(proto) \
> > > > > { \
> > > > > if (cond) { \
> > > > > guard(preempt_notrace)(); \
> > > > > __DO_TRACE_CALL(name, TP_ARGS(args)); \
> > > > > } \
> > > > > } \
> > > > > static inline void trace_##name(proto) \
> > > > > { \
> > > > > if (static_branch_unlikely(&__tracepoint_##name.key)) \
> > > > > __do_trace_##name(args); \
> > > > > if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
> > > > > WARN_ONCE(!rcu_is_watching(), \
> > > > > "RCU not watching for tracepoint"); \
> > > > > } \
> > > > > }
> > > > >
> > > > > and
> > > > >
> > > > > #define __DECLARE_TRACE_SYSCALL(name, proto, args, data_proto) [...]
> > > > >
> > > > > static inline void __do_trace_##name(proto) \
> > > > > { \
> > > > > guard(rcu_tasks_trace)(); \
> > > > > __DO_TRACE_CALL(name, TP_ARGS(args)); \
> > > > > } \
> > > > > static inline void trace_##name(proto) \
> > > > > { \
> > > > > might_fault(); \
> > > > > if (static_branch_unlikely(&__tracepoint_##name.key)) \
> > > > > __do_trace_##name(args); \
> > > > > if (IS_ENABLED(CONFIG_LOCKDEP)) { \
> > > > > WARN_ONCE(!rcu_is_watching(), \
> > > > > "RCU not watching for tracepoint"); \
> > > > > } \
> > > > > }
> > > >
> > > > I am not seeing a guard(rcu)() in here, only guard(preempt_notrace)()
> > > > and guard(rcu_tasks_trace)(). Or is the idea to move the first to
> > > > guard(rcu_notrace)() in order to improve PREEMPT_RT latency?
> > >
> > > AFAIU the goal here is to turn the guard(preempt_notrace)() into a
> > > guard(rcu_notrace)() because the preempt-off critical sections don't
> > > agree with BPF.
> >
> > OK, got it, thank you!
> >
> > The combination of BPF and CONFIG_PREEMPT_RT certainly has provided at
> > least its share of entertainment, that is for sure. ;-)
>
> There is indeed no shortage of entertainment when combining those rather
> distinct sets of requirements. :)
I am sure that Mr. Murphy has more entertainment in store for all of us.
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 076ad61e42f4a..33bed40f2b024 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -805,8 +805,32 @@ static void rcu_exp_handler(void *unused)
return;
}
- // Fourth and finally, negative nesting depth should not happen.
- WARN_ON_ONCE(1);
+ /*
+ * The final and least likely case is where the interrupted
+ * code was just about to or just finished exiting the
+ * RCU-preempt read-side critical section when using
+ * rcu_read_unlock_notrace(), and no, we can't tell which.
+ * So either way, set ->cpu_no_qs.b.exp to flag later code that
+ * a quiescent state is required.
+ *
+ * If the CPU has preemption and softirq enabled (or if some
+ * buggy no-trace RCU-preempt read-side critical section is
+ * being used from idle), just invoke rcu_preempt_deferred_qs()
+ * to immediately report the quiescent state. We cannot use
+ * rcu_read_unlock_special() because we are in an interrupt handler,
+ * which will cause that function to take an early exit without
+ * doing anything.
+ *
+ * Otherwise, force a context switch after the CPU enables everything.
+ */
+ rdp->cpu_no_qs.b.exp = true;
+ if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
+ WARN_ON_ONCE(!rcu_is_watching_curr_cpu())) {
+ rcu_preempt_deferred_qs(t);
+ } else {
+ set_tsk_need_resched(t);
+ set_preempt_need_resched();
+ }
}
/*
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 1ee0d34ec333a..4becfe51e0e14 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -383,6 +383,9 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
return READ_ONCE(rnp->gp_tasks) != NULL;
}
+/* Bias and limit values for ->rcu_read_lock_nesting. */
+#define RCU_NEST_BIAS INT_MAX
+#define RCU_NEST_NMAX (-INT_MAX / 2)
/* limit value for ->rcu_read_lock_nesting. */
#define RCU_NEST_PMAX (INT_MAX / 2)
@@ -391,12 +394,10 @@ static void rcu_preempt_read_enter(void)
WRITE_ONCE(current->rcu_read_lock_nesting, READ_ONCE(current->rcu_read_lock_nesting) + 1);
}
-static int rcu_preempt_read_exit(void)
+static void rcu_preempt_read_exit(void)
{
- int ret = READ_ONCE(current->rcu_read_lock_nesting) - 1;
- WRITE_ONCE(current->rcu_read_lock_nesting, ret);
- return ret;
+ WRITE_ONCE(current->rcu_read_lock_nesting, READ_ONCE(current->rcu_read_lock_nesting) - 1);
}
static void rcu_preempt_depth_set(int val)
@@ -431,16 +432,22 @@ void __rcu_read_unlock(void)
{
struct task_struct *t = current;
- barrier(); // critical section before exit code.
- if (rcu_preempt_read_exit() == 0) {
- barrier(); // critical-section exit before .s check.
+ if (rcu_preempt_depth() != 1) {
+ rcu_preempt_read_exit();
+ } else {
+ barrier(); // critical section before exit code.
+ rcu_preempt_depth_set(-RCU_NEST_BIAS);
+ barrier(); // critical section before ->rcu_read_unlock_special load
if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
rcu_read_unlock_special(t);
+ barrier(); // ->rcu_read_unlock_special load before assignment
+ rcu_preempt_depth_set(0);
}
if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
int rrln = rcu_preempt_depth();
- WARN_ON_ONCE(rrln < 0 || rrln > RCU_NEST_PMAX);
+ WARN_ON_ONCE(rrln < 0 && rrln > RCU_NEST_NMAX);
+ WARN_ON_ONCE(rrln > RCU_NEST_PMAX);
}
}
EXPORT_SYMBOL_GPL(__rcu_read_unlock);
@@ -601,7 +608,7 @@ static notrace bool rcu_preempt_need_deferred_qs(struct task_struct *t)
{
return (__this_cpu_read(rcu_data.cpu_no_qs.b.exp) ||
READ_ONCE(t->rcu_read_unlock_special.s)) &&
- rcu_preempt_depth() == 0;
+ rcu_preempt_depth() <= 0;
}
/*
@@ -755,8 +762,8 @@ static void rcu_flavor_sched_clock_irq(int user)
} else if (rcu_preempt_need_deferred_qs(t)) {
rcu_preempt_deferred_qs(t); /* Report deferred QS. */
return;
- } else if (!WARN_ON_ONCE(rcu_preempt_depth())) {
- rcu_qs(); /* Report immediate QS. */
+ } else if (!rcu_preempt_depth()) {
+ rcu_qs(); /* On our way out anyway, so report immediate QS. */
return;
}
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-15 19:56 ` Mathieu Desnoyers
@ 2025-07-15 23:23 ` Paul E. McKenney
0 siblings, 0 replies; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-15 23:23 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On Tue, Jul 15, 2025 at 03:56:06PM -0400, Mathieu Desnoyers wrote:
> On 2025-07-14 12:34, Paul E. McKenney wrote:
> > On Fri, Jul 11, 2025 at 10:05:26AM -0700, Paul E. McKenney wrote:
> > > On Fri, Jul 11, 2025 at 09:46:25AM -0400, Mathieu Desnoyers wrote:
> >
> > [ . . . ]
> >
> > > > AFAIU the goal here is to turn the guard(preempt_notrace)() into a
> > > > guard(rcu_notrace)() because the preempt-off critical sections don't
> > > > agree with BPF.
> > >
> > > OK, got it, thank you!
> > >
> > > The combination of BPF and CONFIG_PREEMPT_RT certainly has provided at
> > > least its share of entertainment, that is for sure. ;-)
> >
> > Is there a similar issue with CONFIG_PREEMPT_LAZY, given that in that
> > case rcu_read_unlock() maps to preempt_enable(), which also can invoke
> > the scheduler?
>
> I'll have to defer that question to someone who is more familiar than me
> with the internals of CONFIG_PREEMPT_LAZY.
>
> Thanks for your vote of confidence though! ;-)
Mathieu, we always have confidence in you! ;-)
Building with CONFIG_PREEMPT_LAZY=y combines a quasi-preemptible kernel
with a non-preemptible RCU, which has rcu_read_unlock() defined as
preempt_enable(). There might need to be an rcu_read_unlock_notrace()
defined as preempt_enable_notrace(), given that __DECLARE_TRACE()
currently sees fit to use guard(preempt_notrace)().
Or am I overthinking this?
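A minimal sketch of that fallback, assuming (and this is an assumption,
not something mainline currently defines) that the notrace RCU readers
simply wrap the notrace preempt operations whenever RCU itself is not
preemptible:

#ifndef CONFIG_PREEMPT_RCU
/* Hypothetical fallback when RCU is not preemptible, as with CONFIG_PREEMPT_LAZY=y above. */
static __always_inline void rcu_read_lock_notrace(void)
{
	preempt_disable_notrace();
}

static __always_inline void rcu_read_unlock_notrace(void)
{
	preempt_enable_notrace();
}
#endif /* !CONFIG_PREEMPT_RCU */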
Thanx, Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-15 23:18 ` Paul E. McKenney
@ 2025-07-16 0:42 ` Paul E. McKenney
2025-07-16 4:41 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-16 0:42 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On Tue, Jul 15, 2025 at 04:18:45PM -0700, Paul E. McKenney wrote:
> On Tue, Jul 15, 2025 at 03:54:02PM -0400, Mathieu Desnoyers wrote:
> > On 2025-07-11 13:05, Paul E. McKenney wrote:
> > > On Fri, Jul 11, 2025 at 09:46:25AM -0400, Mathieu Desnoyers wrote:
> > > > On 2025-07-09 14:33, Paul E. McKenney wrote:
> > > > > On Wed, Jul 09, 2025 at 10:31:14AM -0400, Mathieu Desnoyers wrote:
>
> First, do we have a well-defined repeat-by to reproduce this issue?
>
> Failing that, do we have a well-defined problem statement? This is
> all I have thus far:
>
> o Disabling preemption at tracepoints causes problems for BPF.
> (I have not yet checked this with the BPF folks, but last I knew
> it was OK to attach BPF programs to preempt-disabled regions
> of code, but perhaps this issue is specific to PREEMPT_RT.
> Perhaps involving memory allocation.)
However, I did run the following BPF program on an ARM server:
bpftrace -e 'kfunc:rcu_note_context_switch { @func = count(); }'
This worked just fine, despite the fact that rcu_note_context_switch()
is invoked not just with preemption disabled, but also with interrupts
disabled. This is admittedly not a CONFIG_PREEMPT_RT=y kernel, but it
certainly supports my belief that BPF programs are intended to be able
to be attached to preempt-disabled code.
So please help me out here. Exactly what breaks due to that
guard(preempt_notrace)() in __DECLARE_TRACE()?
(My guess? BPF programs are required to be preemptible in kernels built
with CONFIG_PREEMPT_RT=y, and the bit about attaching BPF programs to
non-preemptible code didn't come up. Hey, it sounds like something
I might do...)
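For reference, the swap under discussion would look roughly like this at
the tracepoint site, with the proposed notrace RCU reader replacing the
current guard(preempt_notrace)() (hypothetical sketch;
rcu_read_lock_notrace()/rcu_read_unlock_notrace() are what this series
proposes, not something mainline already provides):

static inline void __do_trace_##name(proto)				\
{									\
	if (cond) {							\
		rcu_read_lock_notrace();	/* was: guard(preempt_notrace)(); */ \
		__DO_TRACE_CALL(name, TP_ARGS(args));			\
		rcu_read_unlock_notrace();				\
	}								\
}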
Thanx, Paul
> o Thus, there is a desire to move tracepoints from using
> preempt_disable_notrace() to a new rcu_read_lock_notrace()
> whose required semantics I do not yet understand. However,
> from a first-principles RCU perspective, it must not unduly
> delay expedited grace periods and RCU priority deboosting.
> It also must not starve normal grace periods.
>
> o We clearly need to avoid destructive recursion in both tracepoints
> and in BPF.
>
> o Apparently, the interval across which tracepoint/BPF recursion is
> destructive extends beyond the proposed rcu_read_lock_notrace()
> critical section. (If so, how far beyond, and can the RCU reader
> be extended to cover the full extent? If not, why do we have
> a problem?)
>
> The definition of __DECLARE_TRACE looks to me like the RCU
> reader does in fact cover the full extent of the region in
> which (finite) recursion is destructive, at least given
> Joel's aforementioned IRQ-work patch:
>
> 2e154d164418 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
>
> Without that patch, yes, life is recursively hard. So recursively
> hard, in fact, that the recursion itself kills you before you
> have a chance to die.
>
> Except that, assuming that I understand this (ha!), we also need
> to prevent rcu_read_unlock_special() from directly invoking
> rcu_preempt_deferred_qs_irqrestore(). The usual PREEMPT_RT
> configuration won't ever invoke raise_softirq_irqoff(), but
> maybe other configurations will. But there are similar issues
> with irq_work_queue_on().
>
> We have some non-problems:
>
> o If rcu_read_unlock() or one of its friends is invoked with a
> scheduler lock held, then interrupts will be disabled, which
> will cause rcu_read_unlock_special() to defer its calls into
> the scheduler, for example, via IRQ work.
>
> o NMI safety is already provided.
>
> Have you guys been working with anyone on the BPF team? If so, I should
> reach out to that person, if not, I should find someone in BPF-land to
> reach out to. They might have some useful feedback.
>
> > [...]
> > > > >
> > > > > Joel's patch, which is currently slated for the upcoming merge window,
> > > > > should take care of the endless-IRQ-work recursion problem. So the
> > > > > main remaining issue is how rcu_read_unlock_special() should go about
> > > > > safely invoking raise_softirq_irqoff() and irq_work_queue_on() when in
> > > > > notrace mode.
> > > >
> > > > Are those two functions only needed from the outermost rcu_read_unlock
> > > > within a thread ? If yes, then keeping track of nesting level and
> > > > preventing those calls in nested contexts (per thread) should work.
> > >
> > > You lost me on this one.
> > >
> > > Yes, these are invoked only by the outermost rcu_read_unlock{,_notrace}().
> > > And we already have a nexting counter, current->rcu_read_lock_nesting.
> >
> > But AFAIU those are invoked after decrementing the nesting counter,
> > right ? So any instrumentation call done within those functions may
> > end up doing a read-side critical section again.
>
> They are indeed invoked after decrementing ->rcu_read_lock_nesting.
>
> > > However...
> > >
> > > If the tracing invokes an outermost rcu_read_unlock{,_notrace}(), then in
> > > some contexts we absolutely need to invoke the raise_softirq_irqoff()
> > > and irq_work_queue_on() functions, both of which are notrace functions.
> >
> > I guess you mean "both of which are non-notrace functions", otherwise
> > we would not be having this discussion.
> >
> > >
> > > Or are you telling me that it is OK for a rcu_read_unlock_notrace()
> > > to directly call these non-notrace functions?
> >
> > What I am getting at is that it may be OK for the outermost nesting
> > level of rcu_read_unlock_notrace() to call those non-notrace functions,
> > but only if we manage to keep track of that nesting level while those
> > non-notrace functions are called.
> >
> > So AFAIU one issue here is that when the non-notrace functions are
> > called, the nesting level is back to 0 already.
>
> So you are worried about something like this?
>
> rcu_read_unlock() -> rcu_read_unlock_special() ->
> rcu_preempt_deferred_qs_irqrestore() -> *tracepoint* ->
> rcu_read_unlock() -> rcu_read_unlock_special() ->
> rcu_preempt_deferred_qs_irqrestore() -> *tracepoint* ->
> rcu_read_unlock() -> rcu_read_unlock_special() ->
> rcu_preempt_deferred_qs_irqrestore() -> *tracepoint* ->
> rcu_read_unlock() -> rcu_read_unlock_special() ->
> rcu_preempt_deferred_qs_irqrestore() -> *tracepoint* ->
>
> And so on forever?
>
> Ditto for irq_work_queue_on()?
>
> > > > > > > > - Keep some nesting count in the task struct to prevent calling the
> > > > > > > > instrumentation when nested in notrace,
> > > > > > >
> > > > > > > OK, for this one, is the idea to invoke some TBD RCU API the tracing
> > > > > > > exits the notrace region? I could see that working. But there would
> > > > > > > need to be a guarantee that if the notrace API was invoked, a call to
> > > > > > > this TBD RCU API would follow in short order. And I suspect that
> > > > > > > preemption (and probably also interrupts) would need to be disabled
> > > > > > > across this region.
> > > > > >
> > > > > > No quite.
> > > > > >
> > > > > > What I have in mind is to try to find the most elegant way to prevent
> > > > > > endless recursion of the irq work issued immediately on
> > > > > > rcu_read_unlock_notrace without slowing down most fast paths, and
> > > > > > ideally without too much code duplication.
> > > > > >
> > > > > > I'm not entirely sure what would be the best approach though.
> > > > >
> > > > > Joel's patch adjusts use of the rcu_data structure's ->defer_qs_iw_pending
> > > > > flag, so that it is cleared not in the IRQ-work handler, but
> > > > > instead in rcu_preempt_deferred_qs_irqrestore(). That prevents
> > > > > rcu_read_unlock_special() from requeueing the IRQ-work handler until
> > > > > after the previous request for a quiescent state has been satisfied.
> > > > >
> > > > > So my main concern is again safely invoking raise_softirq_irqoff()
> > > > > and irq_work_queue_on() when in notrace mode.
> > > >
> > > > Would the nesting counter (per thread) approach suffice for your use
> > > > case ?
> > >
> > > Over and above the t->rcu_read_lock_nesting that we already use?
> > > As in only the outermost rcu_read_unlock{,_notrace}() will invoke
> > > rcu_read_unlock_special().
> > >
> > > OK, let's look at a couple of scenarios.
> > >
> > > First, suppose that we apply Joel's patch above, and someone sets a trace
> > > point in task context outside of any RCU read-side critical section.
> > > Suppose further that this task is preempted in the tracepoint's RCU
> > > read-side critical section, and that RCU priority boosting is applied.
> > >
> > > This trace point will invoke rcu_read_unlock{,_notrace}(), which
> > > will in turn invoke rcu_read_unlock_special(), which will in turn
> > > will note that preemption, interrupts, and softirqs are all enabled.
> > > It will therefore directly invoke rcu_preempt_deferred_qs_irqrestore(),
> > > a non-notrace function, which can in turn invoke all sorts of interesting
> > > functions involving locking, the scheduler, ...
> > >
> > > Is this OK, or should I set some sort of tracepoint recursion flag?
> >
> > Or somehow modify the semantic of t->rcu_read_lock_nesting if at all
> > possible. Rather than decrementing it first and then if 0 invoke
> > a rcu_read_unlock_special, it could perhaps invoke
> > rcu_read_unlock_special if the nesting counter is _about to be
> > decremented from 1 to 0_, and then decrement to 0. This would
> > hopefully prevent recursion.
> >
> > But I may be entirely misunderstanding the whole problem. If so,
> > please let me know!
> >
> > And if for some reason is really needs to be decremented before
> > calling rcu_read_unlock_special, then we can have the following:
> > when exiting the outermost critical section, it could be decremented
> > from 2 to 1, then call rcu_read_unlock_special, after which it's
> > decremented to 0. The outermost read lock increment would have to
> > be adapted accordingly. But this would add overhead on the fast-paths,
> > which may be frowned upon.
> >
> > The idea here is to keep tracking the fact that we are within the
> > execution of rcu_read_unlock_special, so it does not call it again
> > recursively, even though we are technically not nested within a
> > read-side critical section anymore.
>
> Heh! Some years back, rcu_read_unlock() used to do exactly that.
> This changed in 2020 with this commit:
>
> 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
>
> Quoting the commit log: "it is no longer necessary for __rcu_read_unlock()
> to set the nesting level negative."
>
> No longer necessary until now, perhaps?
>
> How to revert this puppy? Let's see...
>
> The addition of "depth" to rcu_exp_handler() can stay, but that last hunk
> needs to be restored. And the rcu_data structure's ->exp_deferred_qs
> might need to come back. Ah, but it is now ->cpu_no_qs.b.exp in that
> same structure. So should be fine. (Famous last words.)
>
> And rcu_dynticks_curr_cpu_in_eqs() is no longer with us. I believe that
> it is now named !rcu_is_watching_curr_cpu(), but Frederic would know best.
>
> On to kernel/rcu/tree_plugin.h...
>
> We need RCU_NEST_BIAS and RCU_NEST_NMAX back. Do we want to revert
> the change to __rcu_read_unlock() or just leave it alone (for the
> performance benefit, minuscule though it might be) and create an
> __rcu_read_unlock_notrace()? The former is simpler, especially
> from an rcutorture testing viewpoint. So the former it is, unless
> and until someone like Lai Jiangshan (already CCed) can show a
> strong need. This would require an rcu_read_unlock_notrace() and an
> __rcu_read_unlock_notrace(), with near-duplicate code. Not a disaster,
> but let's do it only if we really need it.
>
> So put rcu_preempt_read_exit() back the way it was, and ditto for
> __rcu_read_unlock(), rcu_preempt_need_deferred_qs(), and
> rcu_flavor_sched_clock_irq().
>
> Note that RCU_NEST_PMAX stayed and is still checked in __rcu_read_lock(),
> so that part remained.
>
> Give or take the inevitable bugs. Initial testing is in progress.
>
> The initial patch may be found at the end of this email, but it should
> be used for entertainment purposes only.
>
> > > Second, suppose that we apply Joel's patch above, and someone sets a trace
> > > point in task context outside of an RCU read-side critical section, but in
> > > an preemption-disabled region of code. Suppose further that this code is
> > > delayed, perhaps due to a flurry of interrupts, so that a scheduling-clock
> > > interrupt sets t->rcu_read_unlock_special.b.need_qs to true.
> > >
> > > This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> > > note that preemption is disabled. If rcutree.use_softirq is set and
> > > this task is blocking an expedited RCU grace period, it will directly
> > > invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> > > it will directly invoke the non-notrace function irq_work_queue_on().
> > >
> > > Is this OK, or should I set some sort of tracepoint recursion flag?
> >
> > Invoking instrumentation from the implementation of instrumentation
> > is a good recipe for endless recursion, so we'd need to check for
> > recursion somehow there as well AFAIU.
>
> Agreed, it could get ugly.
>
> > > There are other scenarios, but they require interrupts to be disabled
> > > across the rcu_read_unlock{,_notrace}(), but to have been enabled somewhere
> > > in the just-ended RCU read-side critical section. It does not look to
> > > me like tracing does this. But I might be missing something. If so,
> > > we have more scenarios to think through. ;-)
> >
> > I don't see a good use-case for that kind of scenario though. But I may
> > simply be lacking imagination.
>
> There was a time when there was an explicit rule against it, so yes,
> they have existed in the past. If they exist now, that is OK.
>
> > > > > > > > There are probably other possible approaches I am missing, each with
> > > > > > > > their respective trade offs.
> > > > > > >
> > > > > > > I am pretty sure that we also have some ways to go before we have the
> > > > > > > requirements fully laid out, for that matter. ;-)
> > > > > > >
> > > > > > > Could you please tell me where in the current tracing code these
> > > > > > > rcu_read_lock_notrace()/rcu_read_unlock_notrace() calls would be placed?
> > > > > >
> > > > > > AFAIU here:
> > > > > >
> > > > > > include/linux/tracepoint.h:
> > > > > >
> > > > > > #define __DECLARE_TRACE(name, proto, args, cond, data_proto) [...]
> > > > > >
> > > > > > static inline void __do_trace_##name(proto) \
> > > > > > { \
> > > > > > if (cond) { \
> > > > > > guard(preempt_notrace)(); \
> > > > > > __DO_TRACE_CALL(name, TP_ARGS(args)); \
> > > > > > } \
> > > > > > } \
> > > > > > static inline void trace_##name(proto) \
> > > > > > { \
> > > > > > if (static_branch_unlikely(&__tracepoint_##name.key)) \
> > > > > > __do_trace_##name(args); \
> > > > > > if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
> > > > > > WARN_ONCE(!rcu_is_watching(), \
> > > > > > "RCU not watching for tracepoint"); \
> > > > > > } \
> > > > > > }
> > > > > >
> > > > > > and
> > > > > >
> > > > > > #define __DECLARE_TRACE_SYSCALL(name, proto, args, data_proto) [...]
> > > > > >
> > > > > > static inline void __do_trace_##name(proto) \
> > > > > > { \
> > > > > > guard(rcu_tasks_trace)(); \
> > > > > > __DO_TRACE_CALL(name, TP_ARGS(args)); \
> > > > > > } \
> > > > > > static inline void trace_##name(proto) \
> > > > > > { \
> > > > > > might_fault(); \
> > > > > > if (static_branch_unlikely(&__tracepoint_##name.key)) \
> > > > > > __do_trace_##name(args); \
> > > > > > if (IS_ENABLED(CONFIG_LOCKDEP)) { \
> > > > > > WARN_ONCE(!rcu_is_watching(), \
> > > > > > "RCU not watching for tracepoint"); \
> > > > > > } \
> > > > > > }
> > > > >
> > > > > I am not seeing a guard(rcu)() in here, only guard(preempt_notrace)()
> > > > > and guard(rcu_tasks_trace)(). Or is the idea to move the first to
> > > > > guard(rcu_notrace)() in order to improve PREEMPT_RT latency?
> > > >
> > > > AFAIU the goal here is to turn the guard(preempt_notrace)() into a
> > > > guard(rcu_notrace)() because the preempt-off critical sections don't
> > > > agree with BPF.
> > >
> > > OK, got it, thank you!
> > >
> > > The combination of BPF and CONFIG_PREEMPT_RT certainly has provided at
> > > least its share of entertainment, that is for sure. ;-)
> >
> > There is indeed no shortage of entertainment when combining those rather
> > distinct sets of requirements. :)
>
> I am sure that Mr. Murphy has more entertainment in store for all of us.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 076ad61e42f4a..33bed40f2b024 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -805,8 +805,32 @@ static void rcu_exp_handler(void *unused)
> return;
> }
>
> - // Fourth and finally, negative nesting depth should not happen.
> - WARN_ON_ONCE(1);
> + /*
> + * The final and least likely case is where the interrupted
> + * code was just about to or just finished exiting the
> + * RCU-preempt read-side critical section when using
> + * rcu_read_unlock_notrace(), and no, we can't tell which.
> + * So either way, set ->cpu_no_qs.b.exp to flag later code that
> + * a quiescent state is required.
> + *
> + * If the CPU has preemption and softirq enabled (or if some
> + * buggy no-trace RCU-preempt read-side critical section is
> + * being used from idle), just invoke rcu_preempt_deferred_qs()
> + * to immediately report the quiescent state. We cannot use
> + * rcu_read_unlock_special() because we are in an interrupt handler,
> + * which will cause that function to take an early exit without
> + * doing anything.
> + *
> + * Otherwise, force a context switch after the CPU enables everything.
> + */
> + rdp->cpu_no_qs.b.exp = true;
> + if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
> + WARN_ON_ONCE(!rcu_is_watching_curr_cpu())) {
> + rcu_preempt_deferred_qs(t);
> + } else {
> + set_tsk_need_resched(t);
> + set_preempt_need_resched();
> + }
> }
>
> /*
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 1ee0d34ec333a..4becfe51e0e14 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -383,6 +383,9 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
> return READ_ONCE(rnp->gp_tasks) != NULL;
> }
>
> +/* Bias and limit values for ->rcu_read_lock_nesting. */
> +#define RCU_NEST_BIAS INT_MAX
> +#define RCU_NEST_NMAX (-INT_MAX / 2)
> /* limit value for ->rcu_read_lock_nesting. */
> #define RCU_NEST_PMAX (INT_MAX / 2)
>
> @@ -391,12 +394,10 @@ static void rcu_preempt_read_enter(void)
> WRITE_ONCE(current->rcu_read_lock_nesting, READ_ONCE(current->rcu_read_lock_nesting) + 1);
> }
>
> -static int rcu_preempt_read_exit(void)
> +static void rcu_preempt_read_exit(void)
> {
> - int ret = READ_ONCE(current->rcu_read_lock_nesting) - 1;
>
> - WRITE_ONCE(current->rcu_read_lock_nesting, ret);
> - return ret;
> + WRITE_ONCE(current->rcu_read_lock_nesting, READ_ONCE(current->rcu_read_lock_nesting) - 1);
> }
>
> static void rcu_preempt_depth_set(int val)
> @@ -431,16 +432,22 @@ void __rcu_read_unlock(void)
> {
> struct task_struct *t = current;
>
> - barrier(); // critical section before exit code.
> - if (rcu_preempt_read_exit() == 0) {
> - barrier(); // critical-section exit before .s check.
> + if (rcu_preempt_depth() != 1) {
> + rcu_preempt_read_exit();
> + } else {
> + barrier(); // critical section before exit code.
> + rcu_preempt_depth_set(-RCU_NEST_BIAS);
> + barrier(); // critical section before ->rcu_read_unlock_special load
> if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> rcu_read_unlock_special(t);
> + barrier(); // ->rcu_read_unlock_special load before assignment
> + rcu_preempt_depth_set(0);
> }
> if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> int rrln = rcu_preempt_depth();
>
> - WARN_ON_ONCE(rrln < 0 || rrln > RCU_NEST_PMAX);
> + WARN_ON_ONCE(rrln < 0 && rrln > RCU_NEST_NMAX);
> + WARN_ON_ONCE(rrln > RCU_NEST_PMAX);
> }
> }
> EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> @@ -601,7 +608,7 @@ static notrace bool rcu_preempt_need_deferred_qs(struct task_struct *t)
> {
> return (__this_cpu_read(rcu_data.cpu_no_qs.b.exp) ||
> READ_ONCE(t->rcu_read_unlock_special.s)) &&
> - rcu_preempt_depth() == 0;
> + rcu_preempt_depth() <= 0;
> }
>
> /*
> @@ -755,8 +762,8 @@ static void rcu_flavor_sched_clock_irq(int user)
> } else if (rcu_preempt_need_deferred_qs(t)) {
> rcu_preempt_deferred_qs(t); /* Report deferred QS. */
> return;
> - } else if (!WARN_ON_ONCE(rcu_preempt_depth())) {
> - rcu_qs(); /* Report immediate QS. */
> + } else if (!rcu_preempt_depth()) {
> + rcu_qs(); /* On our way out anyway, so report immediate QS. */
> return;
> }
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-16 0:42 ` Paul E. McKenney
@ 2025-07-16 4:41 ` Paul E. McKenney
0 siblings, 0 replies; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-16 4:41 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Steven Rostedt, Thomas Gleixner, Uladzislau Rezki, Zqiang
On Tue, Jul 15, 2025 at 05:42:06PM -0700, Paul E. McKenney wrote:
> On Tue, Jul 15, 2025 at 04:18:45PM -0700, Paul E. McKenney wrote:
> > On Tue, Jul 15, 2025 at 03:54:02PM -0400, Mathieu Desnoyers wrote:
> > > On 2025-07-11 13:05, Paul E. McKenney wrote:
> > > > On Fri, Jul 11, 2025 at 09:46:25AM -0400, Mathieu Desnoyers wrote:
> > > > > On 2025-07-09 14:33, Paul E. McKenney wrote:
> > > > > > On Wed, Jul 09, 2025 at 10:31:14AM -0400, Mathieu Desnoyers wrote:
And as an alternative to messing up Lai Jiangshan's preemptible-RCU
optimization work, why don't I instead provide you with an
srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace()?
These are faster than preemptible RCU on my laptop because they don't
need read-side smp_mb(), plus they avoid the array-indexing operations
used by srcu_read_lock() and srcu_read_unlock().
Much simpler patch, and if I recall correctly, __DECLARE_TRACE() used
to use srcu_read_lock() and srcu_read_unlock() anyway.
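As a sketch of how that might look at the tracepoint site, assuming the
proposed srcu_read_lock_fast_notrace()/srcu_read_unlock_fast_notrace()
mirror the existing cookie-returning srcu_read_lock_fast()/
srcu_read_unlock_fast() API, and assuming a per-tracepoint srcu_struct
named tracepoint_srcu purely for illustration:

/* Hypothetical: the _notrace variants and tracepoint_srcu are assumptions. */
static inline void __do_trace_##name(proto)				\
{									\
	struct srcu_ctr __percpu *scp;					\
									\
	if (cond) {							\
		scp = srcu_read_lock_fast_notrace(&tracepoint_srcu);	\
		__DO_TRACE_CALL(name, TP_ARGS(args));			\
		srcu_read_unlock_fast_notrace(&tracepoint_srcu, scp);	\
	}								\
}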
I am testing srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace()
overnight, and with luck will post them tomorrow, Pacific Time.
Thanx, Paul
> > First, do we have a well-defined repeat-by to reproduce this issue?
> >
> > Failing that, do we have a well-defined problem statement? This is
> > all I have thus far:
> >
> > o Disabling preemption at tracepoints causes problems for BPF.
> > (I have not yet checked this with the BPF folks, but last I knew
> > it was OK to attach BPF programs to preempt-disabled regions
> > of code, but perhaps this issue is specific to PREEMPT_RT.
> > Perhaps involving memory allocation.)
>
> However, I did run the following BPF program on an ARM server:
>
> bpftrace -e 'kfunc:rcu_note_context_switch { @func = count(); }'
>
> This worked just fine, despite the fact that rcu_note_context_switch()
> is invoked not just with preemption disabled, but also with interrupts
> disabled. This is admittedly not a CONFIG_PREEMPT_RT=y kernel, but it
> certainly supports my belief that BPF programs are intended to be able
> to be attached to preempt-disabled code.
>
> So please help me out here. Exactly what breaks due to that
> guard(preempt_notrace)() in __DECLARE_TRACE()?
>
> (My guess? BPF programs are required to be preemptible in kernels built
> with CONFIG_PREEMPT_RT=y, and the bit about attaching BPF programs to
> non-preemptible code didn't come up. Hey, it sounds like something
> I might do...)
>
> Thanx, Paul
>
> > o Thus, there is a desire to move tracepoints from using
> > preempt_disable_notrace() to a new rcu_read_lock_notrace()
> > whose required semantics I do not yet understand. However,
> > from a first-principles RCU perspective, it must not unduly
> > delay expedited grace periods and RCU priority deboosting.
> > It also must not starve normal grace periods.
> >
> > o We clearly need to avoid destructive recursion in both tracepoints
> > and in BPF.
> >
> > o Apparently, the interval across which tracepoint/BPF recursion is
> > destructive extends beyond the proposed rcu_read_lock_notrace()
> > critical section. (If so, how far beyond, and can the RCU reader
> > be extended to cover the full extent? If not, why do we have
> > a problem?)
> >
> > The definition of __DECLARE_TRACE looks to me like the RCU
> > reader does in fact cover the full extent of the region in
> > which (finite) recursion is destructive, at least given
> > Joel's aforementioned IRQ-work patch:
> >
> > 2e154d164418 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
> >
> > Without that patch, yes, life is recursively hard. So recursively
> > hard, in fact, that the recursion itself kills you before you
> > have a chance to die.
> >
> > Except that, assuming that I understand this (ha!), we also need
> > to prevent rcu_read_unlock_special() from directly invoking
> > rcu_preempt_deferred_qs_irqrestore(). The usual PREEMPT_RT
> > configuration won't ever invoke raise_softirq_irqoff(), but
> > maybe other configurations will. But there are similar issues
> > with irq_work_queue_on().
> >
> > We have some non-problems:
> >
> > o If rcu_read_unlock() or one of its friends is invoked with a
> > scheduler lock held, then interrupts will be disabled, which
> > will cause rcu_read_unlock_special() to defer its calls into
> > the scheduler, for example, via IRQ work.
> >
> > o NMI safety is already provided.
> >
> > Have you guys been working with anyone on the BPF team? If so, I should
> > reach out to that person, if not, I should find someone in BPF-land to
> > reach out to. They might have some useful feedback.
> >
> > > [...]
> > > > > >
> > > > > > Joel's patch, which is currently slated for the upcoming merge window,
> > > > > > should take care of the endless-IRQ-work recursion problem. So the
> > > > > > main remaining issue is how rcu_read_unlock_special() should go about
> > > > > > safely invoking raise_softirq_irqoff() and irq_work_queue_on() when in
> > > > > > notrace mode.
> > > > >
> > > > > Are those two functions only needed from the outermost rcu_read_unlock
> > > > > within a thread ? If yes, then keeping track of nesting level and
> > > > > preventing those calls in nested contexts (per thread) should work.
> > > >
> > > > You lost me on this one.
> > > >
> > > > Yes, these are invoked only by the outermost rcu_read_unlock{,_notrace}().
> > > > And we already have a nexting counter, current->rcu_read_lock_nesting.
> > >
> > > But AFAIU those are invoked after decrementing the nesting counter,
> > > right ? So any instrumentation call done within those functions may
> > > end up doing a read-side critical section again.
> >
> > They are indeed invoked after decrementing ->rcu_read_lock_nesting.
> >
> > > > However...
> > > >
> > > > If the tracing invokes an outermost rcu_read_unlock{,_notrace}(), then in
> > > > some contexts we absolutely need to invoke the raise_softirq_irqoff()
> > > > and irq_work_queue_on() functions, both of which are notrace functions.
> > >
> > > I guess you mean "both of which are non-notrace functions", otherwise
> > > we would not be having this discussion.
> > >
> > > >
> > > > Or are you telling me that it is OK for a rcu_read_unlock_notrace()
> > > > to directly call these non-notrace functions?
> > >
> > > What I am getting at is that it may be OK for the outermost nesting
> > > level of rcu_read_unlock_notrace() to call those non-notrace functions,
> > > but only if we manage to keep track of that nesting level while those
> > > non-notrace functions are called.
> > >
> > > So AFAIU one issue here is that when the non-notrace functions are
> > > called, the nesting level is back to 0 already.
> >
> > So you are worried about something like this?
> >
> > rcu_read_unlock() -> rcu_read_unlock_special() ->
> > rcu_preempt_deferred_qs_irqrestore() -> *tracepoint* ->
> > rcu_read_unlock() -> rcu_read_unlock_special() ->
> > rcu_preempt_deferred_qs_irqrestore() -> *tracepoint* ->
> > rcu_read_unlock() -> rcu_read_unlock_special() ->
> > rcu_preempt_deferred_qs_irqrestore() -> *tracepoint* ->
> > rcu_read_unlock() -> rcu_read_unlock_special() ->
> > rcu_preempt_deferred_qs_irqrestore() -> *tracepoint* ->
> >
> > And so on forever?
> >
> > Ditto for irq_work_queue_on()?
> >
> > > > > > > > > - Keep some nesting count in the task struct to prevent calling the
> > > > > > > > > instrumentation when nested in notrace,
> > > > > > > >
> > > > > > > > OK, for this one, is the idea to invoke some TBD RCU API the tracing
> > > > > > > > exits the notrace region? I could see that working. But there would
> > > > > > > > need to be a guarantee that if the notrace API was invoked, a call to
> > > > > > > > this TBD RCU API would follow in short order. And I suspect that
> > > > > > > > preemption (and probably also interrupts) would need to be disabled
> > > > > > > > across this region.
> > > > > > >
> > > > > > > No quite.
> > > > > > >
> > > > > > > What I have in mind is to try to find the most elegant way to prevent
> > > > > > > endless recursion of the irq work issued immediately on
> > > > > > > rcu_read_unlock_notrace without slowing down most fast paths, and
> > > > > > > ideally without too much code duplication.
> > > > > > >
> > > > > > > I'm not entirely sure what would be the best approach though.
> > > > > >
> > > > > > Joel's patch adjusts use of the rcu_data structure's ->defer_qs_iw_pending
> > > > > > flag, so that it is cleared not in the IRQ-work handler, but
> > > > > > instead in rcu_preempt_deferred_qs_irqrestore(). That prevents
> > > > > > rcu_read_unlock_special() from requeueing the IRQ-work handler until
> > > > > > after the previous request for a quiescent state has been satisfied.
> > > > > >
> > > > > > So my main concern is again safely invoking raise_softirq_irqoff()
> > > > > > and irq_work_queue_on() when in notrace mode.
> > > > >
> > > > > Would the nesting counter (per thread) approach suffice for your use
> > > > > case ?
> > > >
> > > > Over and above the t->rcu_read_lock_nesting that we already use?
> > > > As in only the outermost rcu_read_unlock{,_notrace}() will invoke
> > > > rcu_read_unlock_special().
> > > >
> > > > OK, let's look at a couple of scenarios.
> > > >
> > > > First, suppose that we apply Joel's patch above, and someone sets a trace
> > > > point in task context outside of any RCU read-side critical section.
> > > > Suppose further that this task is preempted in the tracepoint's RCU
> > > > read-side critical section, and that RCU priority boosting is applied.
> > > >
> > > > This trace point will invoke rcu_read_unlock{,_notrace}(), which
> > > > will in turn invoke rcu_read_unlock_special(), which will in turn
> > > > will note that preemption, interrupts, and softirqs are all enabled.
> > > > It will therefore directly invoke rcu_preempt_deferred_qs_irqrestore(),
> > > > a non-notrace function, which can in turn invoke all sorts of interesting
> > > > functions involving locking, the scheduler, ...
> > > >
> > > > Is this OK, or should I set some sort of tracepoint recursion flag?
> > >
> > > Or somehow modify the semantic of t->rcu_read_lock_nesting if at all
> > > possible. Rather than decrementing it first and then if 0 invoke
> > > a rcu_read_unlock_special, it could perhaps invoke
> > > rcu_read_unlock_special if the nesting counter is _about to be
> > > decremented from 1 to 0_, and then decrement to 0. This would
> > > hopefully prevent recursion.
> > >
> > > But I may be entirely misunderstanding the whole problem. If so,
> > > please let me know!
> > >
> > > And if for some reason is really needs to be decremented before
> > > calling rcu_read_unlock_special, then we can have the following:
> > > when exiting the outermost critical section, it could be decremented
> > > from 2 to 1, then call rcu_read_unlock_special, after which it's
> > > decremented to 0. The outermost read lock increment would have to
> > > be adapted accordingly. But this would add overhead on the fast-paths,
> > > which may be frowned upon.
> > >
> > > The idea here is to keep tracking the fact that we are within the
> > > execution of rcu_read_unlock_special, so it does not call it again
> > > recursively, even though we are technically not nested within a
> > > read-side critical section anymore.
> >
> > Heh! Some years back, rcu_read_unlock() used to do exactly that.
> > This changed in 2020 with this commit:
> >
> > 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
> >
> > Quoting the commit log: "it is no longer necessary for __rcu_read_unlock()
> > to set the nesting level negative."
> >
> > No longer necessary until now, perhaps?
> >
> > How to revert this puppy? Let's see...
> >
> > The addition of "depth" to rcu_exp_handler() can stay, but that last hunk
> > needs to be restored. And the rcu_data structure's ->exp_deferred_qs
> > might need to come back. Ah, but it is now ->cpu_no_qs.b.exp in that
> > same structure. So should be fine. (Famous last words.)
> >
> > And rcu_dynticks_curr_cpu_in_eqs() is no longer with us. I believe that
> > it is now named !rcu_is_watching_curr_cpu(), but Frederic would know best.
> >
> > On to kernel/rcu/tree_plugin.h...
> >
> > We need RCU_NEST_BIAS and RCU_NEST_NMAX back. Do we want to revert
> > the change to __rcu_read_unlock() or just leave it alone (for the
> > performance benefit, minuscule though it might be) and create an
> > __rcu_read_unlock_notrace()? The former is simpler, especially
> > from an rcutorture testing viewpoint. So the former it is, unless
> > and until someone like Lai Jiangshan (already CCed) can show a
> > strong need. This would require an rcu_read_unlock_notrace() and an
> > __rcu_read_unlock_notrace(), with near-duplicate code. Not a disaster,
> > but let's do it only if we really need it.
> >
> > So put rcu_preempt_read_exit() back the way it was, and ditto for
> > __rcu_read_unlock(), rcu_preempt_need_deferred_qs(), and
> > rcu_flavor_sched_clock_irq().
> >
> > Note that RCU_NEST_PMAX stayed and is still checked in __rcu_read_lock(),
> > so that part remained.
> >
> > Give or take the inevitable bugs. Initial testing is in progress.
> >
> > The initial patch may be found at the end of this email, but it should
> > be used for entertainment purposes only.
> >
> > > > Second, suppose that we apply Joel's patch above, and someone sets a trace
> > > > point in task context outside of an RCU read-side critical section, but in
> > > > an preemption-disabled region of code. Suppose further that this code is
> > > > delayed, perhaps due to a flurry of interrupts, so that a scheduling-clock
> > > > interrupt sets t->rcu_read_unlock_special.b.need_qs to true.
> > > >
> > > > This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> > > > note that preemption is disabled. If rcutree.use_softirq is set and
> > > > this task is blocking an expedited RCU grace period, it will directly
> > > > invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> > > > it will directly invoke the non-notrace function irq_work_queue_on().
> > > >
> > > > Is this OK, or should I set some sort of tracepoint recursion flag?
> > >
> > > Invoking instrumentation from the implementation of instrumentation
> > > is a good recipe for endless recursion, so we'd need to check for
> > > recursion somehow there as well AFAIU.
> >
> > Agreed, it could get ugly.
> >
> > > > There are other scenarios, but they require interrupts to be disabled
> > > > across the rcu_read_unlock{,_notrace}(), but to have been enabled somewhere
> > > > in the just-ended RCU read-side critical section. It does not look to
> > > > me like tracing does this. But I might be missing something. If so,
> > > > we have more scenarios to think through. ;-)
> > >
> > > I don't see a good use-case for that kind of scenario though. But I may
> > > simply be lacking imagination.
> >
> > There was a time when there was an explicit rule against it, so yes,
> > they have existed in the past. If they exist now, that is OK.
> >
> > > > > > > > > There are probably other possible approaches I am missing, each with
> > > > > > > > > their respective trade offs.
> > > > > > > >
> > > > > > > > I am pretty sure that we also have some ways to go before we have the
> > > > > > > > requirements fully laid out, for that matter. ;-)
> > > > > > > >
> > > > > > > > Could you please tell me where in the current tracing code these
> > > > > > > > rcu_read_lock_notrace()/rcu_read_unlock_notrace() calls would be placed?
> > > > > > >
> > > > > > > AFAIU here:
> > > > > > >
> > > > > > > include/linux/tracepoint.h:
> > > > > > >
> > > > > > > #define __DECLARE_TRACE(name, proto, args, cond, data_proto) [...]
> > > > > > >
> > > > > > > static inline void __do_trace_##name(proto) \
> > > > > > > { \
> > > > > > > if (cond) { \
> > > > > > > guard(preempt_notrace)(); \
> > > > > > > __DO_TRACE_CALL(name, TP_ARGS(args)); \
> > > > > > > } \
> > > > > > > } \
> > > > > > > static inline void trace_##name(proto) \
> > > > > > > { \
> > > > > > > if (static_branch_unlikely(&__tracepoint_##name.key)) \
> > > > > > > __do_trace_##name(args); \
> > > > > > > if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
> > > > > > > WARN_ONCE(!rcu_is_watching(), \
> > > > > > > "RCU not watching for tracepoint"); \
> > > > > > > } \
> > > > > > > }
> > > > > > >
> > > > > > > and
> > > > > > >
> > > > > > > #define __DECLARE_TRACE_SYSCALL(name, proto, args, data_proto) [...]
> > > > > > >
> > > > > > > static inline void __do_trace_##name(proto) \
> > > > > > > { \
> > > > > > > guard(rcu_tasks_trace)(); \
> > > > > > > __DO_TRACE_CALL(name, TP_ARGS(args)); \
> > > > > > > } \
> > > > > > > static inline void trace_##name(proto) \
> > > > > > > { \
> > > > > > > might_fault(); \
> > > > > > > if (static_branch_unlikely(&__tracepoint_##name.key)) \
> > > > > > > __do_trace_##name(args); \
> > > > > > > if (IS_ENABLED(CONFIG_LOCKDEP)) { \
> > > > > > > WARN_ONCE(!rcu_is_watching(), \
> > > > > > > "RCU not watching for tracepoint"); \
> > > > > > > } \
> > > > > > > }
> > > > > >
> > > > > > I am not seeing a guard(rcu)() in here, only guard(preempt_notrace)()
> > > > > > and guard(rcu_tasks_trace)(). Or is the idea to move the first to
> > > > > > guard(rcu_notrace)() in order to improve PREEMPT_RT latency?
> > > > >
> > > > > AFAIU the goal here is to turn the guard(preempt_notrace)() into a
> > > > > guard(rcu_notrace)() because the preempt-off critical sections don't
> > > > > agree with BPF.
> > > >
> > > > OK, got it, thank you!
> > > >
> > > > The combination of BPF and CONFIG_PREEMPT_RT certainly has provided at
> > > > least its share of entertainment, that is for sure. ;-)
> > >
> > > There is indeed no shortage of entertainment when combining those rather
> > > distinct sets of requirements. :)
> >
> > I am sure that Mr. Murphy has more entertainment in store for all of us.
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index 076ad61e42f4a..33bed40f2b024 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -805,8 +805,32 @@ static void rcu_exp_handler(void *unused)
> > return;
> > }
> >
> > - // Fourth and finally, negative nesting depth should not happen.
> > - WARN_ON_ONCE(1);
> > + /*
> > + * The final and least likely case is where the interrupted
> > + * code was just about to or just finished exiting the
> > + * RCU-preempt read-side critical section when using
> > + * rcu_read_unlock_notrace(), and no, we can't tell which.
> > + * So either way, set ->cpu_no_qs.b.exp to flag later code that
> > + * a quiescent state is required.
> > + *
> > + * If the CPU has preemption and softirq enabled (or if some
> > + * buggy no-trace RCU-preempt read-side critical section is
> > + * being used from idle), just invoke rcu_preempt_deferred_qs()
> > + * to immediately report the quiescent state. We cannot use
> > + * rcu_read_unlock_special() because we are in an interrupt handler,
> > + * which will cause that function to take an early exit without
> > + * doing anything.
> > + *
> > + * Otherwise, force a context switch after the CPU enables everything.
> > + */
> > + rdp->cpu_no_qs.b.exp = true;
> > + if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
> > + WARN_ON_ONCE(!rcu_is_watching_curr_cpu())) {
> > + rcu_preempt_deferred_qs(t);
> > + } else {
> > + set_tsk_need_resched(t);
> > + set_preempt_need_resched();
> > + }
> > }
> >
> > /*
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 1ee0d34ec333a..4becfe51e0e14 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -383,6 +383,9 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
> > return READ_ONCE(rnp->gp_tasks) != NULL;
> > }
> >
> > +/* Bias and limit values for ->rcu_read_lock_nesting. */
> > +#define RCU_NEST_BIAS INT_MAX
> > +#define RCU_NEST_NMAX (-INT_MAX / 2)
> > /* limit value for ->rcu_read_lock_nesting. */
> > #define RCU_NEST_PMAX (INT_MAX / 2)
> >
> > @@ -391,12 +394,10 @@ static void rcu_preempt_read_enter(void)
> > WRITE_ONCE(current->rcu_read_lock_nesting, READ_ONCE(current->rcu_read_lock_nesting) + 1);
> > }
> >
> > -static int rcu_preempt_read_exit(void)
> > +static void rcu_preempt_read_exit(void)
> > {
> > - int ret = READ_ONCE(current->rcu_read_lock_nesting) - 1;
> >
> > - WRITE_ONCE(current->rcu_read_lock_nesting, ret);
> > - return ret;
> > + WRITE_ONCE(current->rcu_read_lock_nesting, READ_ONCE(current->rcu_read_lock_nesting) - 1);
> > }
> >
> > static void rcu_preempt_depth_set(int val)
> > @@ -431,16 +432,22 @@ void __rcu_read_unlock(void)
> > {
> > struct task_struct *t = current;
> >
> > - barrier(); // critical section before exit code.
> > - if (rcu_preempt_read_exit() == 0) {
> > - barrier(); // critical-section exit before .s check.
> > + if (rcu_preempt_depth() != 1) {
> > + rcu_preempt_read_exit();
> > + } else {
> > + barrier(); // critical section before exit code.
> > + rcu_preempt_depth_set(-RCU_NEST_BIAS);
> > + barrier(); // critical section before ->rcu_read_unlock_special load
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > + barrier(); // ->rcu_read_unlock_special load before assignment
> > + rcu_preempt_depth_set(0);
> > }
> > if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> > int rrln = rcu_preempt_depth();
> >
> > - WARN_ON_ONCE(rrln < 0 || rrln > RCU_NEST_PMAX);
> > + WARN_ON_ONCE(rrln < 0 && rrln > RCU_NEST_NMAX);
> > + WARN_ON_ONCE(rrln > RCU_NEST_PMAX);
> > }
> > }
> > EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> > @@ -601,7 +608,7 @@ static notrace bool rcu_preempt_need_deferred_qs(struct task_struct *t)
> > {
> > return (__this_cpu_read(rcu_data.cpu_no_qs.b.exp) ||
> > READ_ONCE(t->rcu_read_unlock_special.s)) &&
> > - rcu_preempt_depth() == 0;
> > + rcu_preempt_depth() <= 0;
> > }
> >
> > /*
> > @@ -755,8 +762,8 @@ static void rcu_flavor_sched_clock_irq(int user)
> > } else if (rcu_preempt_need_deferred_qs(t)) {
> > rcu_preempt_deferred_qs(t); /* Report deferred QS. */
> > return;
> > - } else if (!WARN_ON_ONCE(rcu_preempt_depth())) {
> > - rcu_qs(); /* Report immediate QS. */
> > + } else if (!rcu_preempt_depth()) {
> > + rcu_qs(); /* On our way out anyway, so report immediate QS. */
> > return;
> > }
> >
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-11 17:05 ` Paul E. McKenney
2025-07-14 16:34 ` Paul E. McKenney
2025-07-15 19:54 ` Mathieu Desnoyers
@ 2025-07-16 15:09 ` Steven Rostedt
2025-07-16 20:35 ` Paul E. McKenney
2 siblings, 1 reply; 49+ messages in thread
From: Steven Rostedt @ 2025-07-16 15:09 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang
On Fri, 11 Jul 2025 10:05:26 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:
> This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> note that preemption is disabled. If rcutree.use_softirq is set and
> this task is blocking an expedited RCU grace period, it will directly
> invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> it will directly invoke the non-notrace function irq_work_queue_on().
Just to clarify some things: a function annotated with "notrace" simply
will not have the ftrace hook attached to that function, but that function
may very well have tracing triggered inside of it.
Functions with "_notrace" in their names (like preempt_disable_notrace())
should not have any tracing instrumentation (as Mathieu stated)
inside of them, so that they can be used in the tracing infrastructure.
raise_softirq_irqoff() has a tracepoint inside of it. If we have the
tracing infrastructure call that, and we happen to enable that
tracepoint, we will have:
raise_softirq_irqoff()
trace_softirq_raise()
[..]
raise_softirq_irqoff()
trace_softirq_raise()
[..]
Ad infinitum!
I'm not sure if that's what is being proposed or not, but I just wanted
to make sure everyone is aware of the above.
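To make the recursion concrete, here is a toy userspace model; the names are
reused only for illustration, and this is not the kernel's actual recursion
protection:

#include <stdio.h>

static int softirq_trace_enabled = 1;  /* model: the tracepoint is enabled */
static int trace_recursion;            /* model: per-context recursion flag */

static void raise_softirq_irqoff(void);

/* Model of the attached callback: it ends up raising a softirq again. */
static void trace_softirq_raise(void)
{
	raise_softirq_irqoff();
}

static void raise_softirq_irqoff(void)
{
	printf("raise_softirq_irqoff()\n");
	/* Without the trace_recursion check, this recurses without bound. */
	if (softirq_trace_enabled && !trace_recursion) {
		trace_recursion = 1;
		trace_softirq_raise();
		trace_recursion = 0;
	}
}

int main(void)
{
	raise_softirq_irqoff();	/* one level of nesting, then it stops */
	return 0;
}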
-- Steve
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-16 15:09 ` Steven Rostedt
@ 2025-07-16 20:35 ` Paul E. McKenney
2025-07-16 22:54 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-16 20:35 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
> On Fri, 11 Jul 2025 10:05:26 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> > note that preemption is disabled. If rcutree.use_softirq is set and
> > this task is blocking an expedited RCU grace period, it will directly
> > invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> > it will directly invoke the non-notrace function irq_work_queue_on().
>
> Just to clarify some things; A function annotated by "notrace" simply
> will not have the ftrace hook to that function, but that function may
> very well have tracing triggered inside of it.
>
> Functions with "_notrace" in its name (like preempt_disable_notrace())
> should not have any tracing instrumentation (as Mathieu stated)
> inside of it, so that it can be used in the tracing infrastructure.
>
> raise_softirq_irqoff() has a tracepoint inside of it. If we have the
> tracing infrastructure call that, and we happen to enable that
> tracepoint, we will have:
>
> raise_softirq_irqoff()
> trace_softirq_raise()
> [..]
> raise_softirq_irqoff()
> trace_softirq_raise()
> [..]
> Ad infinitum!
>
> I'm not sure if that's what is being proposed or not, but I just wanted
> to make sure everyone is aware of the above.
OK, I *think* I might actually understand the problem. Maybe.
I am sure that the usual suspects will not be shy about correcting any
misapprehensions in the following. ;-)
My guess is that some users of real-time Linux would like to use BPF
programs while still getting decent latencies out of their systems.
(Not something I would have predicted, but then again, I was surprised
some years back to see people with a 4096-CPU system complaining about
200-microsecond latency blows from RCU.) And the BPF guys (now CCed)
made some changes some years back to support this, perhaps most notably
replacing some uses of preempt_disable() with migrate_disable().
Except that the current __DECLARE_TRACE() macro defeats this work
for tracepoints by disabling preemption across the tracepoint call,
which might well be a BPF program. So we need to do something to
__DECLARE_TRACE() to get the right sort of protection while still leaving
preemption enabled.
One way of attacking this problem is to use preemptible RCU. The problem
with this is that although one could construct a trace-safe version
of rcu_read_unlock(), these would negate some optimizations that Lai
Jiangshan worked so hard to put in place. Plus those optimizations
also simplified the code quite a bit. Which is why I was pushing back
so hard, especially given that I did not realize that real-time systems
would be running BPF programs concurrently with real-time applications.
This meant that I was looking for a functional problem with the current
disabling of preemption, and not finding it.
So another way of dealing with this is to use SRCU-fast, which is
like SRCU, but dispenses with the smp_mb() calls and the redundant
read-side array indexing. Plus it is easy to make _notrace variants
srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
along with the requisite guards.
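For concreteness, here is a rough sketch of what such _notrace wrappers could
look like; the internal helper names and the exact fast-reader signatures
below are assumptions for illustration, not the posted patches:

/* Sketch only: assumes the fast readers hand back a per-CPU cookie. */
static inline notrace struct srcu_ctr __percpu *
srcu_read_lock_fast_notrace(struct srcu_struct *ssp)
{
	/* Like srcu_read_lock_fast(), but without lockdep/tracing hooks. */
	return __srcu_read_lock_fast(ssp);
}

static inline notrace void
srcu_read_unlock_fast_notrace(struct srcu_struct *ssp,
			      struct srcu_ctr __percpu *scp)
{
	__srcu_read_unlock_fast(ssp, scp);
}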
Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
Remove SRCU protection"), and I have hacked together this and the
prerequisites mentioned in the previous paragraph.
These are passing ridiculously light testing, but probably have at
least their share of bugs.
But first, do I actually finally understand the problem?
Thanx, Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-16 20:35 ` Paul E. McKenney
@ 2025-07-16 22:54 ` Paul E. McKenney
2025-07-17 13:14 ` Mathieu Desnoyers
` (2 more replies)
0 siblings, 3 replies; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-16 22:54 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
> > On Fri, 11 Jul 2025 10:05:26 -0700
> > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >
> > > This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> > > note that preemption is disabled. If rcutree.use_softirq is set and
> > > this task is blocking an expedited RCU grace period, it will directly
> > > invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> > > it will directly invoke the non-notrace function irq_work_queue_on().
> >
> > Just to clarify some things; A function annotated by "notrace" simply
> > will not have the ftrace hook to that function, but that function may
> > very well have tracing triggered inside of it.
> >
> > Functions with "_notrace" in its name (like preempt_disable_notrace())
> > should not have any tracing instrumentation (as Mathieu stated)
> > inside of it, so that it can be used in the tracing infrastructure.
> >
> > raise_softirq_irqoff() has a tracepoint inside of it. If we have the
> > tracing infrastructure call that, and we happen to enable that
> > tracepoint, we will have:
> >
> > raise_softirq_irqoff()
> > trace_softirq_raise()
> > [..]
> > raise_softirq_irqoff()
> > trace_softirq_raise()
> > [..]
> > Ad infinitum!
> >
> > I'm not sure if that's what is being proposed or not, but I just wanted
> > to make sure everyone is aware of the above.
>
> OK, I *think* I might actually understand the problem. Maybe.
>
> I am sure that the usual suspects will not be shy about correcting any
> misapprehensions in the following. ;-)
>
> My guess is that some users of real-time Linux would like to use BPF
> programs while still getting decent latencies out of their systems.
> (Not something I would have predicted, but then again, I was surprised
> some years back to see people with a 4096-CPU system complaining about
> 200-microsecond latency blows from RCU.) And the BPF guys (now CCed)
> made some changes some years back to support this, perhaps most notably
> replacing some uses of preempt_disable() with migrate_disable().
>
> Except that the current __DECLARE_TRACE() macro defeats this work
> for tracepoints by disabling preemption across the tracepoint call,
> which might well be a BPF program. So we need to do something to
> __DECLARE_TRACE() to get the right sort of protection while still leaving
> preemption enabled.
>
> One way of attacking this problem is to use preemptible RCU. The problem
> with this is that although one could construct a trace-safe version
> of rcu_read_unlock(), these would negate some optimizations that Lai
> Jiangshan worked so hard to put in place. Plus those optimizations
> also simplified the code quite a bit. Which is why I was pushing back
> so hard, especially given that I did not realize that real-time systems
> would be running BPF programs concurrently with real-time applications.
> This meant that I was looking for a functional problem with the current
> disabling of preemption, and not finding it.
>
> So another way of dealing with this is to use SRCU-fast, which is
> like SRCU, but dispenses with the smp_mb() calls and the redundant
> read-side array indexing. Plus it is easy to make _notrace variants
> srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
> along with the requisite guards.
>
> Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
> Remove SRCU protection"), and I have hacked together this and the
> prerequisites mentioned in the previous paragraph.
>
> These are passing ridiculously light testing, but probably have at
> least their share of bugs.
>
> But first, do I actually finally understand the problem?
OK, they pass somewhat less ridiculously moderate testing, though I have
not yet hit them over the head with the ftrace selftests.
So might as well post them.
Thoughts?
Thanx, Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-16 22:54 ` Paul E. McKenney
@ 2025-07-17 13:14 ` Mathieu Desnoyers
2025-07-17 14:46 ` Mathieu Desnoyers
` (2 more replies)
2025-07-17 19:04 ` [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers Paul E. McKenney
2025-07-19 0:28 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Paul E. McKenney
2 siblings, 3 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2025-07-17 13:14 UTC (permalink / raw)
To: paulmck, Steven Rostedt
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On 2025-07-16 18:54, Paul E. McKenney wrote:
> On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
>> On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
>>> On Fri, 11 Jul 2025 10:05:26 -0700
>>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>>>
>>>> This trace point will invoke rcu_read_unlock{,_notrace}(), which will
>>>> note that preemption is disabled. If rcutree.use_softirq is set and
>>>> this task is blocking an expedited RCU grace period, it will directly
>>>> invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
>>>> it will directly invoke the non-notrace function irq_work_queue_on().
>>>
>>> Just to clarify some things; A function annotated by "notrace" simply
>>> will not have the ftrace hook to that function, but that function may
>>> very well have tracing triggered inside of it.
>>>
>>> Functions with "_notrace" in its name (like preempt_disable_notrace())
>>> should not have any tracing instrumentation (as Mathieu stated)
>>> inside of it, so that it can be used in the tracing infrastructure.
>>>
>>> raise_softirq_irqoff() has a tracepoint inside of it. If we have the
>>> tracing infrastructure call that, and we happen to enable that
>>> tracepoint, we will have:
>>>
>>> raise_softirq_irqoff()
>>> trace_softirq_raise()
>>> [..]
>>> raise_softirq_irqoff()
>>> trace_softirq_raise()
>>> [..]
>>> Ad infinitum!
>>>
>>> I'm not sure if that's what is being proposed or not, but I just wanted
>>> to make sure everyone is aware of the above.
>>
>> OK, I *think* I might actually understand the problem. Maybe.
>>
>> I am sure that the usual suspects will not be shy about correcting any
>> misapprehensions in the following. ;-)
>>
>> My guess is that some users of real-time Linux would like to use BPF
>> programs while still getting decent latencies out of their systems.
>> (Not something I would have predicted, but then again, I was surprised
>> some years back to see people with a 4096-CPU system complaining about
>> 200-microsecond latency blows from RCU.) And the BPF guys (now CCed)
>> made some changes some years back to support this, perhaps most notably
>> replacing some uses of preempt_disable() with migrate_disable().
>>
>> Except that the current __DECLARE_TRACE() macro defeats this work
>> for tracepoints by disabling preemption across the tracepoint call,
>> which might well be a BPF program. So we need to do something to
>> __DECLARE_TRACE() to get the right sort of protection while still leaving
>> preemption enabled.
>>
>> One way of attacking this problem is to use preemptible RCU. The problem
>> with this is that although one could construct a trace-safe version
>> of rcu_read_unlock(), these would negate some optimizations that Lai
>> Jiangshan worked so hard to put in place. Plus those optimizations
>> also simplified the code quite a bit. Which is why I was pushing back
>> so hard, especially given that I did not realize that real-time systems
>> would be running BPF programs concurrently with real-time applications.
>> This meant that I was looking for a functional problem with the current
>> disabling of preemption, and not finding it.
>>
>> So another way of dealing with this is to use SRCU-fast, which is
>> like SRCU, but dispenses with the smp_mb() calls and the redundant
>> read-side array indexing. Plus it is easy to make _notrace variants
>> srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
>> along with the requisite guards.
>>
>> Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
>> Remove SRCU protection"), and I have hacked together this and the
>> prerequisites mentioned in the previous paragraph.
>>
>> These are passing ridiculously light testing, but probably have at
>> least their share of bugs.
>>
>> But first, do I actually finally understand the problem?
>
> OK, they pass somewhat less ridiculously moderate testing, though I have
> not yet hit them over the head with the ftrace selftests.
>
> So might as well post them.
>
> Thoughts?
Your explanation of the problem context fits my understanding.
Note that I've mostly been pulled into this by Sebastian, who wanted
to better understand how we could make the tracepoint
instrumentation work with bpf probes that need to sleep due to
locking. Hence my original somewhat high-level desiderata.
I'm glad this seems to be converging towards a concrete solution.
There are two things I'm wondering:
1) Would we want to always use srcu-fast (for both preempt and
non-preempt kernels ?), or is there any downside compared to
preempt-off rcu ? (e.g. overhead ?)
If the overhead is similar when actually used by tracers
(I'm talking about actual workload benchmark and not a
microbenchmark), I would tend to err towards simplicity
and to minimize the number of configurations to test, and
use srcu-fast everywhere.
2) I think I'm late to the party in reviewing srcu-fast, I'll
go have a look :)
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 13:14 ` Mathieu Desnoyers
@ 2025-07-17 14:46 ` Mathieu Desnoyers
2025-07-17 15:18 ` Paul E. McKenney
2025-07-17 14:57 ` Alexei Starovoitov
2025-07-17 15:07 ` Paul E. McKenney
2 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2025-07-17 14:46 UTC (permalink / raw)
To: paulmck, Steven Rostedt
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On 2025-07-17 09:14, Mathieu Desnoyers wrote:
> On 2025-07-16 18:54, Paul E. McKenney wrote:
[...]
>
> 2) I think I'm late to the party in reviewing srcu-fast, I'll
> go have a look :)
OK, I'll bite. :) Please let me know where I'm missing something:
Looking at srcu-lite and srcu-fast, I understand that they fundamentally
depend on a trick we published here https://lwn.net/Articles/573497/
"The RCU-barrier menagerie" that allows turning, e.g. this Dekker:
volatile int x = 0, y = 0
CPU 0 CPU 1
x = 1 y = 1
smp_mb smp_mb
r2 = y r4 = x
BUG_ON(r2 == 0 && r4 == 0)
into
volatile int x = 0, y = 0
CPU 0 CPU 1
rcu_read_lock()
x = 1 y = 1
synchronize_rcu()
r2 = y r4 = x
rcu_read_unlock()
BUG_ON(r2 == 0 && r4 == 0)
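For anyone who wants to experiment, the first (fence-based) pattern can be
rendered as a runnable C11 program; a single run only spot-checks the
guarantee rather than proving it:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int x, y;
static int r2, r4;

static void *cpu0(void *arg)
{
	(void)arg;
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb */
	r2 = atomic_load_explicit(&y, memory_order_relaxed);
	return NULL;
}

static void *cpu1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb */
	r4 = atomic_load_explicit(&x, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, cpu0, NULL);
	pthread_create(&t1, NULL, cpu1, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	assert(!(r2 == 0 && r4 == 0));	/* forbidden by the fences */
	return 0;
}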
So looking at srcu-fast, we have:
* Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
* critical sections either because they disables interrupts, because they
* are a single instruction, or because they are a read-modify-write atomic
* operation, depending on the whims of the architecture.
It appears to be pairing, as RCU read-side:
- irq off/on implied by this_cpu_inc
- atomic
- single instruction
with synchronize_rcu within the grace period, in the hope that this behaves
as an smp_mb pairing preventing the srcu read-side critical section from
leaking out of the srcu read lock/unlock.
I note that there is a validation that rcu_is_watching() within
__srcu_read_lock_fast, but it's one thing to have RCU watching and
another to have an actual read-side critical section. Note that
preemption, irqs, and softirqs can very well be enabled when calling
__srcu_read_lock_fast.
My understanding of how memory barriers implemented with RCU
work is that we need to surround the memory accesses on the fast path
(where we turn smp_mb into barrier) with an RCU read-side critical
section to make sure it does not span across a synchronize_rcu.
What I am missing here is how an RCU read-side critical section that
consists only of the irq off/on, the atomic, or the single instruction
can cover all the memory accesses we are trying to order, namely those
within the srcu critical section after the compiler barrier(). Is having
RCU watching sufficient to guarantee this?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 13:14 ` Mathieu Desnoyers
2025-07-17 14:46 ` Mathieu Desnoyers
@ 2025-07-17 14:57 ` Alexei Starovoitov
2025-07-17 15:12 ` Steven Rostedt
2025-07-17 15:30 ` Paul E. McKenney
2025-07-17 15:07 ` Paul E. McKenney
2 siblings, 2 replies; 49+ messages in thread
From: Alexei Starovoitov @ 2025-07-17 14:57 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Paul E. McKenney, Steven Rostedt, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 6:14 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2025-07-16 18:54, Paul E. McKenney wrote:
> > On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
> >> On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
> >>> On Fri, 11 Jul 2025 10:05:26 -0700
> >>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >>>
> >>>> This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> >>>> note that preemption is disabled. If rcutree.use_softirq is set and
> >>>> this task is blocking an expedited RCU grace period, it will directly
> >>>> invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> >>>> it will directly invoke the non-notrace function irq_work_queue_on().
> >>>
> >>> Just to clarify some things; A function annotated by "notrace" simply
> >>> will not have the ftrace hook to that function, but that function may
> >>> very well have tracing triggered inside of it.
> >>>
> >>> Functions with "_notrace" in its name (like preempt_disable_notrace())
> >>> should not have any tracing instrumentation (as Mathieu stated)
> >>> inside of it, so that it can be used in the tracing infrastructure.
> >>>
> >>> raise_softirq_irqoff() has a tracepoint inside of it. If we have the
> >>> tracing infrastructure call that, and we happen to enable that
> >>> tracepoint, we will have:
> >>>
> >>> raise_softirq_irqoff()
> >>> trace_softirq_raise()
> >>> [..]
> >>> raise_softirq_irqoff()
> >>> trace_softirq_raise()
> >>> [..]
> >>> Ad infinitum!
> >>>
> >>> I'm not sure if that's what is being proposed or not, but I just wanted
> >>> to make sure everyone is aware of the above.
> >>
> >> OK, I *think* I might actually understand the problem. Maybe.
> >>
> >> I am sure that the usual suspects will not be shy about correcting any
> >> misapprehensions in the following. ;-)
> >>
> >> My guess is that some users of real-time Linux would like to use BPF
> >> programs while still getting decent latencies out of their systems.
> >> (Not something I would have predicted, but then again, I was surprised
> >> some years back to see people with a 4096-CPU system complaining about
> >> 200-microsecond latency blows from RCU.) And the BPF guys (now CCed)
> >> made some changes some years back to support this, perhaps most notably
> >> replacing some uses of preempt_disable() with migrate_disable().
> >>
> >> Except that the current __DECLARE_TRACE() macro defeats this work
> >> for tracepoints by disabling preemption across the tracepoint call,
> >> which might well be a BPF program. So we need to do something to
> >> __DECLARE_TRACE() to get the right sort of protection while still leaving
> >> preemption enabled.
> >>
> >> One way of attacking this problem is to use preemptible RCU. The problem
> >> with this is that although one could construct a trace-safe version
> >> of rcu_read_unlock(), these would negate some optimizations that Lai
> >> Jiangshan worked so hard to put in place. Plus those optimizations
> >> also simplified the code quite a bit. Which is why I was pushing back
> >> so hard, especially given that I did not realize that real-time systems
> >> would be running BPF programs concurrently with real-time applications.
> >> This meant that I was looking for a functional problem with the current
> >> disabling of preemption, and not finding it.
> >>
> >> So another way of dealing with this is to use SRCU-fast, which is
> >> like SRCU, but dispenses with the smp_mb() calls and the redundant
> >> read-side array indexing. Plus it is easy to make _notrace variants
> >> srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
> >> along with the requisite guards.
> >>
> >> Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
> >> Remove SRCU protection"), and I have hacked together this and the
> >> prerequisites mentioned in the previous paragraph.
> >>
> >> These are passing ridiculously light testing, but probably have at
> >> least their share of bugs.
> >>
> >> But first, do I actually finally understand the problem?
> >
> > OK, they pass somewhat less ridiculously moderate testing, though I have
> > not yet hit them over the head with the ftrace selftests.
> >
> > So might as well post them.
> >
> > Thoughts?
>
> Your explanation of the problem context fits my understanding.
>
> Note that I've mostly been pulled into this by Sebastian who wanted
> to understand better the how we could make the tracepoint
> instrumentation work with bpf probes that need to sleep due to
> locking. Hence my original somewhat high-level desiderata.
I still don't understand what problem is being solved.
As current tracepoint code stands there is no issue with it at all
on PREEMPT_RT from bpf pov.
bpf progs that attach to tracepoints are not sleepable.
They don't call rt_spinlock either.
Recognizing tracepoints that can sleep/fault and allowing
sleepable bpf progs there is on our to-do list,
but afaik it doesn't need any changes to the tracepoint infra.
There is no need to replace existing preempt_disable wrappers
with sleepable srcu_fast or anything else.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 13:14 ` Mathieu Desnoyers
2025-07-17 14:46 ` Mathieu Desnoyers
2025-07-17 14:57 ` Alexei Starovoitov
@ 2025-07-17 15:07 ` Paul E. McKenney
2 siblings, 0 replies; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-17 15:07 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 09:14:41AM -0400, Mathieu Desnoyers wrote:
> On 2025-07-16 18:54, Paul E. McKenney wrote:
> > On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
> > > > On Fri, 11 Jul 2025 10:05:26 -0700
> > > > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > > >
> > > > > This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> > > > > note that preemption is disabled. If rcutree.use_softirq is set and
> > > > > this task is blocking an expedited RCU grace period, it will directly
> > > > > invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> > > > > it will directly invoke the non-notrace function irq_work_queue_on().
> > > >
> > > > Just to clarify some things; A function annotated by "notrace" simply
> > > > will not have the ftrace hook to that function, but that function may
> > > > very well have tracing triggered inside of it.
> > > >
> > > > Functions with "_notrace" in its name (like preempt_disable_notrace())
> > > > should not have any tracing instrumentation (as Mathieu stated)
> > > > inside of it, so that it can be used in the tracing infrastructure.
> > > >
> > > > raise_softirq_irqoff() has a tracepoint inside of it. If we have the
> > > > tracing infrastructure call that, and we happen to enable that
> > > > tracepoint, we will have:
> > > >
> > > > raise_softirq_irqoff()
> > > > trace_softirq_raise()
> > > > [..]
> > > > raise_softirq_irqoff()
> > > > trace_softirq_raise()
> > > > [..]
> > > > Ad infinitum!
> > > >
> > > > I'm not sure if that's what is being proposed or not, but I just wanted
> > > > to make sure everyone is aware of the above.
> > >
> > > OK, I *think* I might actually understand the problem. Maybe.
> > >
> > > I am sure that the usual suspects will not be shy about correcting any
> > > misapprehensions in the following. ;-)
> > >
> > > My guess is that some users of real-time Linux would like to use BPF
> > > programs while still getting decent latencies out of their systems.
> > > (Not something I would have predicted, but then again, I was surprised
> > > some years back to see people with a 4096-CPU system complaining about
> > > 200-microsecond latency blows from RCU.) And the BPF guys (now CCed)
> > > made some changes some years back to support this, perhaps most notably
> > > replacing some uses of preempt_disable() with migrate_disable().
> > >
> > > Except that the current __DECLARE_TRACE() macro defeats this work
> > > for tracepoints by disabling preemption across the tracepoint call,
> > > which might well be a BPF program. So we need to do something to
> > > __DECLARE_TRACE() to get the right sort of protection while still leaving
> > > preemption enabled.
> > >
> > > One way of attacking this problem is to use preemptible RCU. The problem
> > > with this is that although one could construct a trace-safe version
> > > of rcu_read_unlock(), these would negate some optimizations that Lai
> > > Jiangshan worked so hard to put in place. Plus those optimizations
> > > also simplified the code quite a bit. Which is why I was pushing back
> > > so hard, especially given that I did not realize that real-time systems
> > > would be running BPF programs concurrently with real-time applications.
> > > This meant that I was looking for a functional problem with the current
> > > disabling of preemption, and not finding it.
> > >
> > > So another way of dealing with this is to use SRCU-fast, which is
> > > like SRCU, but dispenses with the smp_mb() calls and the redundant
> > > read-side array indexing. Plus it is easy to make _notrace variants
> > > srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
> > > along with the requisite guards.
> > >
> > > Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
> > > Remove SRCU protection"), and I have hacked together this and the
> > > prerequisites mentioned in the previous paragraph.
> > >
> > > These are passing ridiculously light testing, but probably have at
> > > least their share of bugs.
> > >
> > > But first, do I actually finally understand the problem?
> >
> > OK, they pass somewhat less ridiculously moderate testing, though I have
> > not yet hit them over the head with the ftrace selftests.
> >
> > So might as well post them.
> >
> > Thoughts?
>
> Your explanation of the problem context fits my understanding.
>
> Note that I've mostly been pulled into this by Sebastian who wanted
> to understand better the how we could make the tracepoint
> instrumentation work with bpf probes that need to sleep due to
> locking. Hence my original somewhat high-level desiderata.
>
> I'm glad this seems to be converging towards a concrete solution.
>
> There are two things I'm wondering:
>
> 1) Would we want to always use srcu-fast (for both preempt and
> non-preempt kernels ?), or is there any downside compared to
> preempt-off rcu ? (e.g. overhead ?)
For kernels built with CONFIG_PREEMPT_DYNAMIC=n and either
CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y, non-preemptible
RCU would be faster. I did consider this, but decided to keep the
initial patch simple.
> If the overhead is similar when actually used by tracers
> (I'm talking about actual workload benchmark and not a
> microbenchmark), I would tend to err towards simplicity
> and to minimize the number of configurations to test, and
> use srcu-fast everywhere.
To this point, I was wondering whether it is still necessary to do the
call_rcu() stage, but left it because that is the safe mistake to make.
I am testing a fifth patch that removes the early-boot deferral of
call_srcu() because call_srcu() now does exactly this deferral internally.
> 2) I think I'm late to the party in reviewing srcu-fast, I'll
> go have a look :)
Looking forward to seeing what you come up with!
I deferred one further optimization, namely statically classifying
srcu_struct structures as intended for vanilla, _nmisafe(), or _fast()
use, or at least doing so at initialization time. This would get rid
of the call to srcu_check_read_flavor_force() in srcu_read_lock_fast(),
srcu_read_unlock_fast(), and friends, or at least to tuck it under
CONFIG_PROVE_RCU. On my laptop, this saves an additional 25%, though
that 25% amounts to a big half of a nanosecond.
Thoughts?
Thanx, Paul
> Thanks,
>
> Mathieu
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 14:57 ` Alexei Starovoitov
@ 2025-07-17 15:12 ` Steven Rostedt
2025-07-17 15:27 ` Alexei Starovoitov
2025-07-17 15:30 ` Paul E. McKenney
1 sibling, 1 reply; 49+ messages in thread
From: Steven Rostedt @ 2025-07-17 15:12 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Mathieu Desnoyers, Paul E. McKenney, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, 17 Jul 2025 07:57:27 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> I still don't understand what problem is being solved.
> As current tracepoint code stands there is no issue with it at all
> on PREEMPT_RT from bpf pov.
> bpf progs that attach to tracepoints are not sleepable.
> They don't call rt_spinlock either.
> Recognizing tracepoints that can sleep/fault and allow
> sleepable bpf progs there is on our to do list,
> but afaik it doesn't need any changes to tracepoint infra.
> There is no need to replace existing preempt_disable wrappers
> with sleepable srcu_fast or anything else.
From the PREEMPT_RT point of view, it wants BPF to be preemptible. BPF
may stop migration, but suppose someone adds a long-running BPF program
(when I say long running, it could be anything more than 10us) and it
executes in a low-priority task. If that BPF program is not preemptible,
it can delay a high-priority task from running. That defeats the
purpose of PREEMPT_RT.
If this is unsolvable, then we will need to make PREEMPT_RT and BPF
mutually exclusive in the configs.
-- Steve
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 14:46 ` Mathieu Desnoyers
@ 2025-07-17 15:18 ` Paul E. McKenney
2025-07-17 19:36 ` Mathieu Desnoyers
0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-17 15:18 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 10:46:46AM -0400, Mathieu Desnoyers wrote:
> On 2025-07-17 09:14, Mathieu Desnoyers wrote:
> > On 2025-07-16 18:54, Paul E. McKenney wrote:
> [...]
> >
> > 2) I think I'm late to the party in reviewing srcu-fast, I'll
> > go have a look :)
>
> OK, I'll bite. :) Please let me know where I'm missing something:
>
> Looking at srcu-lite and srcu-fast, I understand that they fundamentally
> depend on a trick we published here https://lwn.net/Articles/573497/
> "The RCU-barrier menagerie" that allows turning, e.g. this Dekker:
>
> volatile int x = 0, y = 0
>
> CPU 0 CPU 1
>
> x = 1 y = 1
> smp_mb smp_mb
> r2 = y r4 = x
>
> BUG_ON(r2 == 0 && r4 == 0)
>
> into
>
> volatile int x = 0, y = 0
>
> CPU 0 CPU 1
>
> rcu_read_lock()
> x = 1 y = 1
> synchronize_rcu()
> r2 = y r4 = x
> rcu_read_unlock()
>
> BUG_ON(r2 == 0 && r4 == 0)
>
> So looking at srcu-fast, we have:
>
> * Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
> * critical sections either because they disables interrupts, because they
> * are a single instruction, or because they are a read-modify-write atomic
> * operation, depending on the whims of the architecture.
>
> It appears to be pairing, as RCU read-side:
>
> - irq off/on implied by this_cpu_inc
> - atomic
> - single instruction
>
> with synchronize_rcu within the grace period, and hope that this behaves as a
> smp_mb pairing preventing the srcu read-side critical section from leaking
> out of the srcu read lock/unlock.
>
> I note that there is a validation that rcu_is_watching() within
> __srcu_read_lock_fast, but it's one thing to have rcu watching, but
> another to have an actual read-side critical section. Note that
> preemption, irqs, softirqs can very well be enabled when calling
> __srcu_read_lock_fast.
>
> My understanding of the how memory barriers implemented with RCU
> work is that we need to surround the memory accesses on the fast-path
> (where we turn smp_mb into barrier) with an RCU read-side critical
> section to make sure it does not spawn across a synchronize_rcu.
>
> What I am missing here is how can a RCU side-side that only consist
> of the irq off/on or atomic or single instruction cover all memory
> accesses we are trying to order, namely those within the srcu
> critical section after the compiler barrier() ? Is having RCU
> watching sufficient to guarantee this ?
Good eyes!!!
The trick is that this "RCU read-side critical section" consists only of
either this_cpu_inc() or atomic_long_inc(), with the latter only happening
in systems that have NMIs, but don't have NMI-safe per-CPU operations.
Neither this_cpu_inc() nor atomic_long_inc() can be interrupted, and
thus both act as an interrupts-disabled RCU read-side critical section.
Therefore, if the SRCU grace-period computation fails to see an
srcu_read_lock_fast() increment, its earlier code is guaranteed to
happen before the corresponding critical section. Similarly, if the SRCU
grace-period computation sees an srcu_read_unlock_fast(), its subsequent
code is guaranteed to happen after the corresponding critical section.
Does that help? If so, would you be interested in nominating a comment?
Or am I missing something subtle here?
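In case it helps as a starting point, here is an informal draft of such a
comment; it only restates the argument above, and the wording is illustrative
rather than taken from any patch:

/*
 * The read-side "critical section" here is only the this_cpu_inc()
 * (or, on systems lacking NMI-safe per-CPU operations, the
 * atomic_long_inc()).  Neither can be interrupted, so each acts as an
 * interrupts-disabled RCU read-side critical section.  Therefore, if
 * the SRCU grace-period computation fails to see the
 * srcu_read_lock_fast() increment, its earlier code is guaranteed to
 * happen before the corresponding reader critical section, and if it
 * sees the srcu_read_unlock_fast() update, its subsequent code is
 * guaranteed to happen after that critical section.
 */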
Either way, many thanks for digging into this!!!
Thanx, Paul
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:12 ` Steven Rostedt
@ 2025-07-17 15:27 ` Alexei Starovoitov
2025-07-17 15:40 ` Steven Rostedt
2025-07-17 15:44 ` Paul E. McKenney
0 siblings, 2 replies; 49+ messages in thread
From: Alexei Starovoitov @ 2025-07-17 15:27 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Paul E. McKenney, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 8:12 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 17 Jul 2025 07:57:27 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > I still don't understand what problem is being solved.
> > As current tracepoint code stands there is no issue with it at all
> > on PREEMPT_RT from bpf pov.
> > bpf progs that attach to tracepoints are not sleepable.
> > They don't call rt_spinlock either.
> > Recognizing tracepoints that can sleep/fault and allow
> > sleepable bpf progs there is on our to do list,
> > but afaik it doesn't need any changes to tracepoint infra.
> > There is no need to replace existing preempt_disable wrappers
> > with sleepable srcu_fast or anything else.
>
> From the PREEMPT_RT point of view, it wants BPF to be preemptable. It
> may stop migration, but if someone adds a long running BPF program
> (when I say long running, it could be anything more than 10us), and it
> executes on a low priority task. If that BPF program is not preemptable
> it can delay a high priority task from running. That defeats the
> purpose of PREEMPT_RT.
>
> If this is unsolvable, then we will need to make PREEMPT_RT and BPF
> mutually exclusive in the configs.
Stop this fud, please.
bpf progs were preemptible for years and had no issue in RT.
tracepoints are using preempt_disable() still and that's a
tracepoint infra problem. Nothing to do with users of tracepoints.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 14:57 ` Alexei Starovoitov
2025-07-17 15:12 ` Steven Rostedt
@ 2025-07-17 15:30 ` Paul E. McKenney
1 sibling, 0 replies; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-17 15:30 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Mathieu Desnoyers, Steven Rostedt, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 07:57:27AM -0700, Alexei Starovoitov wrote:
> On Thu, Jul 17, 2025 at 6:14 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> > On 2025-07-16 18:54, Paul E. McKenney wrote:
> > > On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
> > >> On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
> > >>> On Fri, 11 Jul 2025 10:05:26 -0700
> > >>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > >>>
> > >>>> This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> > >>>> note that preemption is disabled. If rcutree.use_softirq is set and
> > >>>> this task is blocking an expedited RCU grace period, it will directly
> > >>>> invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> > >>>> it will directly invoke the non-notrace function irq_work_queue_on().
> > >>>
> > >>> Just to clarify some things; A function annotated by "notrace" simply
> > >>> will not have the ftrace hook to that function, but that function may
> > >>> very well have tracing triggered inside of it.
> > >>>
> > >>> Functions with "_notrace" in its name (like preempt_disable_notrace())
> > >>> should not have any tracing instrumentation (as Mathieu stated)
> > >>> inside of it, so that it can be used in the tracing infrastructure.
> > >>>
> > >>> raise_softirq_irqoff() has a tracepoint inside of it. If we have the
> > >>> tracing infrastructure call that, and we happen to enable that
> > >>> tracepoint, we will have:
> > >>>
> > >>> raise_softirq_irqoff()
> > >>> trace_softirq_raise()
> > >>> [..]
> > >>> raise_softirq_irqoff()
> > >>> trace_softirq_raise()
> > >>> [..]
> > >>> Ad infinitum!
> > >>>
> > >>> I'm not sure if that's what is being proposed or not, but I just wanted
> > >>> to make sure everyone is aware of the above.
> > >>
> > >> OK, I *think* I might actually understand the problem. Maybe.
> > >>
> > >> I am sure that the usual suspects will not be shy about correcting any
> > >> misapprehensions in the following. ;-)
> > >>
> > >> My guess is that some users of real-time Linux would like to use BPF
> > >> programs while still getting decent latencies out of their systems.
> > >> (Not something I would have predicted, but then again, I was surprised
> > >> some years back to see people with a 4096-CPU system complaining about
> > >> 200-microsecond latency blows from RCU.) And the BPF guys (now CCed)
> > >> made some changes some years back to support this, perhaps most notably
> > >> replacing some uses of preempt_disable() with migrate_disable().
> > >>
> > >> Except that the current __DECLARE_TRACE() macro defeats this work
> > >> for tracepoints by disabling preemption across the tracepoint call,
> > >> which might well be a BPF program. So we need to do something to
> > >> __DECLARE_TRACE() to get the right sort of protection while still leaving
> > >> preemption enabled.
> > >>
> > >> One way of attacking this problem is to use preemptible RCU. The problem
> > >> with this is that although one could construct a trace-safe version
> > >> of rcu_read_unlock(), these would negate some optimizations that Lai
> > >> Jiangshan worked so hard to put in place. Plus those optimizations
> > >> also simplified the code quite a bit. Which is why I was pushing back
> > >> so hard, especially given that I did not realize that real-time systems
> > >> would be running BPF programs concurrently with real-time applications.
> > >> This meant that I was looking for a functional problem with the current
> > >> disabling of preemption, and not finding it.
> > >>
> > >> So another way of dealing with this is to use SRCU-fast, which is
> > >> like SRCU, but dispenses with the smp_mb() calls and the redundant
> > >> read-side array indexing. Plus it is easy to make _notrace variants
> > >> srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
> > >> along with the requisite guards.
> > >>
> > >> Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
> > >> Remove SRCU protection"), and I have hacked together this and the
> > >> prerequisites mentioned in the previous paragraph.
> > >>
> > >> These are passing ridiculously light testing, but probably have at
> > >> least their share of bugs.
> > >>
> > >> But first, do I actually finally understand the problem?
> > >
> > > OK, they pass somewhat less ridiculously moderate testing, though I have
> > > not yet hit them over the head with the ftrace selftests.
> > >
> > > So might as well post them.
> > >
> > > Thoughts?
> >
> > Your explanation of the problem context fits my understanding.
> >
> > Note that I've mostly been pulled into this by Sebastian who wanted
> > to understand better the how we could make the tracepoint
> > instrumentation work with bpf probes that need to sleep due to
> > locking. Hence my original somewhat high-level desiderata.
>
> I still don't understand what problem is being solved.
> As current tracepoint code stands there is no issue with it at all
> on PREEMPT_RT from bpf pov.
> bpf progs that attach to tracepoints are not sleepable.
> They don't call rt_spinlock either.
> Recognizing tracepoints that can sleep/fault and allow
> sleepable bpf progs there is on our to do list,
> but afaik it doesn't need any changes to tracepoint infra.
> There is no need to replace existing preempt_disable wrappers
> with sleepable srcu_fast or anything else.
As I understand it, functionally there is no problem.
And if the BPF program has been attached to preemption-disabled code,
there is also no unnecessary damage to real-time latency. Preemption is
presumably disabled for a reason, and that reason must be respected.
Presumably, people using BPF in real-time systems are careful either to
avoid attaching BPF programs to non-preemptible code or to carefully
limit the runtime of those BPF programs.
But if the BPF program has been attached to code having preemption
enabled, then the current call to guard(preempt_notrace)() within
__DECLARE_TRACE() means that the entire BPF program is running with
preemption disabled, and for no good reason.
Although this is not a functional issue, it is a first-class bug in
terms of needlessly degrading real-time latency.
And even though it isn't what Sebastian originally asked Mathieu for
help with, it still needs to be fixed.
My current offer is replacing that guard(preempt_notrace)() with its
SRCU-fast counterpart of guard(srcu_fast_notrace)(&tracepoint_srcu).
Light testing is going well thus far, though I clearly need the real-time
guys to both review and test.
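To make the shape of that change concrete, the __do_trace_##name portion of
__DECLARE_TRACE() quoted earlier in the thread would end up looking roughly
like this (illustrative sketch only, not the actual patch):

static inline void __do_trace_##name(proto)				\
{									\
	if (cond) {							\
		guard(srcu_fast_notrace)(&tracepoint_srcu);		\
		__DO_TRACE_CALL(name, TP_ARGS(args));			\
	}								\
}									\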
I am sure that Mathieu, Sebastian, and the rest won't be shy about
correcting any confusion on my part. ;-)
Thanx, Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:27 ` Alexei Starovoitov
@ 2025-07-17 15:40 ` Steven Rostedt
2025-07-17 15:55 ` Steven Rostedt
2025-07-17 15:44 ` Paul E. McKenney
1 sibling, 1 reply; 49+ messages in thread
From: Steven Rostedt @ 2025-07-17 15:40 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Mathieu Desnoyers, Paul E. McKenney, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, 17 Jul 2025 08:27:24 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Thu, Jul 17, 2025 at 8:12 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Thu, 17 Jul 2025 07:57:27 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> > > I still don't understand what problem is being solved.
> > > As current tracepoint code stands there is no issue with it at all
> > > on PREEMPT_RT from bpf pov.
> > > bpf progs that attach to tracepoints are not sleepable.
>
> Stop this fud, please.
No fud. Above you said they are not sleepable. I guess that just means
they don't call schedule?
>
> bpf progs were preemptible for years and had no issue in RT.
> tracepoints are using preempt_disable() still and that's a
> tracepoint infra problem. Nothing to do with users of tracepoints.
Yes, it is a tracepoint infra problem that we are trying to solve. The
reason we are trying to solve it is that BPF programs can extend the
time a tracepoint takes. If anything else extended the time, this would
need to be solved as well, but currently it's only BPF programs that
cause the issue.
Let me explain the problem being solved then.
Tracepoints currently disable preemption with preempt_disable_notrace()
to allow things to be synchronized correctly. But with the disabling of
preemption for the users of tracepoints (BPF programs being one of
them), it can result in large latency for RT tasks.
The request is to use RCU instead, as RCU in PREEMPT_RT can be
preempted. That means we need an rcu_read_lock_notrace() version, one
that doesn't call into any tracing infrastructure (like the current
rcu_read_lock() does); otherwise there could be an infinite recursion.
The discussion at hand is how we come up with an
rcu_read_lock_notrace(), and using rcu_read_lock_fast() may be one of
the solutions, as it can be implemented without triggering the tracing
infrastructure.
Does that explain it better?
-- Steve
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:27 ` Alexei Starovoitov
2025-07-17 15:40 ` Steven Rostedt
@ 2025-07-17 15:44 ` Paul E. McKenney
1 sibling, 0 replies; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-17 15:44 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Steven Rostedt, Mathieu Desnoyers, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 08:27:24AM -0700, Alexei Starovoitov wrote:
> On Thu, Jul 17, 2025 at 8:12 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Thu, 17 Jul 2025 07:57:27 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> > > I still don't understand what problem is being solved.
> > > As current tracepoint code stands there is no issue with it at all
> > > on PREEMPT_RT from bpf pov.
> > > bpf progs that attach to tracepoints are not sleepable.
> > > They don't call rt_spinlock either.
> > > Recognizing tracepoints that can sleep/fault and allow
> > > sleepable bpf progs there is on our to do list,
> > > but afaik it doesn't need any changes to tracepoint infra.
> > > There is no need to replace existing preempt_disable wrappers
> > > with sleepable srcu_fast or anything else.
> >
> > From the PREEMPT_RT point of view, it wants BPF to be preemptable. It
> > may stop migration, but if someone adds a long running BPF program
> > (when I say long running, it could be anything more than 10us), and it
> > executes on a low priority task. If that BPF program is not preemptable
> > it can delay a high priority task from running. That defeats the
> > purpose of PREEMPT_RT.
> >
> > If this is unsolvable, then we will need to make PREEMPT_RT and BPF
> > mutually exclusive in the configs.
>
> Stop this fud, please.
>
> bpf progs were preemptible for years and had no issue in RT.
> tracepoints are using preempt_disable() still and that's a
> tracepoint infra problem. Nothing to do with users of tracepoints.
And the tracepoint infrastructure is in fact where my proposed fix
is located.
To be fair, several upgrades to SRCU-fast were required to handle this
particular use case.
But to Steven's point, if there was no feasible fix, wherever that fix
might be, then users of real-time Linux would (at best!) need to be *very*
careful about how they used BPF. In fact, those having safety-critical
applications might well choose to turn BPF off entirely, just to prevent
accidents.
Thanx, Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:40 ` Steven Rostedt
@ 2025-07-17 15:55 ` Steven Rostedt
2025-07-17 16:02 ` Alexei Starovoitov
2025-07-17 16:04 ` Paul E. McKenney
0 siblings, 2 replies; 49+ messages in thread
From: Steven Rostedt @ 2025-07-17 15:55 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Mathieu Desnoyers, Paul E. McKenney, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, 17 Jul 2025 11:40:28 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> Yes, it is a tracepoint infra problem that we are trying to solve. The
> reason we are trying to solve it is because BPF programs can extend the
> time a tracepoint takes. If anything else extended the time, this would
> need to be solved as well. But currently it's only BPF programs that
> cause the issue.
BTW, if we can't solve this issue and something else came along and
attached to tracepoints that caused unbounded latency, I would also
argue that whatever came along would need to be prevented from being
configured with PREEMPT_RT. My comment wasn't a strike against BPF
programs; it was a strike against something adding unbounded latency
into a critical section that has preemption disabled.
-- Steve
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:55 ` Steven Rostedt
@ 2025-07-17 16:02 ` Alexei Starovoitov
2025-07-17 16:19 ` Steven Rostedt
2025-07-17 17:38 ` Mathieu Desnoyers
2025-07-17 16:04 ` Paul E. McKenney
1 sibling, 2 replies; 49+ messages in thread
From: Alexei Starovoitov @ 2025-07-17 16:02 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Paul E. McKenney, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 8:55 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 17 Jul 2025 11:40:28 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > Yes, it is a tracepoint infra problem that we are trying to solve. The
> > reason we are trying to solve it is because BPF programs can extend the
> > time a tracepoint takes. If anything else extended the time, this would
> > need to be solved as well. But currently it's only BPF programs that
> > cause the issue.
>
> BTW, if we can't solve this issue and something else came along and
> attached to tracepoints that caused unbounded latency, I would also
> argue that whatever came along would need to be prevented from being
> configured with PREEMPT_RT. My comment wasn't a strike against BPF
> programs; It was a strike against something adding unbounded latency
> into a critical section that has preemption disabled.
Stop blaming the users. Tracepoints disable preemption for no
good reason and you keep shifting the blame.
Fix tracepoint infra.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:55 ` Steven Rostedt
2025-07-17 16:02 ` Alexei Starovoitov
@ 2025-07-17 16:04 ` Paul E. McKenney
1 sibling, 0 replies; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-17 16:04 UTC (permalink / raw)
To: Steven Rostedt
Cc: Alexei Starovoitov, Mathieu Desnoyers, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 11:55:10AM -0400, Steven Rostedt wrote:
> On Thu, 17 Jul 2025 11:40:28 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > Yes, it is a tracepoint infra problem that we are trying to solve. The
> > reason we are trying to solve it is because BPF programs can extend the
> > time a tracepoint takes. If anything else extended the time, this would
> > need to be solved as well. But currently it's only BPF programs that
> > cause the issue.
>
> BTW, if we can't solve this issue and something else came along and
> attached to tracepoints that caused unbounded latency, I would also
> argue that whatever came along would need to be prevented from being
> configured with PREEMPT_RT. My comment wasn't a strike against BPF
> programs; It was a strike against something adding unbounded latency
> into a critical section that has preemption disabled.
I would imagine that many PREEMPT_RT installations would have a severe
and heavy-weight review, test, and approval process for BPF programs
intended for use in production systems...
Thanx, Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 16:02 ` Alexei Starovoitov
@ 2025-07-17 16:19 ` Steven Rostedt
2025-07-17 17:38 ` Mathieu Desnoyers
1 sibling, 0 replies; 49+ messages in thread
From: Steven Rostedt @ 2025-07-17 16:19 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Mathieu Desnoyers, Paul E. McKenney, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, 17 Jul 2025 09:02:38 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> Stop blaming the users. Tracepoints disable preemption for now
> good reason and you keep shifting the blame.
> Fix tracepoint infra.
The fix *is* to have tracepoints not disable preemption. Your first
reply said:
  There is no need to replace existing preempt_disable wrappers
  with sleepable srcu_fast or anything else.
Which is it? Replacing the existing wrappers with preemptible ones *is*
the solution, but you said it isn't needed.
Those using PREEMPT_RT look to blame the users that cause unbounded
latency. When you design an RT system, you look at what causes
unbounded latency and disable it.
Blaming the users is part of the process!
If BPF programs attached to tracepoints cannot be preempted, then they
must not be used with PREEMPT_RT. FULL STOP!
That doesn't mean BPF programs can't be used at all. They just can't be
used when you care about bounded latency.
There are even people looking into a kernel configuration where, if
anything that can cause unbounded latency gets called, it triggers
BUG(), as crashing the kernel in safety-critical systems is actually
better than missing a deadline.
-- Steve
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 16:02 ` Alexei Starovoitov
2025-07-17 16:19 ` Steven Rostedt
@ 2025-07-17 17:38 ` Mathieu Desnoyers
1 sibling, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2025-07-17 17:38 UTC (permalink / raw)
To: Alexei Starovoitov, Steven Rostedt
Cc: Paul E. McKenney, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On 2025-07-17 12:02, Alexei Starovoitov wrote:
> On Thu, Jul 17, 2025 at 8:55 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> On Thu, 17 Jul 2025 11:40:28 -0400
>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>>> Yes, it is a tracepoint infra problem that we are trying to solve. The
>>> reason we are trying to solve it is because BPF programs can extend the
>>> time a tracepoint takes. If anything else extended the time, this would
>>> need to be solved as well. But currently it's only BPF programs that
>>> cause the issue.
>>
>> BTW, if we can't solve this issue and something else came along and
>> attached to tracepoints that caused unbounded latency, I would also
>> argue that whatever came along would need to be prevented from being
>> configured with PREEMPT_RT. My comment wasn't a strike against BPF
>> programs; It was a strike against something adding unbounded latency
>> into a critical section that has preemption disabled.
>
> Stop blaming the users. Tracepoints disable preemption for now
> good reason and you keep shifting the blame.
> Fix tracepoint infra.
I think we're all agreeing here that it's the tracepoint instrumentation
infrastructure that needs to be changed to become more RT friendly, and
that BPF is merely the tracepoint user that happens to have the longest
running callbacks at the moment.
So AFAIU there is no blame on BPF here, and no expectation for any
changes in BPF either.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers
2025-07-16 22:54 ` Paul E. McKenney
2025-07-17 13:14 ` Mathieu Desnoyers
@ 2025-07-17 19:04 ` Paul E. McKenney
2025-07-17 19:19 ` Steven Rostedt
2025-07-19 0:28 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Paul E. McKenney
2 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-17 19:04 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
This adds the usual scoped_guard(srcu_fast, &my_srcu) and
guard(srcu_fast)(&my_srcu).
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
srcu.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 0aa2376cca0b1..ada65b58bc4c5 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -510,6 +510,11 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_struct,
 		    srcu_read_unlock(_T->lock, _T->idx),
 		    int idx)
 
+DEFINE_LOCK_GUARD_1(srcu_fast, struct srcu_struct,
+		    _T->scp = srcu_read_lock_fast(_T->lock),
+		    srcu_read_unlock_fast(_T->lock, _T->scp),
+		    struct srcu_ctr __percpu *scp)
+
 DEFINE_LOCK_GUARD_1(srcu_fast_notrace, struct srcu_struct,
 		    _T->scp = srcu_read_lock_fast_notrace(_T->lock),
 		    srcu_read_unlock_fast_notrace(_T->lock, _T->scp),
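For what it's worth, usage of the new guard would look something like
this (a hypothetical sketch; my_srcu and read_state_sketch() are
made-up names):

	#include <linux/srcu.h>

	DEFINE_SRCU(my_srcu);

	static int read_state_sketch(int *p)
	{
		guard(srcu_fast)(&my_srcu);	/* srcu_read_lock_fast() here... */
		return READ_ONCE(*p);		/* ...unlock_fast() on scope exit */
	}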
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers
2025-07-17 19:04 ` [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers Paul E. McKenney
@ 2025-07-17 19:19 ` Steven Rostedt
2025-07-17 19:51 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Steven Rostedt @ 2025-07-17 19:19 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, 17 Jul 2025 12:04:46 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:
> This adds the usual scoped_guard(srcu_fast, &my_srcu) and
> guard(srcu_fast)(&my_srcu).
>
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-- Steve
> ---
> srcu.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index 0aa2376cca0b1..ada65b58bc4c5 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -510,6 +510,11 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_struct,
> srcu_read_unlock(_T->lock, _T->idx),
> int idx)
>
> +DEFINE_LOCK_GUARD_1(srcu_fast, struct srcu_struct,
> + _T->scp = srcu_read_lock_fast(_T->lock),
> + srcu_read_unlock_fast(_T->lock, _T->scp),
> + struct srcu_ctr __percpu *scp)
> +
> DEFINE_LOCK_GUARD_1(srcu_fast_notrace, struct srcu_struct,
> _T->scp = srcu_read_lock_fast_notrace(_T->lock),
> srcu_read_unlock_fast_notrace(_T->lock, _T->scp),
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:18 ` Paul E. McKenney
@ 2025-07-17 19:36 ` Mathieu Desnoyers
2025-07-17 21:27 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2025-07-17 19:36 UTC (permalink / raw)
To: paulmck
Cc: Steven Rostedt, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On 2025-07-17 11:18, Paul E. McKenney wrote:
> On Thu, Jul 17, 2025 at 10:46:46AM -0400, Mathieu Desnoyers wrote:
>> On 2025-07-17 09:14, Mathieu Desnoyers wrote:
>>> On 2025-07-16 18:54, Paul E. McKenney wrote:
>> [...]
>>>
>>> 2) I think I'm late to the party in reviewing srcu-fast, I'll
>>> go have a look :)
>>
>> OK, I'll bite. :) Please let me know where I'm missing something:
>>
>> Looking at srcu-lite and srcu-fast, I understand that they fundamentally
>> depend on a trick we published here https://lwn.net/Articles/573497/
>> "The RCU-barrier menagerie" that allows turning, e.g. this Dekker:
>>
>> volatile int x = 0, y = 0
>>
>> CPU 0                     CPU 1
>>
>> x = 1                     y = 1
>> smp_mb                    smp_mb
>> r2 = y                    r4 = x
>>
>> BUG_ON(r2 == 0 && r4 == 0)
>>
>> into
>>
>> volatile int x = 0, y = 0
>>
>> CPU 0                     CPU 1
>>
>> rcu_read_lock()
>> x = 1                     y = 1
>>                           synchronize_rcu()
>> r2 = y                    r4 = x
>> rcu_read_unlock()
>>
>> BUG_ON(r2 == 0 && r4 == 0)
>>
>> So looking at srcu-fast, we have:
>>
>> * Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
>> * critical sections either because they disables interrupts, because they
>> * are a single instruction, or because they are a read-modify-write atomic
>> * operation, depending on the whims of the architecture.
>>
>> It appears to be pairing, as RCU read-side:
>>
>> - irq off/on implied by this_cpu_inc
>> - atomic
>> - single instruction
>>
>> with synchronize_rcu within the grace period, and hope that this behaves as a
>> smp_mb pairing preventing the srcu read-side critical section from leaking
>> out of the srcu read lock/unlock.
>>
>> I note that there is a validation that rcu_is_watching() within
>> __srcu_read_lock_fast, but it's one thing to have rcu watching, but
>> another to have an actual read-side critical section. Note that
>> preemption, irqs, softirqs can very well be enabled when calling
>> __srcu_read_lock_fast.
>>
>> My understanding of the how memory barriers implemented with RCU
>> work is that we need to surround the memory accesses on the fast-path
>> (where we turn smp_mb into barrier) with an RCU read-side critical
>> section to make sure it does not spawn across a synchronize_rcu.
>>
>> What I am missing here is how can a RCU side-side that only consist
>> of the irq off/on or atomic or single instruction cover all memory
>> accesses we are trying to order, namely those within the srcu
>> critical section after the compiler barrier() ? Is having RCU
>> watching sufficient to guarantee this ?
>
> Good eyes!!!
>
> The trick is that this "RCU read-side critical section" consists only of
> either this_cpu_inc() or atomic_long_inc(), with the latter only happening
> in systems that have NMIs, but don't have NMI-safe per-CPU operations.
> Neither this_cpu_inc() nor atomic_long_inc() can be interrupted, and
> thus both act as an interrupts-disabled RCU read-side critical section.
>
> Therefore, if the SRCU grace-period computation fails to see an
> srcu_read_lock_fast() increment, its earlier code is guaranteed to
> happen before the corresponding critical section. Similarly, if the SRCU
> grace-period computation sees an srcu_read_unlock_fast(), its subsequent
> code is guaranteed to happen after the corresponding critical section.
>
> Does that help? If so, would you be interested and nominating a comment?
>
> Or am I missing something subtle here?
Here is the root of my concern: considering a single instruction
as an RCU-barrier "read-side" for a classic Dekker would not work,
because the read-side would not cover both memory accesses that need
to be ordered.
I cannot help but notice the similarity between this pattern of
barrier vs synchronize_rcu and what we allow userspace to do with
barrier vs sys_membarrier, which has one implementation
based on synchronize_rcu (except for TICK_NOHZ_FULL). Originally
when membarrier was introduced, this was based on synchronize_sched(),
and I recall that this was OK because userspace execution acted as
a read-side critical section from the perspective of synchronize_sched().
As commented in kernel v4.10 near synchronize_sched():
* Note that this guarantee implies further memory-ordering guarantees.
* On systems with more than one CPU, when synchronize_sched() returns,
* each CPU is guaranteed to have executed a full memory barrier since the
* end of its last RCU-sched read-side critical section whose beginning
* preceded the call to synchronize_sched(). In addition, each CPU having
* an RCU read-side critical section that extends beyond the return from
* synchronize_sched() is guaranteed to have executed a full memory barrier
* after the beginning of synchronize_sched() and before the beginning of
* that RCU read-side critical section. Note that these guarantees include
* CPUs that are offline, idle, or executing in user mode, as well as CPUs
* that are executing in the kernel.
So even though I see how synchronize_rcu() nowadays is still a good
choice to implement sys_membarrier, it only applies to RCU read-side
critical sections, which cover userspace code and the specific
read-side critical sections in the kernel.
But what I don't get is how synchronize_rcu() can help us promote
the barrier() in SRCU-fast to smp_mb when outside of any RCU read-side
critical section tracked by the synchronize_rcu grace period,
mainly because unlike the sys_membarrier scenario, this is *not*
userspace code.
And what we want to order here on the read-side is the lock/unlock
increments vs the memory accesses within the critical section, but
there is no RCU read-side critical section that contains all those
memory accesses matching those synchronize_rcu calls, so the promotion
from barrier to smp_mb doesn't appear to be valid.
But perhaps there is something more that is specific to the SRCU
algorithm that I am missing here?
Thanks,
Mathieu
>
> Either way, many thanks for digging into this!!!
>
> Thanx, Paul
>
>> Thanks,
>>
>> Mathieu
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers
2025-07-17 19:19 ` Steven Rostedt
@ 2025-07-17 19:51 ` Paul E. McKenney
2025-07-17 19:56 ` Steven Rostedt
0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-17 19:51 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 03:19:34PM -0400, Steven Rostedt wrote:
> On Thu, 17 Jul 2025 12:04:46 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > This adds the usual scoped_guard(srcu_fast, &my_srcu) and
> > guard(srcu_fast)(&my_srcu).
> >
> > Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Thank you! I will apply this on my next rebase.
The normal process puts this into the v6.18 merge window, that is, the
merge window after the upcoming one. If you need it sooner, please let
us know.
Thanx, Paul
> -- Steve
>
> > ---
> > srcu.h | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> > index 0aa2376cca0b1..ada65b58bc4c5 100644
> > --- a/include/linux/srcu.h
> > +++ b/include/linux/srcu.h
> > @@ -510,6 +510,11 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_struct,
> > srcu_read_unlock(_T->lock, _T->idx),
> > int idx)
> >
> > +DEFINE_LOCK_GUARD_1(srcu_fast, struct srcu_struct,
> > + _T->scp = srcu_read_lock_fast(_T->lock),
> > + srcu_read_unlock_fast(_T->lock, _T->scp),
> > + struct srcu_ctr __percpu *scp)
> > +
> > DEFINE_LOCK_GUARD_1(srcu_fast_notrace, struct srcu_struct,
> > _T->scp = srcu_read_lock_fast_notrace(_T->lock),
> > srcu_read_unlock_fast_notrace(_T->lock, _T->scp),
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers
2025-07-17 19:51 ` Paul E. McKenney
@ 2025-07-17 19:56 ` Steven Rostedt
2025-07-17 20:38 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Steven Rostedt @ 2025-07-17 19:56 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, 17 Jul 2025 12:51:29 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:
> Thank you! I will apply this on my next rebase.
>
> The normal process puts this into the v6.18 merge window, that it, the
> merge window after the upcoming one. If you need it sooner, please let
> us know.
I'd suggest 6.17. I can understand delaying updates to the RCU logic
itself, given how subtle it can be and how much testing is required.
But adding a guard() helper is pretty trivial and useful to have. I
wouldn't delay it.
Thanks,
-- Steve
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers
2025-07-17 19:56 ` Steven Rostedt
@ 2025-07-17 20:38 ` Paul E. McKenney
0 siblings, 0 replies; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-17 20:38 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 03:56:38PM -0400, Steven Rostedt wrote:
> On Thu, 17 Jul 2025 12:51:29 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > Thank you! I will apply this on my next rebase.
> >
> > The normal process puts this into the v6.18 merge window, that it, the
> > merge window after the upcoming one. If you need it sooner, please let
> > us know.
>
> I'd suggest 6.17. I can understand the delay for updates to the RCU
> logic for how subtle it can be, where more testing is required. But
> adding a guard() helper is pretty trivial and useful to have. I
> wouldn't delay it.
OK thank you, and let's see what we can do in v6.17.
Thanx, Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 19:36 ` Mathieu Desnoyers
@ 2025-07-17 21:27 ` Paul E. McKenney
0 siblings, 0 replies; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-17 21:27 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 03:36:46PM -0400, Mathieu Desnoyers wrote:
> On 2025-07-17 11:18, Paul E. McKenney wrote:
> > On Thu, Jul 17, 2025 at 10:46:46AM -0400, Mathieu Desnoyers wrote:
> > > On 2025-07-17 09:14, Mathieu Desnoyers wrote:
> > > > On 2025-07-16 18:54, Paul E. McKenney wrote:
> > > [...]
> > > >
> > > > 2) I think I'm late to the party in reviewing srcu-fast, I'll
> > > > go have a look :)
> > >
> > > OK, I'll bite. :) Please let me know where I'm missing something:
> > >
> > > Looking at srcu-lite and srcu-fast, I understand that they fundamentally
> > > depend on a trick we published here https://lwn.net/Articles/573497/
> > > "The RCU-barrier menagerie" that allows turning, e.g. this Dekker:
> > >
> > > volatile int x = 0, y = 0
> > >
> > > CPU 0                     CPU 1
> > >
> > > x = 1                     y = 1
> > > smp_mb                    smp_mb
> > > r2 = y                    r4 = x
> > >
> > > BUG_ON(r2 == 0 && r4 == 0)
> > >
> > > into
> > >
> > > volatile int x = 0, y = 0
> > >
> > > CPU 0                     CPU 1
> > >
> > > rcu_read_lock()
> > > x = 1                     y = 1
> > >                           synchronize_rcu()
> > > r2 = y                    r4 = x
> > > rcu_read_unlock()
> > >
> > > BUG_ON(r2 == 0 && r4 == 0)
> > >
> > > So looking at srcu-fast, we have:
> > >
> > > * Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
> > > * critical sections either because they disables interrupts, because they
> > > * are a single instruction, or because they are a read-modify-write atomic
> > > * operation, depending on the whims of the architecture.
> > >
> > > It appears to be pairing, as RCU read-side:
> > >
> > > - irq off/on implied by this_cpu_inc
> > > - atomic
> > > - single instruction
> > >
> > > with synchronize_rcu within the grace period, and hope that this behaves as a
> > > smp_mb pairing preventing the srcu read-side critical section from leaking
> > > out of the srcu read lock/unlock.
> > >
> > > I note that there is a validation that rcu_is_watching() within
> > > __srcu_read_lock_fast, but it's one thing to have rcu watching, but
> > > another to have an actual read-side critical section. Note that
> > > preemption, irqs, softirqs can very well be enabled when calling
> > > __srcu_read_lock_fast.
> > >
> > > My understanding of the how memory barriers implemented with RCU
> > > work is that we need to surround the memory accesses on the fast-path
> > > (where we turn smp_mb into barrier) with an RCU read-side critical
> > > section to make sure it does not spawn across a synchronize_rcu.
> > >
> > > What I am missing here is how can a RCU side-side that only consist
> > > of the irq off/on or atomic or single instruction cover all memory
> > > accesses we are trying to order, namely those within the srcu
> > > critical section after the compiler barrier() ? Is having RCU
> > > watching sufficient to guarantee this ?
> >
> > Good eyes!!!
> >
> > The trick is that this "RCU read-side critical section" consists only of
> > either this_cpu_inc() or atomic_long_inc(), with the latter only happening
> > in systems that have NMIs, but don't have NMI-safe per-CPU operations.
> > Neither this_cpu_inc() nor atomic_long_inc() can be interrupted, and
> > thus both act as an interrupts-disabled RCU read-side critical section.
> >
> > Therefore, if the SRCU grace-period computation fails to see an
> > srcu_read_lock_fast() increment, its earlier code is guaranteed to
> > happen before the corresponding critical section. Similarly, if the SRCU
> > grace-period computation sees an srcu_read_unlock_fast(), its subsequent
> > code is guaranteed to happen after the corresponding critical section.
> >
> > Does that help? If so, would you be interested and nominating a comment?
> >
> > Or am I missing something subtle here?
>
> Here is the root of my concern: considering a single instruction
> as an RCU-barrier "read-side" for a classic Dekker would not work,
> because the read-side would not cover both memory accesses that need
> to be ordered.
You would have a pair of RCU read-side critical sections, and if the
second one was waited on by a given call to synchronize_rcu(), then the
beginning of the first RCU reader would also have preceded that call,
and thus the synchronize_rcu() full-memory-barrier would apply to both
of those RCU read-side critical sections.
But please see below for more detail.
> I cannot help but notice the similarity between this pattern of
> barrier vs synchronize_rcu and what we allow userspace to do with
> barrier vs sys_membarrier, which has one implementation
> based on synchronize_rcu (except for TICK_NOHZ_FULL). Originally
> when membarrier was introduced, this was based on synchronize_sched(),
> and I recall that this was OK because userspace execution acted as
> a read-side critical section from the perspective of synchronize_sched().
> As commented in kernel v4.10 near synchronize_sched():
>
> * Note that this guarantee implies further memory-ordering guarantees.
> * On systems with more than one CPU, when synchronize_sched() returns,
> * each CPU is guaranteed to have executed a full memory barrier since the
> * end of its last RCU-sched read-side critical section whose beginning
> * preceded the call to synchronize_sched(). In addition, each CPU having
> * an RCU read-side critical section that extends beyond the return from
> * synchronize_sched() is guaranteed to have executed a full memory barrier
> * after the beginning of synchronize_sched() and before the beginning of
> * that RCU read-side critical section. Note that these guarantees include
> * CPUs that are offline, idle, or executing in user mode, as well as CPUs
> * that are executing in the kernel.
>
> So even though I see how synchronize_rcu() nowadays is still a good
> choice to implement sys_membarrier, it only apply to RCU read side
> critical sections, which covers userspace code and the specific
> read-side critical sections in the kernel.
Yes, it does only apply to RCU read-side critical sections, but the
synchronize_rcu() comment header explicitly states that, given a region
of code across which interrupts are disabled, that region of code also
acts as an RCU read-side critical section:
* RCU read-side critical sections are delimited by rcu_read_lock()
* and rcu_read_unlock(), and may be nested. In addition, but only in
* v5.0 and later, regions of code across which interrupts, preemption,
* or softirqs have been disabled also serve as RCU read-side critical
* sections. This includes hardware interrupt handlers, softirq handlers,
* and NMI handlers.
Interrupts cannot happen in the midst of either this_cpu_inc() or
atomic_long_inc(), so these do act as tiny RCU readers.
And just to be painfully clear, synchronize_rcu() guarantees full barriers
associated with any reader that sticks out of the synchronize_rcu()
in either direction:
* Note that this guarantee implies further memory-ordering guarantees.
* On systems with more than one CPU, when synchronize_rcu() returns,
* each CPU is guaranteed to have executed a full memory barrier since
* the end of its last RCU read-side critical section whose beginning
* preceded the call to synchronize_rcu(). In addition, each CPU having
* an RCU read-side critical section that extends beyond the return from
* synchronize_rcu() is guaranteed to have executed a full memory barrier
* after the beginning of synchronize_rcu() and before the beginning of
* that RCU read-side critical section. Note that these guarantees include
* CPUs that are offline, idle, or executing in user mode, as well as CPUs
* that are executing in the kernel.
> But what I don't get is how synchronize_rcu() can help us promote
> the barrier() in SRCU-fast to smp_mb when outside of any RCU read-side
> critical section tracked by the synchronize_rcu grace period,
> mainly because unlike the sys_membarrier scenario, this is *not*
> userspace code.
Just for those following along, there is a barrier() call at the
end of __srcu_read_lock_fast() and another at the beginning of
__srcu_read_unlock_fast(), so that is covered.
And then let's take a look at this barrier() form of your store-buffering
litmus test:
volatile int x = 0, y = 0

CPU 0                           CPU 1

WRITE_ONCE(x, 1)                WRITE_ONCE(y, 1)
barrier()                       synchronize_rcu()
r2 = READ_ONCE(y)               r4 = READ_ONCE(x)

BUG_ON(r2 == 0 && r4 == 0)
Each of CPU 0's _ONCE() accesses is implemented with a single instruction,
and each therefore cannot be interrupted. Each therefore acts as a tiny
RCU read-side critical section.
Now, if you put this into an LKMM litmus test, herd7 will in fact say
"Sometimes". But that is because LKMM does not know about interrupts,
let alone the possibility that they might be disabled, and let alone the
fact that atomic operations act like tiny regions of interrupt-disabled
code, let alone the fact that regions of interrupt-disabled code act as
RCU read-side critical sections.
But we can help LKMM out by manually placing the tiny RCU read-side
critical sections, like this:
C C-SB-o-b-o+o-sr-o

{
}

P0(int *x, int *y)
{
	rcu_read_lock();
	WRITE_ONCE(*x, 1);
	rcu_read_unlock();
	barrier();
	rcu_read_lock();
	r2 = READ_ONCE(*y);
	rcu_read_unlock();
}

P1(int *x, int *y)
{
	WRITE_ONCE(*y, 1);
	synchronize_rcu();
	r4 = READ_ONCE(*x);
}

exists (0:r2=0 /\ 1:r4=0)
And if we do this, here is what LKMM has to say about it:
$ herd7 -conf linux-kernel.cfg /tmp/C-SB-o-b-o+o-sr-o.litmus
Test C-SB-o-b-o+o-sr-o Allowed
States 3
0:r2=0; 1:r4=1;
0:r2=1; 1:r4=0;
0:r2=1; 1:r4=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (0:r2=0 /\ 1:r4=0)
Observation C-SB-o-b-o+o-sr-o Never 0 3
Time C-SB-o-b-o+o-sr-o 0.01
Hash=f607a3688b66756d2e85042c75d8c1fa
It says "Never", so SRCU-fast should be good from this memory-ordering
viewpoint.
Or am I missing your point?
> And what we want to order here on the read-side is the lock/unlock
> increments vs the memory accesses within the critical section, but
> there is no RCU read-side that contain all those memory accesses
> that match those synchronize_rcu calls, so the promotion from barrier
> to smp_mb don't appear to be valid.
>
> But perhaps there is something more that is specific to the SRCU
> algorithm that I missing here ?
Again, the RCU reader implied by the tiny interrupts-disabled region of
code implied by an atomic operation should do the trick.
If you are instead accusing me of using a very subtle and tricky
synchronization technique, then I plead guilty to charges as read.
On the other hand, this is the Linux-kernel RCU implementation, so it is
not like people reading this code haven't been at least implicitly warned.
I did provide comments attempting to describe this, for example,
for __srcu_read_lock_fast():
* Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
* critical sections either because they disables interrupts, because they
* are a single instruction, or because they are a read-modify-write atomic
* operation, depending on the whims of the architecture.
I would welcome upgrades that more clearly describe this trick.
The s/disables/disable/ would be one step in the right direction. :-/
And again, many thanks for digging into this!!!
And again, am I missing your point?
Thanx, Paul
> Thanks,
>
> Mathieu
>
> >
> > Either way, many thanks for digging into this!!!
> >
> > Thanx, Paul
> >
> > > Thanks,
> > >
> > > Mathieu
> > >
> > > --
> > > Mathieu Desnoyers
> > > EfficiOS Inc.
> > > https://www.efficios.com
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-16 22:54 ` Paul E. McKenney
2025-07-17 13:14 ` Mathieu Desnoyers
2025-07-17 19:04 ` [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers Paul E. McKenney
@ 2025-07-19 0:28 ` Paul E. McKenney
2 siblings, 0 replies; 49+ messages in thread
From: Paul E. McKenney @ 2025-07-19 0:28 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
Hello!
This is version 2 of a patch series introducing SRCU-fast into
__DECLARE_TRACE() in place of the current preemption disabling.
This change enables preemption of BPF programs attached to tracepoints,
as is required for runtime use of BPF in real-time systems.
The patches are as follows:
1. Move rcu_is_watching() checks to srcu_read_{,un}lock_fast().
2. Add srcu_read_lock_fast_notrace() and
srcu_read_unlock_fast_notrace().
3. Add guards for notrace variants of SRCU-fast readers.
4. Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast.
Changes since RFC version:
o RFC patch 6/4 has been pulled into the shared RCU tree:
e88c632a8698 ("srcu: Add guards for SRCU-fast readers")
o RFC patch 5/4 (which removed the now-unnecessary special boot-time
avoidance of SRCU) has been folded into patch 4/4 shown above,
as suggested by Steven Rostedt.
Thanx, Paul
------------------------------------------------------------------------
b/include/linux/srcu.h | 4 ++++
b/include/linux/srcutree.h | 2 --
b/include/linux/tracepoint.h | 6 ++++--
b/kernel/tracepoint.c | 21 ++++++++++++++++++++-
include/linux/srcu.h | 30 ++++++++++++++++++++++++++++++
5 files changed, 58 insertions(+), 5 deletions(-)
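Purely to illustrate the direction of patch #4 (a sketch, not the actual
diff; tracepoint_srcu_fast is a made-up name for the srcu_struct used):

	/*
	 * Sketch: guard the tracepoint call with SRCU-fast in its notrace
	 * form rather than disabling preemption, so attached callbacks
	 * remain preemptible on PREEMPT_RT.
	 */
	#define __DO_TRACE_SKETCH(name, args)					\
	do {									\
		scoped_guard(srcu_fast_notrace, &tracepoint_srcu_fast)		\
			__DO_TRACE_CALL(name, TP_ARGS(args));			\
	} while (0)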
^ permalink raw reply [flat|nested] 49+ messages in thread
end of thread, other threads:[~2025-07-19 0:28 UTC | newest]
Thread overview: 49+ messages
2025-06-13 15:22 [RFC PATCH 0/2] Switch tracing from sched-RCU to preempt-RCU Sebastian Andrzej Siewior
2025-06-13 15:22 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Sebastian Andrzej Siewior
2025-06-18 17:21 ` Boqun Feng
2025-06-20 8:43 ` Sebastian Andrzej Siewior
2025-06-20 11:23 ` Paul E. McKenney
2025-06-23 10:49 ` Sebastian Andrzej Siewior
2025-06-23 18:13 ` Paul E. McKenney
2025-07-07 21:56 ` Paul E. McKenney
2025-07-08 19:40 ` Mathieu Desnoyers
2025-07-08 20:49 ` Paul E. McKenney
2025-07-09 14:31 ` Mathieu Desnoyers
2025-07-09 18:33 ` Paul E. McKenney
2025-07-11 13:46 ` Mathieu Desnoyers
2025-07-11 17:05 ` Paul E. McKenney
2025-07-14 16:34 ` Paul E. McKenney
2025-07-15 19:56 ` Mathieu Desnoyers
2025-07-15 23:23 ` Paul E. McKenney
2025-07-15 19:54 ` Mathieu Desnoyers
2025-07-15 23:18 ` Paul E. McKenney
2025-07-16 0:42 ` Paul E. McKenney
2025-07-16 4:41 ` Paul E. McKenney
2025-07-16 15:09 ` Steven Rostedt
2025-07-16 20:35 ` Paul E. McKenney
2025-07-16 22:54 ` Paul E. McKenney
2025-07-17 13:14 ` Mathieu Desnoyers
2025-07-17 14:46 ` Mathieu Desnoyers
2025-07-17 15:18 ` Paul E. McKenney
2025-07-17 19:36 ` Mathieu Desnoyers
2025-07-17 21:27 ` Paul E. McKenney
2025-07-17 14:57 ` Alexei Starovoitov
2025-07-17 15:12 ` Steven Rostedt
2025-07-17 15:27 ` Alexei Starovoitov
2025-07-17 15:40 ` Steven Rostedt
2025-07-17 15:55 ` Steven Rostedt
2025-07-17 16:02 ` Alexei Starovoitov
2025-07-17 16:19 ` Steven Rostedt
2025-07-17 17:38 ` Mathieu Desnoyers
2025-07-17 16:04 ` Paul E. McKenney
2025-07-17 15:44 ` Paul E. McKenney
2025-07-17 15:30 ` Paul E. McKenney
2025-07-17 15:07 ` Paul E. McKenney
2025-07-17 19:04 ` [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers Paul E. McKenney
2025-07-17 19:19 ` Steven Rostedt
2025-07-17 19:51 ` Paul E. McKenney
2025-07-17 19:56 ` Steven Rostedt
2025-07-17 20:38 ` Paul E. McKenney
2025-07-19 0:28 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Paul E. McKenney
2025-06-13 15:22 ` [RFC PATCH 2/2] trace: Use rcu_read_lock() instead preempt_disable() Sebastian Andrzej Siewior
2025-06-13 15:38 ` [RFC PATCH 0/2] Switch tracing from sched-RCU to preempt-RCU Mathieu Desnoyers