* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
[not found] ` <20250716110922.0dadc4ec@batman.local.home>
@ 2025-07-16 20:35 ` Paul E. McKenney
2025-07-16 22:54 ` Paul E. McKenney
0 siblings, 1 reply; 29+ messages in thread
From: Paul E. McKenney @ 2025-07-16 20:35 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
> On Fri, 11 Jul 2025 10:05:26 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> > note that preemption is disabled. If rcutree.use_softirq is set and
> > this task is blocking an expedited RCU grace period, it will directly
> > invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> > it will directly invoke the non-notrace function irq_work_queue_on().
>
> Just to clarify some things: a function annotated with "notrace" simply
> will not have the ftrace hook to that function, but that function may
> very well have tracing triggered inside of it.
>
> Functions with "_notrace" in their names (like preempt_disable_notrace())
> should not have any tracing instrumentation inside of them (as Mathieu
> stated), so that they can be used in the tracing infrastructure.
>
> raise_softirq_irqoff() has a tracepoint inside of it. If we have the
> tracing infrastructure call that, and we happen to enable that
> tracepoint, we will have:
>
> raise_softirq_irqoff()
> trace_softirq_raise()
> [..]
> raise_softirq_irqoff()
> trace_softirq_raise()
> [..]
> Ad infinitum!
>
> I'm not sure if that's what is being proposed or not, but I just wanted
> to make sure everyone is aware of the above.
OK, I *think* I might actually understand the problem. Maybe.
I am sure that the usual suspects will not be shy about correcting any
misapprehensions in the following. ;-)
My guess is that some users of real-time Linux would like to use BPF
programs while still getting decent latencies out of their systems.
(Not something I would have predicted, but then again, I was surprised
some years back to see people with a 4096-CPU system complaining about
200-microsecond latency blows from RCU.) And the BPF guys (now CCed)
made some changes some years back to support this, perhaps most notably
replacing some uses of preempt_disable() with migrate_disable().
Except that the current __DECLARE_TRACE() macro defeats this work
for tracepoints by disabling preemption across the tracepoint call,
which might well be a BPF program. So we need to do something to
__DECLARE_TRACE() to get the right sort of protection while still leaving
preemption enabled.
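For reference, the current dispatch has roughly this shape (a heavily
simplified sketch of __DECLARE_TRACE() from include/linux/tracepoint.h,
with details elided; not the exact macro text):

```c
/* Simplified sketch: the guard disables preemption across every
 * attached probe, which may be a BPF program, defeating BPF's
 * migrate_disable() work on PREEMPT_RT.
 */
#define __DECLARE_TRACE(name, proto, args, ...)                         \
	static inline void trace_##name(proto)                          \
	{                                                               \
		if (static_branch_unlikely(&__tracepoint_##name.key)) { \
			guard(preempt_notrace)();                       \
			__do_trace_##name(args);  /* invokes probes */  \
		}                                                       \
	}
```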
One way of attacking this problem is to use preemptible RCU. The problem
with this is that although one could construct a trace-safe version
of rcu_read_unlock(), these would negate some optimizations that Lai
Jiangshan worked so hard to put in place. Plus those optimizations
also simplified the code quite a bit. Which is why I was pushing back
so hard, especially given that I did not realize that real-time systems
would be running BPF programs concurrently with real-time applications.
This meant that I was looking for a functional problem with the current
disabling of preemption, and not finding it.
So another way of dealing with this is to use SRCU-fast, which is
like SRCU, but dispenses with the smp_mb() calls and the redundant
read-side array indexing. Plus it is easy to make _notrace variants
srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
along with the requisite guards.
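If I follow, the tracepoint read side would then look something like
this (a sketch only; the _fast_notrace names follow the proposal above,
while the srcu_struct name and guard spelling are assumptions):

```c
/* Sketch of an SRCU-fast-protected tracepoint dispatch. */
struct srcu_ctr __percpu *scp;

scp = srcu_read_lock_fast_notrace(&tracepoint_srcu);
__do_trace_probe(args);		/* probe may be a preemptible BPF program */
srcu_read_unlock_fast_notrace(&tracepoint_srcu, scp);

/* Or equivalently, with the requisite guard: */
guard(srcu_fast_notrace)(&tracepoint_srcu);
```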
Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
Remove SRCU protection"), and I have hacked together this and the
prerequisites mentioned in the previous paragraph.
These are passing ridiculously light testing, but probably have at
least their share of bugs.
But first, do I actually finally understand the problem?
Thanx, Paul
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-16 20:35 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Paul E. McKenney
@ 2025-07-16 22:54 ` Paul E. McKenney
2025-07-17 13:14 ` Mathieu Desnoyers
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Paul E. McKenney @ 2025-07-16 22:54 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
> > On Fri, 11 Jul 2025 10:05:26 -0700
> > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >
> > > This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> > > note that preemption is disabled. If rcutree.use_softirq is set and
> > > this task is blocking an expedited RCU grace period, it will directly
> > > invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> > > it will directly invoke the non-notrace function irq_work_queue_on().
> >
> > Just to clarify some things; A function annotated by "notrace" simply
> > will not have the ftrace hook to that function, but that function may
> > very well have tracing triggered inside of it.
> >
> > Functions with "_notrace" in its name (like preempt_disable_notrace())
> > should not have any tracing instrumentation (as Mathieu stated)
> > inside of it, so that it can be used in the tracing infrastructure.
> >
> > raise_softirq_irqoff() has a tracepoint inside of it. If we have the
> > tracing infrastructure call that, and we happen to enable that
> > tracepoint, we will have:
> >
> > raise_softirq_irqoff()
> > trace_softirq_raise()
> > [..]
> > raise_softirq_irqoff()
> > trace_softirq_raise()
> > [..]
> > Ad infinitum!
> >
> > I'm not sure if that's what is being proposed or not, but I just wanted
> > to make sure everyone is aware of the above.
>
> OK, I *think* I might actually understand the problem. Maybe.
>
> I am sure that the usual suspects will not be shy about correcting any
> misapprehensions in the following. ;-)
>
> My guess is that some users of real-time Linux would like to use BPF
> programs while still getting decent latencies out of their systems.
> (Not something I would have predicted, but then again, I was surprised
> some years back to see people with a 4096-CPU system complaining about
> 200-microsecond latency blows from RCU.) And the BPF guys (now CCed)
> made some changes some years back to support this, perhaps most notably
> replacing some uses of preempt_disable() with migrate_disable().
>
> Except that the current __DECLARE_TRACE() macro defeats this work
> for tracepoints by disabling preemption across the tracepoint call,
> which might well be a BPF program. So we need to do something to
> __DECLARE_TRACE() to get the right sort of protection while still leaving
> preemption enabled.
>
> One way of attacking this problem is to use preemptible RCU. The problem
> with this is that although one could construct a trace-safe version
> of rcu_read_unlock(), these would negate some optimizations that Lai
> Jiangshan worked so hard to put in place. Plus those optimizations
> also simplified the code quite a bit. Which is why I was pushing back
> so hard, especially given that I did not realize that real-time systems
> would be running BPF programs concurrently with real-time applications.
> This meant that I was looking for a functional problem with the current
> disabling of preemption, and not finding it.
>
> So another way of dealing with this is to use SRCU-fast, which is
> like SRCU, but dispenses with the smp_mb() calls and the redundant
> read-side array indexing. Plus it is easy to make _notrace variants
> srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
> along with the requisite guards.
>
> Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
> Remove SRCU protection"), and I have hacked together this and the
> prerequisites mentioned in the previous paragraph.
>
> These are passing ridiculously light testing, but probably have at
> least their share of bugs.
>
> But first, do I actually finally understand the problem?
OK, they pass somewhat less ridiculously moderate testing, though I have
not yet hit them over the head with the ftrace selftests.
So might as well post them.
Thoughts?
Thanx, Paul
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-16 22:54 ` Paul E. McKenney
@ 2025-07-17 13:14 ` Mathieu Desnoyers
2025-07-17 14:46 ` Mathieu Desnoyers
` (2 more replies)
2025-07-17 19:04 ` [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers Paul E. McKenney
2025-07-19 0:28 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Paul E. McKenney
2 siblings, 3 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-07-17 13:14 UTC (permalink / raw)
To: paulmck, Steven Rostedt
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On 2025-07-16 18:54, Paul E. McKenney wrote:
> On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
>> On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
>>> On Fri, 11 Jul 2025 10:05:26 -0700
>>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>>>
>>>> This trace point will invoke rcu_read_unlock{,_notrace}(), which will
>>>> note that preemption is disabled. If rcutree.use_softirq is set and
>>>> this task is blocking an expedited RCU grace period, it will directly
>>>> invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
>>>> it will directly invoke the non-notrace function irq_work_queue_on().
>>>
>>> Just to clarify some things; A function annotated by "notrace" simply
>>> will not have the ftrace hook to that function, but that function may
>>> very well have tracing triggered inside of it.
>>>
>>> Functions with "_notrace" in its name (like preempt_disable_notrace())
>>> should not have any tracing instrumentation (as Mathieu stated)
>>> inside of it, so that it can be used in the tracing infrastructure.
>>>
>>> raise_softirq_irqoff() has a tracepoint inside of it. If we have the
>>> tracing infrastructure call that, and we happen to enable that
>>> tracepoint, we will have:
>>>
>>> raise_softirq_irqoff()
>>> trace_softirq_raise()
>>> [..]
>>> raise_softirq_irqoff()
>>> trace_softirq_raise()
>>> [..]
>>> Ad infinitum!
>>>
>>> I'm not sure if that's what is being proposed or not, but I just wanted
>>> to make sure everyone is aware of the above.
>>
>> OK, I *think* I might actually understand the problem. Maybe.
>>
>> I am sure that the usual suspects will not be shy about correcting any
>> misapprehensions in the following. ;-)
>>
>> My guess is that some users of real-time Linux would like to use BPF
>> programs while still getting decent latencies out of their systems.
>> (Not something I would have predicted, but then again, I was surprised
>> some years back to see people with a 4096-CPU system complaining about
>> 200-microsecond latency blows from RCU.) And the BPF guys (now CCed)
>> made some changes some years back to support this, perhaps most notably
>> replacing some uses of preempt_disable() with migrate_disable().
>>
>> Except that the current __DECLARE_TRACE() macro defeats this work
>> for tracepoints by disabling preemption across the tracepoint call,
>> which might well be a BPF program. So we need to do something to
>> __DECLARE_TRACE() to get the right sort of protection while still leaving
>> preemption enabled.
>>
>> One way of attacking this problem is to use preemptible RCU. The problem
>> with this is that although one could construct a trace-safe version
>> of rcu_read_unlock(), these would negate some optimizations that Lai
>> Jiangshan worked so hard to put in place. Plus those optimizations
>> also simplified the code quite a bit. Which is why I was pushing back
>> so hard, especially given that I did not realize that real-time systems
>> would be running BPF programs concurrently with real-time applications.
>> This meant that I was looking for a functional problem with the current
>> disabling of preemption, and not finding it.
>>
>> So another way of dealing with this is to use SRCU-fast, which is
>> like SRCU, but dispenses with the smp_mb() calls and the redundant
>> read-side array indexing. Plus it is easy to make _notrace variants
>> srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
>> along with the requisite guards.
>>
>> Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
>> Remove SRCU protection"), and I have hacked together this and the
>> prerequisites mentioned in the previous paragraph.
>>
>> These are passing ridiculously light testing, but probably have at
>> least their share of bugs.
>>
>> But first, do I actually finally understand the problem?
>
> OK, they pass somewhat less ridiculously moderate testing, though I have
> not yet hit them over the head with the ftrace selftests.
>
> So might as well post them.
>
> Thoughts?
Your explanation of the problem context fits my understanding.
Note that I've mostly been pulled into this by Sebastian, who wanted
to better understand how we could make the tracepoint
instrumentation work with bpf probes that need to sleep due to
locking. Hence my original, somewhat high-level desiderata.
I'm glad this seems to be converging towards a concrete solution.
There are two things I'm wondering:
1) Would we want to always use srcu-fast (for both preempt and
non-preempt kernels?), or is there any downside compared to
preempt-off rcu (e.g. overhead)?
If the overhead is similar when actually used by tracers
(I'm talking about actual workload benchmark and not a
microbenchmark), I would tend to err towards simplicity
and to minimize the number of configurations to test, and
use srcu-fast everywhere.
2) I think I'm late to the party in reviewing srcu-fast, I'll
go have a look :)
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 13:14 ` Mathieu Desnoyers
@ 2025-07-17 14:46 ` Mathieu Desnoyers
2025-07-17 15:18 ` Paul E. McKenney
2025-07-17 14:57 ` Alexei Starovoitov
2025-07-17 15:07 ` Paul E. McKenney
2 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-07-17 14:46 UTC (permalink / raw)
To: paulmck, Steven Rostedt
Cc: Sebastian Andrzej Siewior, Boqun Feng, linux-rt-devel, rcu,
linux-trace-kernel, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Lai Jiangshan, Masami Hiramatsu, Neeraj Upadhyay,
Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On 2025-07-17 09:14, Mathieu Desnoyers wrote:
> On 2025-07-16 18:54, Paul E. McKenney wrote:
[...]
>
> 2) I think I'm late to the party in reviewing srcu-fast, I'll
> go have a look :)
OK, I'll bite. :) Please let me know where I'm missing something:
Looking at srcu-lite and srcu-fast, I understand that they fundamentally
depend on a trick we published here https://lwn.net/Articles/573497/
"The RCU-barrier menagerie" that allows turning, e.g. this Dekker:
volatile int x = 0, y = 0

CPU 0                         CPU 1

x = 1                         y = 1
smp_mb                        smp_mb
r2 = y                        r4 = x

BUG_ON(r2 == 0 && r4 == 0)

into

volatile int x = 0, y = 0

CPU 0                         CPU 1

rcu_read_lock()
x = 1                         y = 1
                              synchronize_rcu()
r2 = y                        r4 = x
rcu_read_unlock()

BUG_ON(r2 == 0 && r4 == 0)
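The second pattern can also be written as an LKMM litmus test for herd7
(tools/memory-model); if the transformation is sound, herd7 should
report the exists clause as never reached:

```
C dekker-rcu

{}

P0(int *x, int *y)
{
	int r2;

	rcu_read_lock();
	WRITE_ONCE(*x, 1);
	r2 = READ_ONCE(*y);
	rcu_read_unlock();
}

P1(int *x, int *y)
{
	int r4;

	WRITE_ONCE(*y, 1);
	synchronize_rcu();
	r4 = READ_ONCE(*x);
}

exists (0:r2=0 /\ 1:r4=0)
```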
So looking at srcu-fast, we have:
* Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
* critical sections either because they disables interrupts, because they
* are a single instruction, or because they are a read-modify-write atomic
* operation, depending on the whims of the architecture.
It appears to be pairing, as RCU read-side:
- irq off/on implied by this_cpu_inc
- atomic
- single instruction
with synchronize_rcu within the grace period, hoping that this behaves
as an smp_mb pairing that prevents the srcu read-side critical section
from leaking out of the srcu read lock/unlock.
I note that there is a validation that rcu_is_watching() within
__srcu_read_lock_fast, but it's one thing to have RCU watching and
another to have an actual read-side critical section. Note that
preemption, irqs, and softirqs can very well be enabled when calling
__srcu_read_lock_fast.
My understanding of how memory barriers implemented with RCU
work is that we need to surround the memory accesses on the fast path
(where we turn smp_mb into barrier) with an RCU read-side critical
section to make sure it does not span across a synchronize_rcu.
What I am missing here is how an RCU read-side critical section that
consists only of the irq off/on, the atomic, or the single instruction
can cover all the memory accesses we are trying to order, namely those
within the srcu critical section after the compiler barrier(). Is
having RCU watching sufficient to guarantee this?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 13:14 ` Mathieu Desnoyers
2025-07-17 14:46 ` Mathieu Desnoyers
@ 2025-07-17 14:57 ` Alexei Starovoitov
2025-07-17 15:12 ` Steven Rostedt
2025-07-17 15:30 ` Paul E. McKenney
2025-07-17 15:07 ` Paul E. McKenney
2 siblings, 2 replies; 29+ messages in thread
From: Alexei Starovoitov @ 2025-07-17 14:57 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Paul E. McKenney, Steven Rostedt, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 6:14 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2025-07-16 18:54, Paul E. McKenney wrote:
> > On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
> >> On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
> >>> On Fri, 11 Jul 2025 10:05:26 -0700
> >>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >>>
> >>>> This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> >>>> note that preemption is disabled. If rcutree.use_softirq is set and
> >>>> this task is blocking an expedited RCU grace period, it will directly
> >>>> invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> >>>> it will directly invoke the non-notrace function irq_work_queue_on().
> >>>
> >>> Just to clarify some things; A function annotated by "notrace" simply
> >>> will not have the ftrace hook to that function, but that function may
> >>> very well have tracing triggered inside of it.
> >>>
> >>> Functions with "_notrace" in its name (like preempt_disable_notrace())
> >>> should not have any tracing instrumentation (as Mathieu stated)
> >>> inside of it, so that it can be used in the tracing infrastructure.
> >>>
> >>> raise_softirq_irqoff() has a tracepoint inside of it. If we have the
> >>> tracing infrastructure call that, and we happen to enable that
> >>> tracepoint, we will have:
> >>>
> >>> raise_softirq_irqoff()
> >>> trace_softirq_raise()
> >>> [..]
> >>> raise_softirq_irqoff()
> >>> trace_softirq_raise()
> >>> [..]
> >>> Ad infinitum!
> >>>
> >>> I'm not sure if that's what is being proposed or not, but I just wanted
> >>> to make sure everyone is aware of the above.
> >>
> >> OK, I *think* I might actually understand the problem. Maybe.
> >>
> >> I am sure that the usual suspects will not be shy about correcting any
> >> misapprehensions in the following. ;-)
> >>
> >> My guess is that some users of real-time Linux would like to use BPF
> >> programs while still getting decent latencies out of their systems.
> >> (Not something I would have predicted, but then again, I was surprised
> >> some years back to see people with a 4096-CPU system complaining about
> >> 200-microsecond latency blows from RCU.) And the BPF guys (now CCed)
> >> made some changes some years back to support this, perhaps most notably
> >> replacing some uses of preempt_disable() with migrate_disable().
> >>
> >> Except that the current __DECLARE_TRACE() macro defeats this work
> >> for tracepoints by disabling preemption across the tracepoint call,
> >> which might well be a BPF program. So we need to do something to
> >> __DECLARE_TRACE() to get the right sort of protection while still leaving
> >> preemption enabled.
> >>
> >> One way of attacking this problem is to use preemptible RCU. The problem
> >> with this is that although one could construct a trace-safe version
> >> of rcu_read_unlock(), these would negate some optimizations that Lai
> >> Jiangshan worked so hard to put in place. Plus those optimizations
> >> also simplified the code quite a bit. Which is why I was pushing back
> >> so hard, especially given that I did not realize that real-time systems
> >> would be running BPF programs concurrently with real-time applications.
> >> This meant that I was looking for a functional problem with the current
> >> disabling of preemption, and not finding it.
> >>
> >> So another way of dealing with this is to use SRCU-fast, which is
> >> like SRCU, but dispenses with the smp_mb() calls and the redundant
> >> read-side array indexing. Plus it is easy to make _notrace variants
> >> srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
> >> along with the requisite guards.
> >>
> >> Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
> >> Remove SRCU protection"), and I have hacked together this and the
> >> prerequisites mentioned in the previous paragraph.
> >>
> >> These are passing ridiculously light testing, but probably have at
> >> least their share of bugs.
> >>
> >> But first, do I actually finally understand the problem?
> >
> > OK, they pass somewhat less ridiculously moderate testing, though I have
> > not yet hit them over the head with the ftrace selftests.
> >
> > So might as well post them.
> >
> > Thoughts?
>
> Your explanation of the problem context fits my understanding.
>
> Note that I've mostly been pulled into this by Sebastian who wanted
> to understand better the how we could make the tracepoint
> instrumentation work with bpf probes that need to sleep due to
> locking. Hence my original somewhat high-level desiderata.
I still don't understand what problem is being solved.
As current tracepoint code stands there is no issue with it at all
on PREEMPT_RT from bpf pov.
bpf progs that attach to tracepoints are not sleepable.
They don't call rt_spinlock either.
Recognizing tracepoints that can sleep/fault and allowing
sleepable bpf progs there is on our to-do list,
but afaik it doesn't need any changes to tracepoint infra.
There is no need to replace existing preempt_disable wrappers
with sleepable srcu_fast or anything else.
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 13:14 ` Mathieu Desnoyers
2025-07-17 14:46 ` Mathieu Desnoyers
2025-07-17 14:57 ` Alexei Starovoitov
@ 2025-07-17 15:07 ` Paul E. McKenney
2 siblings, 0 replies; 29+ messages in thread
From: Paul E. McKenney @ 2025-07-17 15:07 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 09:14:41AM -0400, Mathieu Desnoyers wrote:
> On 2025-07-16 18:54, Paul E. McKenney wrote:
> > On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
> > > > On Fri, 11 Jul 2025 10:05:26 -0700
> > > > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > > >
> > > > > This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> > > > > note that preemption is disabled. If rcutree.use_softirq is set and
> > > > > this task is blocking an expedited RCU grace period, it will directly
> > > > > invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> > > > > it will directly invoke the non-notrace function irq_work_queue_on().
> > > >
> > > > Just to clarify some things; A function annotated by "notrace" simply
> > > > will not have the ftrace hook to that function, but that function may
> > > > very well have tracing triggered inside of it.
> > > >
> > > > Functions with "_notrace" in its name (like preempt_disable_notrace())
> > > > should not have any tracing instrumentation (as Mathieu stated)
> > > > inside of it, so that it can be used in the tracing infrastructure.
> > > >
> > > > raise_softirq_irqoff() has a tracepoint inside of it. If we have the
> > > > tracing infrastructure call that, and we happen to enable that
> > > > tracepoint, we will have:
> > > >
> > > > raise_softirq_irqoff()
> > > > trace_softirq_raise()
> > > > [..]
> > > > raise_softirq_irqoff()
> > > > trace_softirq_raise()
> > > > [..]
> > > > Ad infinitum!
> > > >
> > > > I'm not sure if that's what is being proposed or not, but I just wanted
> > > > to make sure everyone is aware of the above.
> > >
> > > OK, I *think* I might actually understand the problem. Maybe.
> > >
> > > I am sure that the usual suspects will not be shy about correcting any
> > > misapprehensions in the following. ;-)
> > >
> > > My guess is that some users of real-time Linux would like to use BPF
> > > programs while still getting decent latencies out of their systems.
> > > (Not something I would have predicted, but then again, I was surprised
> > > some years back to see people with a 4096-CPU system complaining about
> > > 200-microsecond latency blows from RCU.) And the BPF guys (now CCed)
> > > made some changes some years back to support this, perhaps most notably
> > > replacing some uses of preempt_disable() with migrate_disable().
> > >
> > > Except that the current __DECLARE_TRACE() macro defeats this work
> > > for tracepoints by disabling preemption across the tracepoint call,
> > > which might well be a BPF program. So we need to do something to
> > > __DECLARE_TRACE() to get the right sort of protection while still leaving
> > > preemption enabled.
> > >
> > > One way of attacking this problem is to use preemptible RCU. The problem
> > > with this is that although one could construct a trace-safe version
> > > of rcu_read_unlock(), these would negate some optimizations that Lai
> > > Jiangshan worked so hard to put in place. Plus those optimizations
> > > also simplified the code quite a bit. Which is why I was pushing back
> > > so hard, especially given that I did not realize that real-time systems
> > > would be running BPF programs concurrently with real-time applications.
> > > This meant that I was looking for a functional problem with the current
> > > disabling of preemption, and not finding it.
> > >
> > > So another way of dealing with this is to use SRCU-fast, which is
> > > like SRCU, but dispenses with the smp_mb() calls and the redundant
> > > read-side array indexing. Plus it is easy to make _notrace variants
> > > srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
> > > along with the requisite guards.
> > >
> > > Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
> > > Remove SRCU protection"), and I have hacked together this and the
> > > prerequisites mentioned in the previous paragraph.
> > >
> > > These are passing ridiculously light testing, but probably have at
> > > least their share of bugs.
> > >
> > > But first, do I actually finally understand the problem?
> >
> > OK, they pass somewhat less ridiculously moderate testing, though I have
> > not yet hit them over the head with the ftrace selftests.
> >
> > So might as well post them.
> >
> > Thoughts?
>
> Your explanation of the problem context fits my understanding.
>
> Note that I've mostly been pulled into this by Sebastian who wanted
> to understand better the how we could make the tracepoint
> instrumentation work with bpf probes that need to sleep due to
> locking. Hence my original somewhat high-level desiderata.
>
> I'm glad this seems to be converging towards a concrete solution.
>
> There are two things I'm wondering:
>
> 1) Would we want to always use srcu-fast (for both preempt and
> non-preempt kernels ?), or is there any downside compared to
> preempt-off rcu ? (e.g. overhead ?)
For kernels built with CONFIG_PREEMPT_DYNAMIC=n and either
CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y, non-preemptible
RCU would be faster. I did consider this, but decided to keep the
initial patch simple.
> If the overhead is similar when actually used by tracers
> (I'm talking about actual workload benchmark and not a
> microbenchmark), I would tend to err towards simplicity
> and to minimize the number of configurations to test, and
> use srcu-fast everywhere.
To this point, I was wondering whether it is still necessary to do the
call_rcu() stage, but left it because that is the safe mistake to make.
I am testing a fifth patch that removes the early-boot deferral of
call_srcu() because call_srcu() now does exactly this deferral internally.
> 2) I think I'm late to the party in reviewing srcu-fast, I'll
> go have a look :)
Looking forward to seeing what you come up with!
I deferred one further optimization, namely statically classifying
srcu_struct structures as intended for vanilla, _nmisafe(), or _fast()
use, or at least doing so at initialization time. This would get rid
of the call to srcu_check_read_flavor_force() in srcu_read_lock_fast(),
srcu_read_unlock_fast(), and friends, or at least to tuck it under
CONFIG_PROVE_RCU. On my laptop, this saves an additional 25%, though
that 25% amounts to a big half of a nanosecond.
Thoughts?
Thanx, Paul
> Thanks,
>
> Mathieu
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 14:57 ` Alexei Starovoitov
@ 2025-07-17 15:12 ` Steven Rostedt
2025-07-17 15:27 ` Alexei Starovoitov
2025-07-17 15:30 ` Paul E. McKenney
1 sibling, 1 reply; 29+ messages in thread
From: Steven Rostedt @ 2025-07-17 15:12 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Mathieu Desnoyers, Paul E. McKenney, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, 17 Jul 2025 07:57:27 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> I still don't understand what problem is being solved.
> As current tracepoint code stands there is no issue with it at all
> on PREEMPT_RT from bpf pov.
> bpf progs that attach to tracepoints are not sleepable.
> They don't call rt_spinlock either.
> Recognizing tracepoints that can sleep/fault and allow
> sleepable bpf progs there is on our to do list,
> but afaik it doesn't need any changes to tracepoint infra.
> There is no need to replace existing preempt_disable wrappers
> with sleepable srcu_fast or anything else.
From the PREEMPT_RT point of view, it wants BPF to be preemptable. BPF
may stop migration, but suppose someone adds a long-running BPF program
(where "long running" could be anything more than 10us) and it
executes in the context of a low-priority task. If that BPF program is
not preemptable, it can delay a high-priority task from running. That
defeats the purpose of PREEMPT_RT.
If this is unsolvable, then we will need to make PREEMPT_RT and BPF
mutually exclusive in the configs.
-- Steve
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 14:46 ` Mathieu Desnoyers
@ 2025-07-17 15:18 ` Paul E. McKenney
2025-07-17 19:36 ` Mathieu Desnoyers
2026-01-06 15:08 ` Mathieu Desnoyers
0 siblings, 2 replies; 29+ messages in thread
From: Paul E. McKenney @ 2025-07-17 15:18 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 10:46:46AM -0400, Mathieu Desnoyers wrote:
> On 2025-07-17 09:14, Mathieu Desnoyers wrote:
> > On 2025-07-16 18:54, Paul E. McKenney wrote:
> [...]
> >
> > 2) I think I'm late to the party in reviewing srcu-fast, I'll
> > go have a look :)
>
> OK, I'll bite. :) Please let me know where I'm missing something:
>
> Looking at srcu-lite and srcu-fast, I understand that they fundamentally
> depend on a trick we published here https://lwn.net/Articles/573497/
> "The RCU-barrier menagerie" that allows turning, e.g. this Dekker:
>
> volatile int x = 0, y = 0
>
> CPU 0                 CPU 1
>
> x = 1                 y = 1
> smp_mb                smp_mb
> r2 = y                r4 = x
>
> BUG_ON(r2 == 0 && r4 == 0)
>
> into
>
> volatile int x = 0, y = 0
>
> CPU 0                 CPU 1
>
> rcu_read_lock()
> x = 1                 y = 1
>                       synchronize_rcu()
> r2 = y                r4 = x
> rcu_read_unlock()
>
> BUG_ON(r2 == 0 && r4 == 0)
>
> So looking at srcu-fast, we have:
>
> * Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
> * critical sections either because they disable interrupts, because they
> * are a single instruction, or because they are a read-modify-write atomic
> * operation, depending on the whims of the architecture.
>
> It appears to be pairing, as RCU read-side:
>
> - irq off/on implied by this_cpu_inc
> - atomic
> - single instruction
>
> with synchronize_rcu within the grace period, and hoping that this behaves as an
> smp_mb pairing preventing the srcu read-side critical section from leaking
> out of the srcu read lock/unlock.
>
> I note that there is a validation that rcu_is_watching() within
> __srcu_read_lock_fast, but it's one thing to have rcu watching, but
> another to have an actual read-side critical section. Note that
> preemption, irqs, softirqs can very well be enabled when calling
> __srcu_read_lock_fast.
>
> My understanding of how memory barriers implemented with RCU
> work is that we need to surround the memory accesses on the fast-path
> (where we turn smp_mb into barrier) with an RCU read-side critical
> section to make sure it does not span across a synchronize_rcu.
>
> What I am missing here is how an RCU read-side critical section that
> consists only of the irq off/on, atomic, or single instruction can
> cover all memory accesses we are trying to order, namely those within
> the srcu critical section after the compiler barrier()? Is having RCU
> watching sufficient to guarantee this?
Good eyes!!!
The trick is that this "RCU read-side critical section" consists only of
either this_cpu_inc() or atomic_long_inc(), with the latter only happening
in systems that have NMIs, but don't have NMI-safe per-CPU operations.
Neither this_cpu_inc() nor atomic_long_inc() can be interrupted, and
thus both act as an interrupts-disabled RCU read-side critical section.
Therefore, if the SRCU grace-period computation fails to see an
srcu_read_lock_fast() increment, its earlier code is guaranteed to
happen before the corresponding critical section. Similarly, if the SRCU
grace-period computation sees an srcu_read_unlock_fast(), its subsequent
code is guaranteed to happen after the corresponding critical section.
Does that help? If so, would you be interested in nominating a comment?
Or am I missing something subtle here?
Either way, many thanks for digging into this!!!
Thanx, Paul
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:12 ` Steven Rostedt
@ 2025-07-17 15:27 ` Alexei Starovoitov
2025-07-17 15:40 ` Steven Rostedt
2025-07-17 15:44 ` Paul E. McKenney
0 siblings, 2 replies; 29+ messages in thread
From: Alexei Starovoitov @ 2025-07-17 15:27 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Paul E. McKenney, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 8:12 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 17 Jul 2025 07:57:27 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > I still don't understand what problem is being solved.
> > As current tracepoint code stands there is no issue with it at all
> > on PREEMPT_RT from bpf pov.
> > bpf progs that attach to tracepoints are not sleepable.
> > They don't call rt_spinlock either.
> > Recognizing tracepoints that can sleep/fault and allow
> > sleepable bpf progs there is on our to do list,
> > but afaik it doesn't need any changes to tracepoint infra.
> > There is no need to replace existing preempt_disable wrappers
> > with sleepable srcu_fast or anything else.
>
> From the PREEMPT_RT point of view, it wants BPF to be preemptable. It
> may stop migration, but if someone adds a long running BPF program
> (when I say long running, it could be anything more than 10us), and it
> executes on a low priority task. If that BPF program is not preemptable
> it can delay a high priority task from running. That defeats the
> purpose of PREEMPT_RT.
>
> If this is unsolvable, then we will need to make PREEMPT_RT and BPF
> mutually exclusive in the configs.
Stop this fud, please.
bpf progs were preemptible for years and had no issue in RT.
tracepoints are using preempt_disable() still and that's a
tracepoint infra problem. Nothing to do with users of tracepoints.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 14:57 ` Alexei Starovoitov
2025-07-17 15:12 ` Steven Rostedt
@ 2025-07-17 15:30 ` Paul E. McKenney
1 sibling, 0 replies; 29+ messages in thread
From: Paul E. McKenney @ 2025-07-17 15:30 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Mathieu Desnoyers, Steven Rostedt, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 07:57:27AM -0700, Alexei Starovoitov wrote:
> On Thu, Jul 17, 2025 at 6:14 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> > On 2025-07-16 18:54, Paul E. McKenney wrote:
> > > On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
> > >> On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
> > >>> On Fri, 11 Jul 2025 10:05:26 -0700
> > >>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > >>>
> > >>>> This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> > >>>> note that preemption is disabled. If rcutree.use_softirq is set and
> > >>>> this task is blocking an expedited RCU grace period, it will directly
> > >>>> invoke the non-notrace function raise_softirq_irqoff(). Otherwise,
> > >>>> it will directly invoke the non-notrace function irq_work_queue_on().
> > >>>
> > >>> Just to clarify some things; A function annotated by "notrace" simply
> > >>> will not have the ftrace hook to that function, but that function may
> > >>> very well have tracing triggered inside of it.
> > >>>
> > >>> Functions with "_notrace" in its name (like preempt_disable_notrace())
> > >>> should not have any tracing instrumentation (as Mathieu stated)
> > >>> inside of it, so that it can be used in the tracing infrastructure.
> > >>>
> > >>> raise_softirq_irqoff() has a tracepoint inside of it. If we have the
> > >>> tracing infrastructure call that, and we happen to enable that
> > >>> tracepoint, we will have:
> > >>>
> > >>> raise_softirq_irqoff()
> > >>> trace_softirq_raise()
> > >>> [..]
> > >>> raise_softirq_irqoff()
> > >>> trace_softirq_raise()
> > >>> [..]
> > >>> Ad infinitum!
> > >>>
> > >>> I'm not sure if that's what is being proposed or not, but I just wanted
> > >>> to make sure everyone is aware of the above.
> > >>
> > >> OK, I *think* I might actually understand the problem. Maybe.
> > >>
> > >> I am sure that the usual suspects will not be shy about correcting any
> > >> misapprehensions in the following. ;-)
> > >>
> > >> My guess is that some users of real-time Linux would like to use BPF
> > >> programs while still getting decent latencies out of their systems.
> > >> (Not something I would have predicted, but then again, I was surprised
> > >> some years back to see people with a 4096-CPU system complaining about
> > >> 200-microsecond latency blows from RCU.) And the BPF guys (now CCed)
> > >> made some changes some years back to support this, perhaps most notably
> > >> replacing some uses of preempt_disable() with migrate_disable().
> > >>
> > >> Except that the current __DECLARE_TRACE() macro defeats this work
> > >> for tracepoints by disabling preemption across the tracepoint call,
> > >> which might well be a BPF program. So we need to do something to
> > >> __DECLARE_TRACE() to get the right sort of protection while still leaving
> > >> preemption enabled.
> > >>
> > >> One way of attacking this problem is to use preemptible RCU. The problem
> > >> with this is that although one could construct a trace-safe version
> > >> of rcu_read_unlock(), these would negate some optimizations that Lai
> > >> Jiangshan worked so hard to put in place. Plus those optimizations
> > >> also simplified the code quite a bit. Which is why I was pushing back
> > >> so hard, especially given that I did not realize that real-time systems
> > >> would be running BPF programs concurrently with real-time applications.
> > >> This meant that I was looking for a functional problem with the current
> > >> disabling of preemption, and not finding it.
> > >>
> > >> So another way of dealing with this is to use SRCU-fast, which is
> > >> like SRCU, but dispenses with the smp_mb() calls and the redundant
> > >> read-side array indexing. Plus it is easy to make _notrace variants
> > >> srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
> > >> along with the requisite guards.
> > >>
> > >> Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
> > >> Remove SRCU protection"), and I have hacked together this and the
> > >> prerequisites mentioned in the previous paragraph.
> > >>
> > >> These are passing ridiculously light testing, but probably have at
> > >> least their share of bugs.
> > >>
> > >> But first, do I actually finally understand the problem?
> > >
> > > OK, they pass somewhat less ridiculously moderate testing, though I have
> > > not yet hit them over the head with the ftrace selftests.
> > >
> > > So might as well post them.
> > >
> > > Thoughts?
> >
> > Your explanation of the problem context fits my understanding.
> >
> > Note that I've mostly been pulled into this by Sebastian who wanted
> to better understand how we could make the tracepoint
> > instrumentation work with bpf probes that need to sleep due to
> > locking. Hence my original somewhat high-level desiderata.
>
> I still don't understand what problem is being solved.
> As current tracepoint code stands there is no issue with it at all
> on PREEMPT_RT from bpf pov.
> bpf progs that attach to tracepoints are not sleepable.
> They don't call rt_spinlock either.
> Recognizing tracepoints that can sleep/fault and allow
> sleepable bpf progs there is on our to do list,
> but afaik it doesn't need any changes to tracepoint infra.
> There is no need to replace existing preempt_disable wrappers
> with sleepable srcu_fast or anything else.
As I understand it, functionally there is no problem.
And if the BPF program has been attached to preemption-disabled code,
there is also no unnecessary damage to real-time latency. Preemption is
presumably disabled for a reason, and that reason must be respected.
Presumably, people using BPF in real-time systems are careful either to
avoid attaching BPF programs to non-preemptible code or to carefully
limit the runtime of those BPF programs.
But if the BPF program has been attached to code having preemption
enabled, then the current call to guard(preempt_notrace)() within
__DECLARE_TRACE() means that the entire BPF program is running with
preemption disabled, and for no good reason.
Although this is not a functional issue, it is a first-class bug in
terms of needlessly degrading real-time latency.
And even though it isn't what Sebastian originally asked Mathieu for
help with, it still needs to be fixed.
My current offer is replacing that guard(preempt_notrace)() with its
SRCU-fast counterpart of guard(srcu_fast_notrace)(&tracepoint_srcu).
Light testing is going well thus far, though I clearly need the real-time
guys to both review and test.
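In __DECLARE_TRACE() terms, that swap would look roughly like this
(hypothetical sketch only; the real macro has more moving parts, and the
static_branch line here is approximate):

```diff
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@
 	static inline void trace_##name(proto)				\
 	{								\
-		guard(preempt_notrace)();				\
+		guard(srcu_fast_notrace)(&tracepoint_srcu);		\
 		if (static_branch_unlikely(&__tracepoint_##name.key))	\
 			__DO_TRACE(name, args, cond);			\
 	}
```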
I am sure that Mathieu, Sebastian, and the rest won't be shy about
correcting any confusion on my part. ;-)
Thanx, Paul
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:27 ` Alexei Starovoitov
@ 2025-07-17 15:40 ` Steven Rostedt
2025-07-17 15:55 ` Steven Rostedt
2025-07-17 15:44 ` Paul E. McKenney
1 sibling, 1 reply; 29+ messages in thread
From: Steven Rostedt @ 2025-07-17 15:40 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Mathieu Desnoyers, Paul E. McKenney, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, 17 Jul 2025 08:27:24 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Thu, Jul 17, 2025 at 8:12 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Thu, 17 Jul 2025 07:57:27 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> > > I still don't understand what problem is being solved.
> > > As current tracepoint code stands there is no issue with it at all
> > > on PREEMPT_RT from bpf pov.
> > > bpf progs that attach to tracepoints are not sleepable.
>
> Stop this fud, please.
No fud. Above you said they are not sleepable. I guess that just means
they don't call schedule?
>
> bpf progs were preemptible for years and had no issue in RT.
> tracepoints are using preempt_disable() still and that's a
> tracepoint infra problem. Nothing to do with users of tracepoints.
Yes, it is a tracepoint infra problem that we are trying to solve. The
reason we are trying to solve it is because BPF programs can extend the
time a tracepoint takes. If anything else extended the time, this would
need to be solved as well. But currently it's only BPF programs that
cause the issue.
Let me explain the problem being solved then.
Tracepoints currently disable preemption with preempt_disable_notrace()
to allow things to be synchronized correctly. But with the disabling of
preemption for the users of tracepoints (BPF programs being one of
them), it can result in large latency for RT tasks.
The request is to use RCU instead as RCU in PREEMPT_RT can be
preempted. That means we need a rcu_read_lock_notrace() version. One
that doesn't call into any tracing infrastructure (like current
rcu_read_lock() does), otherwise there could be an infinite recursion.
The discussion at hand is how we come up with a
rcu_read_lock_notrace(), and using srcu_read_lock_fast() may be one of
the solutions, as it can be implemented without triggering tracing
infrastructure.
Does that explain it better?
-- Steve
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:27 ` Alexei Starovoitov
2025-07-17 15:40 ` Steven Rostedt
@ 2025-07-17 15:44 ` Paul E. McKenney
1 sibling, 0 replies; 29+ messages in thread
From: Paul E. McKenney @ 2025-07-17 15:44 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Steven Rostedt, Mathieu Desnoyers, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 08:27:24AM -0700, Alexei Starovoitov wrote:
> On Thu, Jul 17, 2025 at 8:12 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Thu, 17 Jul 2025 07:57:27 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> > > I still don't understand what problem is being solved.
> > > As current tracepoint code stands there is no issue with it at all
> > > on PREEMPT_RT from bpf pov.
> > > bpf progs that attach to tracepoints are not sleepable.
> > > They don't call rt_spinlock either.
> > > Recognizing tracepoints that can sleep/fault and allow
> > > sleepable bpf progs there is on our to do list,
> > > but afaik it doesn't need any changes to tracepoint infra.
> > > There is no need to replace existing preempt_disable wrappers
> > > with sleepable srcu_fast or anything else.
> >
> > From the PREEMPT_RT point of view, it wants BPF to be preemptable. It
> > may stop migration, but if someone adds a long running BPF program
> > (when I say long running, it could be anything more than 10us), and it
> > executes on a low priority task. If that BPF program is not preemptable
> > it can delay a high priority task from running. That defeats the
> > purpose of PREEMPT_RT.
> >
> > If this is unsolvable, then we will need to make PREEMPT_RT and BPF
> > mutually exclusive in the configs.
>
> Stop this fud, please.
>
> bpf progs were preemptible for years and had no issue in RT.
> tracepoints are using preempt_disable() still and that's a
> tracepoint infra problem. Nothing to do with users of tracepoints.
And the tracepoint infrastructure is in fact where my proposed fix
is located.
To be fair, several upgrades to SRCU-fast were required to handle this
particular use case.
But to Steven's point, if there was no feasible fix, wherever that fix
might be, then users of real-time Linux would (at best!) need to be *very*
careful about how they used BPF. In fact, those having safety-critical
applications might well choose to turn BPF off entirely, just to prevent
accidents.
Thanx, Paul
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:40 ` Steven Rostedt
@ 2025-07-17 15:55 ` Steven Rostedt
2025-07-17 16:02 ` Alexei Starovoitov
2025-07-17 16:04 ` Paul E. McKenney
0 siblings, 2 replies; 29+ messages in thread
From: Steven Rostedt @ 2025-07-17 15:55 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Mathieu Desnoyers, Paul E. McKenney, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, 17 Jul 2025 11:40:28 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> Yes, it is a tracepoint infra problem that we are trying to solve. The
> reason we are trying to solve it is because BPF programs can extend the
> time a tracepoint takes. If anything else extended the time, this would
> need to be solved as well. But currently it's only BPF programs that
> cause the issue.
BTW, if we can't solve this issue and something else came along and
attached to tracepoints that caused unbounded latency, I would also
argue that whatever came along would need to be prevented from being
configured with PREEMPT_RT. My comment wasn't a strike against BPF
programs; It was a strike against something adding unbounded latency
into a critical section that has preemption disabled.
-- Steve
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:55 ` Steven Rostedt
@ 2025-07-17 16:02 ` Alexei Starovoitov
2025-07-17 16:19 ` Steven Rostedt
2025-07-17 17:38 ` Mathieu Desnoyers
2025-07-17 16:04 ` Paul E. McKenney
1 sibling, 2 replies; 29+ messages in thread
From: Alexei Starovoitov @ 2025-07-17 16:02 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Paul E. McKenney, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 8:55 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 17 Jul 2025 11:40:28 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > Yes, it is a tracepoint infra problem that we are trying to solve. The
> > reason we are trying to solve it is because BPF programs can extend the
> > time a tracepoint takes. If anything else extended the time, this would
> > need to be solved as well. But currently it's only BPF programs that
> > cause the issue.
>
> BTW, if we can't solve this issue and something else came along and
> attached to tracepoints that caused unbounded latency, I would also
> argue that whatever came along would need to be prevented from being
> configured with PREEMPT_RT. My comment wasn't a strike against BPF
> programs; It was a strike against something adding unbounded latency
> into a critical section that has preemption disabled.
Stop blaming the users. Tracepoints disable preemption for no
good reason and you keep shifting the blame.
Fix tracepoint infra.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:55 ` Steven Rostedt
2025-07-17 16:02 ` Alexei Starovoitov
@ 2025-07-17 16:04 ` Paul E. McKenney
1 sibling, 0 replies; 29+ messages in thread
From: Paul E. McKenney @ 2025-07-17 16:04 UTC (permalink / raw)
To: Steven Rostedt
Cc: Alexei Starovoitov, Mathieu Desnoyers, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 11:55:10AM -0400, Steven Rostedt wrote:
> On Thu, 17 Jul 2025 11:40:28 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > Yes, it is a tracepoint infra problem that we are trying to solve. The
> > reason we are trying to solve it is because BPF programs can extend the
> > time a tracepoint takes. If anything else extended the time, this would
> > need to be solved as well. But currently it's only BPF programs that
> > cause the issue.
>
> BTW, if we can't solve this issue and something else came along and
> attached to tracepoints that caused unbounded latency, I would also
> argue that whatever came along would need to be prevented from being
> configured with PREEMPT_RT. My comment wasn't a strike against BPF
> programs; It was a strike against something adding unbounded latency
> into a critical section that has preemption disabled.
I would imagine that many PREEMPT_RT installations would have a severe
and heavy-weight review, test, and approval process for BPF programs
intended for use in production systems...
Thanx, Paul
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 16:02 ` Alexei Starovoitov
@ 2025-07-17 16:19 ` Steven Rostedt
2025-07-17 17:38 ` Mathieu Desnoyers
1 sibling, 0 replies; 29+ messages in thread
From: Steven Rostedt @ 2025-07-17 16:19 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Mathieu Desnoyers, Paul E. McKenney, Sebastian Andrzej Siewior,
Boqun Feng, linux-rt-devel, rcu, linux-trace-kernel,
Frederic Weisbecker, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Masami Hiramatsu, Neeraj Upadhyay, Thomas Gleixner,
Uladzislau Rezki, Zqiang, bpf
On Thu, 17 Jul 2025 09:02:38 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> Stop blaming the users. Tracepoints disable preemption for no
> good reason and you keep shifting the blame.
> Fix tracepoint infra.
The fix *is* to have tracepoints not disable preemption. Your first
reply said:
There is no need to replace existing preempt_disable wrappers
with sleepable srcu_fast or anything else.
Which is it? Replacing the existing preempt_disable wrappers is the
solution, but you said it isn't needed.
Those using PREEMPT_RT look to blame the users that cause unbounded
latency. When you design an RT system, you look at what causes
unbounded latency and disable it.
Blaming the users is part of the process!
If BPF programs attached to tracepoints cannot be preempted, it means
it must not be used with PREEMPT_RT. FULL STOP!
It doesn't mean BPF programs can't be used. It just can't be used when
you care about bounded latency.
There are even people looking into configuring the kernel so that if
anything gets called that can cause unbounded latency, it would
trigger BUG(), as crashing the kernel in safety-critical systems is
actually better than missing a deadline.
-- Steve
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 16:02 ` Alexei Starovoitov
2025-07-17 16:19 ` Steven Rostedt
@ 2025-07-17 17:38 ` Mathieu Desnoyers
1 sibling, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-07-17 17:38 UTC (permalink / raw)
To: Alexei Starovoitov, Steven Rostedt
Cc: Paul E. McKenney, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On 2025-07-17 12:02, Alexei Starovoitov wrote:
> On Thu, Jul 17, 2025 at 8:55 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> On Thu, 17 Jul 2025 11:40:28 -0400
>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>>> Yes, it is a tracepoint infra problem that we are trying to solve. The
>>> reason we are trying to solve it is because BPF programs can extend the
>>> time a tracepoint takes. If anything else extended the time, this would
>>> need to be solved as well. But currently it's only BPF programs that
>>> cause the issue.
>>
>> BTW, if we can't solve this issue and something else came along and
>> attached to tracepoints that caused unbounded latency, I would also
>> argue that whatever came along would need to be prevented from being
>> configured with PREEMPT_RT. My comment wasn't a strike against BPF
>> programs; It was a strike against something adding unbounded latency
>> into a critical section that has preemption disabled.
>
> Stop blaming the users. Tracepoints disable preemption for now
> good reason and you keep shifting the blame.
> Fix tracepoint infra.
I think we're all agreeing here that it's the tracepoint instrumentation
infrastructure that needs to be changed to become more RT friendly, and
that BPF is merely the tracepoint user that happens to have the longest
running callbacks at the moment.
So AFAIU there is no blame on BPF here, and no expectation of any
changes in BPF either.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers
2025-07-16 22:54 ` Paul E. McKenney
2025-07-17 13:14 ` Mathieu Desnoyers
@ 2025-07-17 19:04 ` Paul E. McKenney
2025-07-17 19:19 ` Steven Rostedt
2025-07-19 0:28 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Paul E. McKenney
2 siblings, 1 reply; 29+ messages in thread
From: Paul E. McKenney @ 2025-07-17 19:04 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
This adds the usual scoped_guard(srcu_fast, &my_srcu) and
guard(srcu_fast)(&my_srcu).
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
srcu.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 0aa2376cca0b1..ada65b58bc4c5 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -510,6 +510,11 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_struct,
srcu_read_unlock(_T->lock, _T->idx),
int idx)
+DEFINE_LOCK_GUARD_1(srcu_fast, struct srcu_struct,
+ _T->scp = srcu_read_lock_fast(_T->lock),
+ srcu_read_unlock_fast(_T->lock, _T->scp),
+ struct srcu_ctr __percpu *scp)
+
DEFINE_LOCK_GUARD_1(srcu_fast_notrace, struct srcu_struct,
_T->scp = srcu_read_lock_fast_notrace(_T->lock),
srcu_read_unlock_fast_notrace(_T->lock, _T->scp),
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers
2025-07-17 19:04 ` [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers Paul E. McKenney
@ 2025-07-17 19:19 ` Steven Rostedt
2025-07-17 19:51 ` Paul E. McKenney
0 siblings, 1 reply; 29+ messages in thread
From: Steven Rostedt @ 2025-07-17 19:19 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, 17 Jul 2025 12:04:46 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:
> This adds the usual scoped_guard(srcu_fast, &my_srcu) and
> guard(srcu_fast)(&my_srcu).
>
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-- Steve
> ---
> srcu.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index 0aa2376cca0b1..ada65b58bc4c5 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -510,6 +510,11 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_struct,
> srcu_read_unlock(_T->lock, _T->idx),
> int idx)
>
> +DEFINE_LOCK_GUARD_1(srcu_fast, struct srcu_struct,
> + _T->scp = srcu_read_lock_fast(_T->lock),
> + srcu_read_unlock_fast(_T->lock, _T->scp),
> + struct srcu_ctr __percpu *scp)
> +
> DEFINE_LOCK_GUARD_1(srcu_fast_notrace, struct srcu_struct,
> _T->scp = srcu_read_lock_fast_notrace(_T->lock),
> srcu_read_unlock_fast_notrace(_T->lock, _T->scp),
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:18 ` Paul E. McKenney
@ 2025-07-17 19:36 ` Mathieu Desnoyers
2025-07-17 21:27 ` Paul E. McKenney
2026-01-07 1:26 ` Joel Fernandes
2026-01-06 15:08 ` Mathieu Desnoyers
1 sibling, 2 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-07-17 19:36 UTC (permalink / raw)
To: paulmck
Cc: Steven Rostedt, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On 2025-07-17 11:18, Paul E. McKenney wrote:
> On Thu, Jul 17, 2025 at 10:46:46AM -0400, Mathieu Desnoyers wrote:
>> On 2025-07-17 09:14, Mathieu Desnoyers wrote:
>>> On 2025-07-16 18:54, Paul E. McKenney wrote:
>> [...]
>>>
>>> 2) I think I'm late to the party in reviewing srcu-fast, I'll
>>> go have a look :)
>>
>> OK, I'll bite. :) Please let me know where I'm missing something:
>>
>> Looking at srcu-lite and srcu-fast, I understand that they fundamentally
>> depend on a trick we published here https://lwn.net/Articles/573497/
>> "The RCU-barrier menagerie" that allows turning, e.g. this Dekker:
>>
>> volatile int x = 0, y = 0
>>
>> CPU 0 CPU 1
>>
>> x = 1 y = 1
>> smp_mb smp_mb
>> r2 = y r4 = x
>>
>> BUG_ON(r2 == 0 && r4 == 0)
>>
>> into
>>
>> volatile int x = 0, y = 0
>>
>> CPU 0 CPU 1
>>
>> rcu_read_lock()
>> x = 1 y = 1
>> synchronize_rcu()
>> r2 = y r4 = x
>> rcu_read_unlock()
>>
>> BUG_ON(r2 == 0 && r4 == 0)
>>
>> So looking at srcu-fast, we have:
>>
>> * Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
>> * critical sections either because they disables interrupts, because they
>> * are a single instruction, or because they are a read-modify-write atomic
>> * operation, depending on the whims of the architecture.
>>
>> It appears to be pairing, as RCU read-side:
>>
>> - irq off/on implied by this_cpu_inc
>> - atomic
>> - single instruction
>>
>> with synchronize_rcu within the grace period, and hope that this behaves as a
>> smp_mb pairing preventing the srcu read-side critical section from leaking
>> out of the srcu read lock/unlock.
>>
>> I note that there is a validation that rcu_is_watching() within
>> __srcu_read_lock_fast, but it's one thing to have rcu watching, but
>> another to have an actual read-side critical section. Note that
>> preemption, irqs, softirqs can very well be enabled when calling
>> __srcu_read_lock_fast.
>>
>> My understanding of how memory barriers implemented with RCU
>> work is that we need to surround the memory accesses on the fast-path
>> (where we turn smp_mb into barrier) with an RCU read-side critical
>> section to make sure it does not span across a synchronize_rcu.
>>
>> What I am missing here is how an RCU read-side that consists only
>> of the irq off/on or atomic or single instruction can cover all memory
>> accesses we are trying to order, namely those within the srcu
>> critical section after the compiler barrier() ? Is having RCU
>> watching sufficient to guarantee this ?
>
> Good eyes!!!
>
> The trick is that this "RCU read-side critical section" consists only of
> either this_cpu_inc() or atomic_long_inc(), with the latter only happening
> in systems that have NMIs, but don't have NMI-safe per-CPU operations.
> Neither this_cpu_inc() nor atomic_long_inc() can be interrupted, and
> thus both act as an interrupts-disabled RCU read-side critical section.
>
> Therefore, if the SRCU grace-period computation fails to see an
> srcu_read_lock_fast() increment, its earlier code is guaranteed to
> happen before the corresponding critical section. Similarly, if the SRCU
> grace-period computation sees an srcu_read_unlock_fast(), its subsequent
> code is guaranteed to happen after the corresponding critical section.
>
> Does that help? If so, would you be interested in nominating a comment?
>
> Or am I missing something subtle here?
Here is the root of my concern: considering a single instruction
as an RCU-barrier "read-side" for a classic Dekker would not work,
because the read-side would not cover both memory accesses that need
to be ordered.
I cannot help but notice the similarity between this pattern of
barrier vs synchronize_rcu and what we allow userspace to do with
barrier vs sys_membarrier, which has one implementation
based on synchronize_rcu (except for TICK_NOHZ_FULL). Originally
when membarrier was introduced, this was based on synchronize_sched(),
and I recall that this was OK because userspace execution acted as
a read-side critical section from the perspective of synchronize_sched().
As commented in kernel v4.10 near synchronize_sched():
* Note that this guarantee implies further memory-ordering guarantees.
* On systems with more than one CPU, when synchronize_sched() returns,
* each CPU is guaranteed to have executed a full memory barrier since the
* end of its last RCU-sched read-side critical section whose beginning
* preceded the call to synchronize_sched(). In addition, each CPU having
* an RCU read-side critical section that extends beyond the return from
* synchronize_sched() is guaranteed to have executed a full memory barrier
* after the beginning of synchronize_sched() and before the beginning of
* that RCU read-side critical section. Note that these guarantees include
* CPUs that are offline, idle, or executing in user mode, as well as CPUs
* that are executing in the kernel.
So even though I see how synchronize_rcu() nowadays is still a good
choice to implement sys_membarrier, it only applies to RCU read-side
critical sections, which cover userspace code and the specific
read-side critical sections in the kernel.
But what I don't get is how synchronize_rcu() can help us promote
the barrier() in SRCU-fast to smp_mb when outside of any RCU read-side
critical section tracked by the synchronize_rcu grace period,
mainly because unlike the sys_membarrier scenario, this is *not*
userspace code.
And what we want to order here on the read-side is the lock/unlock
increments vs the memory accesses within the critical section, but
there is no RCU read-side critical section that contains all those
memory accesses to match those synchronize_rcu calls, so the promotion
from barrier() to smp_mb() does not appear to be valid.
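
For concreteness, the pattern in question can be written as an LKMM
litmus test with no explicit read-side markings (the test name is made
up for illustration; herd7 reports "Sometimes" for this form, since LKMM
does not model interrupts or the implicit readers they imply):

```
C SB+o-b-o+o-sr-o

{
}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	barrier();
	r2 = READ_ONCE(*y);
}

P1(int *x, int *y)
{
	WRITE_ONCE(*y, 1);
	synchronize_rcu();
	r4 = READ_ONCE(*x);
}

exists (0:r2=0 /\ 1:r4=0)
```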
But perhaps there is something more that is specific to the SRCU
algorithm that I am missing here?
Thanks,
Mathieu
>
> Either way, many thanks for digging into this!!!
>
> Thanx, Paul
>
>> Thanks,
>>
>> Mathieu
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers
2025-07-17 19:19 ` Steven Rostedt
@ 2025-07-17 19:51 ` Paul E. McKenney
2025-07-17 19:56 ` Steven Rostedt
0 siblings, 1 reply; 29+ messages in thread
From: Paul E. McKenney @ 2025-07-17 19:51 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 03:19:34PM -0400, Steven Rostedt wrote:
> On Thu, 17 Jul 2025 12:04:46 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > This adds the usual scoped_guard(srcu_fast, &my_srcu) and
> > guard(srcu_fast)(&my_srcu).
> >
> > Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Thank you! I will apply this on my next rebase.
The normal process puts this into the v6.18 merge window, that is, the
merge window after the upcoming one. If you need it sooner, please let
us know.
Thanx, Paul
> -- Steve
>
> > ---
> > srcu.h | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> > index 0aa2376cca0b1..ada65b58bc4c5 100644
> > --- a/include/linux/srcu.h
> > +++ b/include/linux/srcu.h
> > @@ -510,6 +510,11 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_struct,
> > srcu_read_unlock(_T->lock, _T->idx),
> > int idx)
> >
> > +DEFINE_LOCK_GUARD_1(srcu_fast, struct srcu_struct,
> > + _T->scp = srcu_read_lock_fast(_T->lock),
> > + srcu_read_unlock_fast(_T->lock, _T->scp),
> > + struct srcu_ctr __percpu *scp)
> > +
> > DEFINE_LOCK_GUARD_1(srcu_fast_notrace, struct srcu_struct,
> > _T->scp = srcu_read_lock_fast_notrace(_T->lock),
> > srcu_read_unlock_fast_notrace(_T->lock, _T->scp),
>
* Re: [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers
2025-07-17 19:51 ` Paul E. McKenney
@ 2025-07-17 19:56 ` Steven Rostedt
2025-07-17 20:38 ` Paul E. McKenney
0 siblings, 1 reply; 29+ messages in thread
From: Steven Rostedt @ 2025-07-17 19:56 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, 17 Jul 2025 12:51:29 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:
> Thank you! I will apply this on my next rebase.
>
> The normal process puts this into the v6.18 merge window, that is, the
> merge window after the upcoming one. If you need it sooner, please let
> us know.
I'd suggest 6.17. I can understand the delay for updates to the RCU
logic, given how subtle it can be and the extra testing it requires. But
adding a guard() helper is pretty trivial and useful to have. I
wouldn't delay it.
Thanks,
-- Steve
* Re: [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers
2025-07-17 19:56 ` Steven Rostedt
@ 2025-07-17 20:38 ` Paul E. McKenney
0 siblings, 0 replies; 29+ messages in thread
From: Paul E. McKenney @ 2025-07-17 20:38 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 03:56:38PM -0400, Steven Rostedt wrote:
> On Thu, 17 Jul 2025 12:51:29 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > Thank you! I will apply this on my next rebase.
> >
> > The normal process puts this into the v6.18 merge window, that is, the
> > merge window after the upcoming one. If you need it sooner, please let
> > us know.
>
> I'd suggest 6.17. I can understand the delay for updates to the RCU
> logic, given how subtle it can be and the extra testing it requires. But
> adding a guard() helper is pretty trivial and useful to have. I
> wouldn't delay it.
OK thank you, and let's see what we can do in v6.17.
Thanx, Paul
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 19:36 ` Mathieu Desnoyers
@ 2025-07-17 21:27 ` Paul E. McKenney
2026-01-07 1:26 ` Joel Fernandes
1 sibling, 0 replies; 29+ messages in thread
From: Paul E. McKenney @ 2025-07-17 21:27 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 03:36:46PM -0400, Mathieu Desnoyers wrote:
> On 2025-07-17 11:18, Paul E. McKenney wrote:
> > On Thu, Jul 17, 2025 at 10:46:46AM -0400, Mathieu Desnoyers wrote:
> > > On 2025-07-17 09:14, Mathieu Desnoyers wrote:
> > > > On 2025-07-16 18:54, Paul E. McKenney wrote:
> > > [...]
> > > >
> > > > 2) I think I'm late to the party in reviewing srcu-fast, I'll
> > > > go have a look :)
> > >
> > > OK, I'll bite. :) Please let me know where I'm missing something:
> > >
> > > Looking at srcu-lite and srcu-fast, I understand that they fundamentally
> > > depend on a trick we published here https://lwn.net/Articles/573497/
> > > "The RCU-barrier menagerie" that allows turning, e.g. this Dekker:
> > >
> > > volatile int x = 0, y = 0
> > >
> > > CPU 0 CPU 1
> > >
> > > x = 1 y = 1
> > > smp_mb smp_mb
> > > r2 = y r4 = x
> > >
> > > BUG_ON(r2 == 0 && r4 == 0)
> > >
> > > into
> > >
> > > volatile int x = 0, y = 0
> > >
> > > CPU 0 CPU 1
> > >
> > > rcu_read_lock()
> > > x = 1 y = 1
> > > synchronize_rcu()
> > > r2 = y r4 = x
> > > rcu_read_unlock()
> > >
> > > BUG_ON(r2 == 0 && r4 == 0)
> > >
> > > So looking at srcu-fast, we have:
> > >
> > > * Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
> > > * critical sections either because they disables interrupts, because they
> > > * are a single instruction, or because they are a read-modify-write atomic
> > > * operation, depending on the whims of the architecture.
> > >
> > > It appears to be pairing, as RCU read-side:
> > >
> > > - irq off/on implied by this_cpu_inc
> > > - atomic
> > > - single instruction
> > >
> > > with synchronize_rcu within the grace period, and hope that this behaves as a
> > > smp_mb pairing preventing the srcu read-side critical section from leaking
> > > out of the srcu read lock/unlock.
> > >
> > > I note that there is a validation that rcu_is_watching() within
> > > __srcu_read_lock_fast, but it's one thing to have rcu watching, but
> > > another to have an actual read-side critical section. Note that
> > > preemption, irqs, softirqs can very well be enabled when calling
> > > __srcu_read_lock_fast.
> > >
> > > My understanding of how memory barriers implemented with RCU
> > > work is that we need to surround the memory accesses on the fast-path
> > > (where we turn smp_mb into barrier) with an RCU read-side critical
> > > section to make sure it does not span across a synchronize_rcu.
> > >
> > > What I am missing here is how an RCU read-side that consists only
> > > of the irq off/on or atomic or single instruction can cover all memory
> > > accesses we are trying to order, namely those within the srcu
> > > critical section after the compiler barrier() ? Is having RCU
> > > watching sufficient to guarantee this ?
> >
> > Good eyes!!!
> >
> > The trick is that this "RCU read-side critical section" consists only of
> > either this_cpu_inc() or atomic_long_inc(), with the latter only happening
> > in systems that have NMIs, but don't have NMI-safe per-CPU operations.
> > Neither this_cpu_inc() nor atomic_long_inc() can be interrupted, and
> > thus both act as an interrupts-disabled RCU read-side critical section.
> >
> > Therefore, if the SRCU grace-period computation fails to see an
> > srcu_read_lock_fast() increment, its earlier code is guaranteed to
> > happen before the corresponding critical section. Similarly, if the SRCU
> > grace-period computation sees an srcu_read_unlock_fast(), its subsequent
> > code is guaranteed to happen after the corresponding critical section.
> >
> > Does that help? If so, would you be interested in nominating a comment?
> >
> > Or am I missing something subtle here?
>
> Here is the root of my concern: considering a single instruction
> as an RCU-barrier "read-side" for a classic Dekker would not work,
> because the read-side would not cover both memory accesses that need
> to be ordered.
You would have a pair of RCU read-side critical sections, and if the
second one was waited on by a given call to synchronize_rcu(), then the
beginning of the first RCU reader would also have preceded that call,
and thus the synchronize_rcu() full-memory-barrier would apply to both
of those RCU read-side critical sections.
But please see below for more detail.
> I cannot help but notice the similarity between this pattern of
> barrier vs synchronize_rcu and what we allow userspace to do with
> barrier vs sys_membarrier, which has one implementation
> based on synchronize_rcu (except for TICK_NOHZ_FULL). Originally
> when membarrier was introduced, this was based on synchronize_sched(),
> and I recall that this was OK because userspace execution acted as
> a read-side critical section from the perspective of synchronize_sched().
> As commented in kernel v4.10 near synchronize_sched():
>
> * Note that this guarantee implies further memory-ordering guarantees.
> * On systems with more than one CPU, when synchronize_sched() returns,
> * each CPU is guaranteed to have executed a full memory barrier since the
> * end of its last RCU-sched read-side critical section whose beginning
> * preceded the call to synchronize_sched(). In addition, each CPU having
> * an RCU read-side critical section that extends beyond the return from
> * synchronize_sched() is guaranteed to have executed a full memory barrier
> * after the beginning of synchronize_sched() and before the beginning of
> * that RCU read-side critical section. Note that these guarantees include
> * CPUs that are offline, idle, or executing in user mode, as well as CPUs
> * that are executing in the kernel.
>
> So even though I see how synchronize_rcu() nowadays is still a good
> choice to implement sys_membarrier, it only applies to RCU read-side
> critical sections, which cover userspace code and the specific
> read-side critical sections in the kernel.
Yes, it does only apply to RCU read-side critical sections, but the
synchronize_rcu() comment header explicitly states that, given a region
of code across which interrupts are disabled, that region of code also
acts as an RCU read-side critical section:
* RCU read-side critical sections are delimited by rcu_read_lock()
* and rcu_read_unlock(), and may be nested. In addition, but only in
* v5.0 and later, regions of code across which interrupts, preemption,
* or softirqs have been disabled also serve as RCU read-side critical
* sections. This includes hardware interrupt handlers, softirq handlers,
* and NMI handlers.
Interrupts cannot happen in the midst of either this_cpu_inc() or
atomic_long_inc(), so these do act as tiny RCU readers.
And just to be painfully clear, synchronize_rcu() guarantees full barriers
associated with any reader that sticks out of the synchronize_rcu()
in either direction:
* Note that this guarantee implies further memory-ordering guarantees.
* On systems with more than one CPU, when synchronize_rcu() returns,
* each CPU is guaranteed to have executed a full memory barrier since
* the end of its last RCU read-side critical section whose beginning
* preceded the call to synchronize_rcu(). In addition, each CPU having
* an RCU read-side critical section that extends beyond the return from
* synchronize_rcu() is guaranteed to have executed a full memory barrier
* after the beginning of synchronize_rcu() and before the beginning of
* that RCU read-side critical section. Note that these guarantees include
* CPUs that are offline, idle, or executing in user mode, as well as CPUs
* that are executing in the kernel.
> But what I don't get is how synchronize_rcu() can help us promote
> the barrier() in SRCU-fast to smp_mb when outside of any RCU read-side
> critical section tracked by the synchronize_rcu grace period,
> mainly because unlike the sys_membarrier scenario, this is *not*
> userspace code.
Just for those following along, there is a barrier() call at the
end of __srcu_read_lock_fast() and another at the beginning of
__srcu_read_unlock_fast(), so that is covered.
And then let's take a look at this barrier() form of your store-buffering
litmus test:
volatile int x = 0, y = 0
CPU 0 CPU 1
WRITE_ONCE(x, 1) WRITE_ONCE(y, 1)
barrier() synchronize_rcu()
r2 = READ_ONCE(y) r4 = READ_ONCE(x)
BUG_ON(r2 == 0 && r4 == 0)
Each of CPU 0's _ONCE() accesses is implemented with a single instruction,
and each therefore cannot be interrupted. Each therefore acts as a tiny
RCU read-side critical section.
Now, if you put this into an LKMM litmus test, herd7 will in fact say
"Sometimes". But that is because LKMM does not know about interrupts,
let alone the possibility that they might be disabled, and let alone the
fact that atomic operations act like tiny regions of interrupt-disabled
code, let alone the fact that regions of interrupt-disabled code act as
RCU read-side critical sections.
But we can help LKMM out by manually placing the tiny RCU read-side
critical sections, like this:
C C-SB-o-b-o+o-sr-o
{
}
P0(int *x, int *y)
{
rcu_read_lock();
WRITE_ONCE(*x, 1);
rcu_read_unlock();
barrier();
rcu_read_lock();
r2 = READ_ONCE(*y);
rcu_read_unlock();
}
P1(int *x, int *y)
{
WRITE_ONCE(*y, 1);
synchronize_rcu();
r4 = READ_ONCE(*x);
}
exists (0:r2=0 /\ 1:r4=0)
And if we do this, here is what LKMM has to say about it:
$ herd7 -conf linux-kernel.cfg /tmp/C-SB-o-b-o+o-sr-o.litmus
Test C-SB-o-b-o+o-sr-o Allowed
States 3
0:r2=0; 1:r4=1;
0:r2=1; 1:r4=0;
0:r2=1; 1:r4=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (0:r2=0 /\ 1:r4=0)
Observation C-SB-o-b-o+o-sr-o Never 0 3
Time C-SB-o-b-o+o-sr-o 0.01
Hash=f607a3688b66756d2e85042c75d8c1fa
It says "Never", so SRCU-fast should be good from this memory-ordering
viewpoint.
Or am I missing your point?
> And what we want to order here on the read-side is the lock/unlock
> increments vs the memory accesses within the critical section, but
> there is no RCU read-side critical section that contains all those
> memory accesses to match those synchronize_rcu calls, so the promotion
> from barrier() to smp_mb() does not appear to be valid.
>
> But perhaps there is something more that is specific to the SRCU
> algorithm that I am missing here?
Again, the tiny interrupts-disabled region of code implied by an atomic
operation acts as an RCU reader, and that should do the trick.
If you are instead accusing me of using a very subtle and tricky
synchronization technique, then I plead guilty to charges as read.
On the other hand, this is the Linux-kernel RCU implementation, so it is
not like people reading this code haven't been at least implicitly warned.
I did provide comments attempting to describe this, for example,
for __srcu_read_lock_fast():
* Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
* critical sections either because they disables interrupts, because they
* are a single instruction, or because they are a read-modify-write atomic
* operation, depending on the whims of the architecture.
I would welcome upgrades that more clearly describe this trick.
The s/disables/disable/ would be one step in the right direction. :-/
And again, many thanks for digging into this!!!
And again, am I missing your point?
Thanx, Paul
> Thanks,
>
> Mathieu
>
> >
> > Either way, many thanks for digging into this!!!
> >
> > Thanx, Paul
> >
> > > Thanks,
> > >
> > > Mathieu
> > >
> > > --
> > > Mathieu Desnoyers
> > > EfficiOS Inc.
> > > https://www.efficios.com
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-16 22:54 ` Paul E. McKenney
2025-07-17 13:14 ` Mathieu Desnoyers
2025-07-17 19:04 ` [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers Paul E. McKenney
@ 2025-07-19 0:28 ` Paul E. McKenney
2 siblings, 0 replies; 29+ messages in thread
From: Paul E. McKenney @ 2025-07-19 0:28 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
Hello!
This is version 2 of a patch series introducing SRCU-fast to the
__DECLARE_TRACE() in place of the current preemption disabling.
This change enable preemption of BPF programs attached to tracepoints
in in, as is required for runtime use of BPF in real-time systems.
The patches are as follows:
1. Move rcu_is_watching() checks to srcu_read_{,un}lock_fast().
2. Add srcu_read_lock_fast_notrace() and
srcu_read_unlock_fast_notrace().
3. Add guards for notrace variants of SRCU-fast readers.
4. Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast.
Changes since RFC version:
o RFC patch 6/4 has been pulled into the shared RCU tree:
e88c632a8698 ("srcu: Add guards for SRCU-fast readers")
o RFC patch 5/4 (which removed the now-unnecessary special boot-time
avoidance of SRCU) has been folded into patch 4/4 shown above,
as suggested by Steven Rostedt.
Thanx, Paul
------------------------------------------------------------------------
b/include/linux/srcu.h | 4 ++++
b/include/linux/srcutree.h | 2 --
b/include/linux/tracepoint.h | 6 ++++--
b/kernel/tracepoint.c | 21 ++++++++++++++++++++-
include/linux/srcu.h | 30 ++++++++++++++++++++++++++++++
5 files changed, 58 insertions(+), 5 deletions(-)
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 15:18 ` Paul E. McKenney
2025-07-17 19:36 ` Mathieu Desnoyers
@ 2026-01-06 15:08 ` Mathieu Desnoyers
2026-01-06 18:30 ` Paul E. McKenney
1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2026-01-06 15:08 UTC (permalink / raw)
To: paulmck
Cc: Steven Rostedt, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On 2025-07-17 11:18, Paul E. McKenney wrote:
> On Thu, Jul 17, 2025 at 10:46:46AM -0400, Mathieu Desnoyers wrote:
>> On 2025-07-17 09:14, Mathieu Desnoyers wrote:
>>> On 2025-07-16 18:54, Paul E. McKenney wrote:
>> [...]
>>>
>>> 2) I think I'm late to the party in reviewing srcu-fast, I'll
>>> go have a look :)
>>
>> OK, I'll bite. :) Please let me know where I'm missing something:
>>
>> Looking at srcu-lite and srcu-fast, I understand that they fundamentally
>> depend on a trick we published here https://lwn.net/Articles/573497/
>> "The RCU-barrier menagerie" that allows turning, e.g. this Dekker:
>>
>> volatile int x = 0, y = 0
>>
>> CPU 0 CPU 1
>>
>> x = 1 y = 1
>> smp_mb smp_mb
>> r2 = y r4 = x
>>
>> BUG_ON(r2 == 0 && r4 == 0)
>>
>> into
>>
>> volatile int x = 0, y = 0
>>
>> CPU 0 CPU 1
>>
>> rcu_read_lock()
>> x = 1 y = 1
>> synchronize_rcu()
>> r2 = y r4 = x
>> rcu_read_unlock()
>>
>> BUG_ON(r2 == 0 && r4 == 0)
>>
>> So looking at srcu-fast, we have:
>>
>> * Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
>> * critical sections either because they disables interrupts, because they
>> * are a single instruction, or because they are a read-modify-write atomic
>> * operation, depending on the whims of the architecture.
>>
>> It appears to be pairing, as RCU read-side:
>>
>> - irq off/on implied by this_cpu_inc
>> - atomic
>> - single instruction
>>
>> with synchronize_rcu within the grace period, and hope that this behaves as a
>> smp_mb pairing preventing the srcu read-side critical section from leaking
>> out of the srcu read lock/unlock.
>>
>> I note that there is a validation that rcu_is_watching() within
>> __srcu_read_lock_fast, but it's one thing to have rcu watching, but
>> another to have an actual read-side critical section. Note that
>> preemption, irqs, softirqs can very well be enabled when calling
>> __srcu_read_lock_fast.
>>
>> My understanding of how memory barriers implemented with RCU
>> work is that we need to surround the memory accesses on the fast-path
>> (where we turn smp_mb into barrier) with an RCU read-side critical
>> section to make sure it does not span across a synchronize_rcu.
>>
>> What I am missing here is how an RCU read-side that consists only
>> of the irq off/on or atomic or single instruction can cover all memory
>> accesses we are trying to order, namely those within the srcu
>> critical section after the compiler barrier() ? Is having RCU
>> watching sufficient to guarantee this ?
>
> Good eyes!!!
>
> The trick is that this "RCU read-side critical section" consists only of
> either this_cpu_inc() or atomic_long_inc(), with the latter only happening
> in systems that have NMIs, but don't have NMI-safe per-CPU operations.
> Neither this_cpu_inc() nor atomic_long_inc() can be interrupted, and
> thus both act as an interrupts-disabled RCU read-side critical section.
>
> Therefore, if the SRCU grace-period computation fails to see an
> srcu_read_lock_fast() increment, its earlier code is guaranteed to
> happen before the corresponding critical section. Similarly, if the SRCU
> grace-period computation sees an srcu_read_unlock_fast(), its subsequent
> code is guaranteed to happen after the corresponding critical section.
>
> Does that help? If so, would you be interested in nominating a comment?
>
> Or am I missing something subtle here?
>
> Either way, many thanks for digging into this!!!
Re-reading the comment and your explanation, I think the comments are
clear enough. One nit I found while reading though:
include/linux/srcutree.h:
* Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
* critical sections either because they disables interrupts, because
disables -> disable
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2026-01-06 15:08 ` Mathieu Desnoyers
@ 2026-01-06 18:30 ` Paul E. McKenney
2026-01-06 18:43 ` Mathieu Desnoyers
0 siblings, 1 reply; 29+ messages in thread
From: Paul E. McKenney @ 2026-01-06 18:30 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Tue, Jan 06, 2026 at 10:08:44AM -0500, Mathieu Desnoyers wrote:
> On 2025-07-17 11:18, Paul E. McKenney wrote:
> > On Thu, Jul 17, 2025 at 10:46:46AM -0400, Mathieu Desnoyers wrote:
> > > On 2025-07-17 09:14, Mathieu Desnoyers wrote:
> > > > On 2025-07-16 18:54, Paul E. McKenney wrote:
> > > [...]
> > > >
> > > > 2) I think I'm late to the party in reviewing srcu-fast, I'll
> > > > go have a look :)
> > >
> > > OK, I'll bite. :) Please let me know where I'm missing something:
> > >
> > > Looking at srcu-lite and srcu-fast, I understand that they fundamentally
> > > depend on a trick we published here https://lwn.net/Articles/573497/
> > > "The RCU-barrier menagerie" that allows turning, e.g. this Dekker:
> > >
> > > volatile int x = 0, y = 0
> > >
> > > CPU 0 CPU 1
> > >
> > > x = 1 y = 1
> > > smp_mb smp_mb
> > > r2 = y r4 = x
> > >
> > > BUG_ON(r2 == 0 && r4 == 0)
> > >
> > > into
> > >
> > > volatile int x = 0, y = 0
> > >
> > > CPU 0 CPU 1
> > >
> > > rcu_read_lock()
> > > x = 1 y = 1
> > > synchronize_rcu()
> > > r2 = y r4 = x
> > > rcu_read_unlock()
> > >
> > > BUG_ON(r2 == 0 && r4 == 0)
> > >
> > > So looking at srcu-fast, we have:
> > >
> > > * Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
> > > * critical sections either because they disables interrupts, because they
> > > * are a single instruction, or because they are a read-modify-write atomic
> > > * operation, depending on the whims of the architecture.
> > >
> > > It appears to be pairing, as RCU read-side:
> > >
> > > - irq off/on implied by this_cpu_inc
> > > - atomic
> > > - single instruction
> > >
> > > with synchronize_rcu within the grace period, and hope that this behaves as a
> > > smp_mb pairing preventing the srcu read-side critical section from leaking
> > > out of the srcu read lock/unlock.
> > >
> > > I note that there is a validation that rcu_is_watching() within
> > > __srcu_read_lock_fast, but it's one thing to have rcu watching, but
> > > another to have an actual read-side critical section. Note that
> > > preemption, irqs, softirqs can very well be enabled when calling
> > > __srcu_read_lock_fast.
> > >
> > > My understanding of how memory barriers implemented with RCU
> > > work is that we need to surround the memory accesses on the fast-path
> > > (where we turn smp_mb into barrier) with an RCU read-side critical
> > > section to make sure it does not span across a synchronize_rcu.
> > >
> > > What I am missing here is how an RCU read-side that consists only
> > > of the irq off/on or atomic or single instruction can cover all memory
> > > accesses we are trying to order, namely those within the srcu
> > > critical section after the compiler barrier() ? Is having RCU
> > > watching sufficient to guarantee this ?
> >
> > Good eyes!!!
> >
> > The trick is that this "RCU read-side critical section" consists only of
> > either this_cpu_inc() or atomic_long_inc(), with the latter only happening
> > in systems that have NMIs, but don't have NMI-safe per-CPU operations.
> > Neither this_cpu_inc() nor atomic_long_inc() can be interrupted, and
> > thus both act as an interrupts-disabled RCU read-side critical section.
> >
> > Therefore, if the SRCU grace-period computation fails to see an
> > srcu_read_lock_fast() increment, its earlier code is guaranteed to
> > happen before the corresponding critical section. Similarly, if the SRCU
> > grace-period computation sees an srcu_read_unlock_fast(), its subsequent
> > code is guaranteed to happen after the corresponding critical section.
> >
> > Does that help? If so, would you be interested in nominating a comment?
> >
> > Or am I missing something subtle here?
> >
> > Either way, many thanks for digging into this!!!
> Re-reading the comment and your explanation, I think the comments are
> clear enough. One nit I found while reading though:
>
> include/linux/srcutree.h:
>
> * Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
> * critical sections either because they disables interrupts, because
>
> disables -> disable
Like this?
Thanx, Paul
------------------------------------------------------------------------
commit b3fc7ac622a19fcf921871be097e3536847406cd
Author: Paul E. McKenney <paulmck@kernel.org>
Date: Tue Jan 6 10:28:10 2026 -0800
srcu: Fix s/they disables/they disable/ typo in srcu_read_unlock_fast()
Typo fix in srcu_read_unlock_fast() header comment.
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index d6f978b50472d..727719a3cbeb9 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -259,7 +259,7 @@ static inline struct srcu_ctr __percpu *__srcu_ctr_to_ptr(struct srcu_struct *ss
* srcu_read_unlock_fast().
*
* Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
- * critical sections either because they disables interrupts, because
+ * critical sections either because they disable interrupts, because
* they are a single instruction, or because they are read-modify-write
* atomic operations, depending on the whims of the architecture.
* This matters because the SRCU-fast grace-period mechanism uses either
^ permalink raw reply related [flat|nested] 29+ messages in thread
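[Editorial note: the smp_mb-based Dekker (store-buffering) pattern quoted in
the message above can be checked in userspace with C11 atomics. Below is a
sketch, not kernel code: seq_cst fences stand in for smp_mb(), and
sb_violations() is a hypothetical helper that counts forbidden outcomes.]

```c
#include <stdatomic.h>
#include <pthread.h>

/* Store-buffering (Dekker) litmus test. With a full fence in each
 * thread, the outcome r2 == 0 && r4 == 0 is forbidden; this is the
 * pattern the thread shows being "promoted" into
 * rcu_read_lock()/synchronize_rcu(). */

static atomic_int x, y;
static int r2, r4;

static void *cpu0(void *arg)
{
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* stands in for smp_mb() */
	r2 = atomic_load_explicit(&y, memory_order_relaxed);
	return NULL;
}

static void *cpu1(void *arg)
{
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* stands in for smp_mb() */
	r4 = atomic_load_explicit(&x, memory_order_relaxed);
	return NULL;
}

/* Run the litmus test n times; return how often the forbidden
 * r2 == 0 && r4 == 0 outcome was observed (must be zero). */
static int sb_violations(int n)
{
	int bad = 0;

	for (int i = 0; i < n; i++) {
		pthread_t t0, t1;

		atomic_store(&x, 0);
		atomic_store(&y, 0);
		pthread_create(&t0, NULL, cpu0, NULL);
		pthread_create(&t1, NULL, cpu1, NULL);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);
		if (r2 == 0 && r4 == 0)
			bad++;
	}
	return bad;
}
```

Dropping either fence (or weakening it to a compiler barrier) reintroduces
the forbidden outcome on weakly ordered hardware, which is exactly the gap
the RCU-based promotion is meant to close.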
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2026-01-06 18:30 ` Paul E. McKenney
@ 2026-01-06 18:43 ` Mathieu Desnoyers
0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2026-01-06 18:43 UTC (permalink / raw)
To: paulmck
Cc: Steven Rostedt, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On 2026-01-06 13:30, Paul E. McKenney wrote:
> On Tue, Jan 06, 2026 at 10:08:44AM -0500, Mathieu Desnoyers wrote:
[...]
>> Re-reading the comment and your explanation, I think the comments are
>> clear enough. One nit I found while reading though:
>>
>> include/linux/srcutree.h:
>>
>> * Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
>> * critical sections either because they disables interrupts, because
>>
>> disables -> disable
>
> Like this?
Looks good !
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Thanks,
Mathieu
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit b3fc7ac622a19fcf921871be097e3536847406cd
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date: Tue Jan 6 10:28:10 2026 -0800
>
> srcu: Fix s/they disables/they disable/ typo in srcu_read_unlock_fast()
>
> Typo fix in srcu_read_unlock_fast() header comment.
>
> Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>
> diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
> index d6f978b50472d..727719a3cbeb9 100644
> --- a/include/linux/srcutree.h
> +++ b/include/linux/srcutree.h
> @@ -259,7 +259,7 @@ static inline struct srcu_ctr __percpu *__srcu_ctr_to_ptr(struct srcu_struct *ss
> * srcu_read_unlock_fast().
> *
> * Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
> - * critical sections either because they disables interrupts, because
> + * critical sections either because they disable interrupts, because
> * they are a single instruction, or because they are read-modify-write
> * atomic operations, depending on the whims of the architecture.
> * This matters because the SRCU-fast grace-period mechanism uses either
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
2025-07-17 19:36 ` Mathieu Desnoyers
2025-07-17 21:27 ` Paul E. McKenney
@ 2026-01-07 1:26 ` Joel Fernandes
1 sibling, 0 replies; 29+ messages in thread
From: Joel Fernandes @ 2026-01-07 1:26 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: paulmck, Steven Rostedt, Sebastian Andrzej Siewior, Boqun Feng,
linux-rt-devel, rcu, linux-trace-kernel, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Lai Jiangshan, Masami Hiramatsu,
Neeraj Upadhyay, Thomas Gleixner, Uladzislau Rezki, Zqiang, bpf
On Thu, Jul 17, 2025 at 03:36:46PM -0400, Mathieu Desnoyers wrote:
> On 2025-07-17 11:18, Paul E. McKenney wrote:
> > On Thu, Jul 17, 2025 at 10:46:46AM -0400, Mathieu Desnoyers wrote:
> > > On 2025-07-17 09:14, Mathieu Desnoyers wrote:
> > > > On 2025-07-16 18:54, Paul E. McKenney wrote:
> > > [...]
> > > >
> > > > 2) I think I'm late to the party in reviewing srcu-fast, I'll
> > > > go have a look :)
> > >
> > > OK, I'll bite. :) Please let me know where I'm missing something:
> > >
> > > Looking at srcu-lite and srcu-fast, I understand that they fundamentally
> > > depend on a trick we published here https://lwn.net/Articles/573497/
> > > "The RCU-barrier menagerie" that allows turning, e.g. this Dekker:
> > >
> > > volatile int x = 0, y = 0
> > >
> > > CPU 0 CPU 1
> > >
> > > x = 1 y = 1
> > > smp_mb smp_mb
> > > r2 = y r4 = x
> > >
> > > BUG_ON(r2 == 0 && r4 == 0)
> > >
> > > into
> > >
> > > volatile int x = 0, y = 0
> > >
> > > CPU 0 CPU 1
> > >
> > > rcu_read_lock()
> > > x = 1 y = 1
> > > synchronize_rcu()
> > > r2 = y r4 = x
> > > rcu_read_unlock()
> > >
> > > BUG_ON(r2 == 0 && r4 == 0)
> > >
> > > So looking at srcu-fast, we have:
> > >
> > > * Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
> > > * critical sections either because they disables interrupts, because they
> > > * are a single instruction, or because they are a read-modify-write atomic
> > > * operation, depending on the whims of the architecture.
> > >
> > > It appears to be pairing, as RCU read-side:
> > >
> > > - irq off/on implied by this_cpu_inc
> > > - atomic
> > > - single instruction
> > >
> > > with synchronize_rcu within the grace period, hoping that this behaves as
> > > an smp_mb pairing preventing the srcu read-side critical section from
> > > leaking out of the srcu read lock/unlock.
> > >
> > > I note that there is a validation that rcu_is_watching() within
> > > __srcu_read_lock_fast, but it's one thing to have RCU watching and
> > > another to have an actual read-side critical section. Note that
> > > preemption, irqs, and softirqs can very well be enabled when calling
> > > __srcu_read_lock_fast.
> > >
> > > My understanding of how memory barriers implemented with RCU
> > > work is that we need to surround the memory accesses on the fast-path
> > > (where we turn smp_mb into barrier) with an RCU read-side critical
> > > section to make sure it does not span across a synchronize_rcu.
> > >
> > > What I am missing here is how an RCU read-side critical section that
> > > only consists of the irq off/on, atomic, or single instruction can
> > > cover all memory accesses we are trying to order, namely those within
> > > the srcu critical section after the compiler barrier(). Is having RCU
> > > watching sufficient to guarantee this?
> >
> > Good eyes!!!
> >
> > The trick is that this "RCU read-side critical section" consists only of
> > either this_cpu_inc() or atomic_long_inc(), with the latter only happening
> > in systems that have NMIs, but don't have NMI-safe per-CPU operations.
> > Neither this_cpu_inc() nor atomic_long_inc() can be interrupted, and
> > thus both act as an interrupts-disabled RCU read-side critical section.
> >
> > Therefore, if the SRCU grace-period computation fails to see an
> > srcu_read_lock_fast() increment, its earlier code is guaranteed to
> > happen before the corresponding critical section. Similarly, if the SRCU
> > grace-period computation sees an srcu_read_unlock_fast(), its subsequent
> > code is guaranteed to happen after the corresponding critical section.
> >
> > Does that help? If so, would you be interested in nominating a comment?
> >
> > Or am I missing something subtle here?
>
> Here is the root of my concern: considering a single instruction
> as an RCU-barrier "read-side" for a classic Dekker would not work,
> because the read-side would not cover both memory accesses that need
> to be ordered.
I had similar questions, so let me share, assuming I even understood your
question correctly. I think what might help is to note that for
synchronize_rcu(), any transition to and from RCU watching is a full memory
barrier. And if I understand correctly, because SRCU fast is not allowed
when RCU is not watching, the memory ordering required for RCU correctness
is automatically taken care of because the update side uses synchronize_rcu().
That is my understanding. Could you clarify how your question is related to the
single instruction thing? I think if you apply the
classical RCU reasoning where RCU (regular RCU) read-side critical sections
do not have any memory barriers, then you will see the similarity of that
with srcu-fast.
thanks,
- Joel
>
> I cannot help but notice the similarity between this pattern of
> barrier vs synchronize_rcu and what we allow userspace to do with
> barrier vs sys_membarrier, which has one implementation
> based on synchronize_rcu (except for TICK_NOHZ_FULL). Originally
> when membarrier was introduced, this was based on synchronize_sched(),
> and I recall that this was OK because userspace execution acted as
> a read-side critical section from the perspective of synchronize_sched().
> As commented in kernel v4.10 near synchronize_sched():
>
> * Note that this guarantee implies further memory-ordering guarantees.
> * On systems with more than one CPU, when synchronize_sched() returns,
> * each CPU is guaranteed to have executed a full memory barrier since the
> * end of its last RCU-sched read-side critical section whose beginning
> * preceded the call to synchronize_sched(). In addition, each CPU having
> * an RCU read-side critical section that extends beyond the return from
> * synchronize_sched() is guaranteed to have executed a full memory barrier
> * after the beginning of synchronize_sched() and before the beginning of
> * that RCU read-side critical section. Note that these guarantees include
> * CPUs that are offline, idle, or executing in user mode, as well as CPUs
> * that are executing in the kernel.
>
> So even though I see how synchronize_rcu() nowadays is still a good
> choice to implement sys_membarrier, it only applies to RCU read-side
> critical sections, which cover userspace code and the specific
> read-side critical sections in the kernel.
>
> But what I don't get is how synchronize_rcu() can help us promote
> the barrier() in SRCU-fast to smp_mb when outside of any RCU read-side
> critical section tracked by the synchronize_rcu grace period,
> mainly because unlike the sys_membarrier scenario, this is *not*
> userspace code.
>
> And what we want to order here on the read-side is the lock/unlock
> increments vs the memory accesses within the critical section, but
> there is no RCU read-side critical section that contains all those memory
> accesses matching those synchronize_rcu calls, so the promotion from
> barrier to smp_mb doesn't appear to be valid.
>
> But perhaps there is something more that is specific to the SRCU
> algorithm that I am missing here?
>
> Thanks,
>
> Mathieu
>
> >
> > Either way, many thanks for digging into this!!!
> >
> > Thanx, Paul
> >
> > > Thanks,
> > >
> > > Mathieu
> > >
> > > --
> > > Mathieu Desnoyers
> > > EfficiOS Inc.
> > > https://www.efficios.com
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
^ permalink raw reply [flat|nested] 29+ messages in thread
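[Editorial note: the grace-period reasoning above — the updater either fails
to see a reader's lock increment, or sees the matching unlock, and orders
itself accordingly — can be modeled with a toy userspace sketch. All names
here are hypothetical and plain C11 atomics stand in for the kernel's
per-CPU counters and this_cpu_inc(); this is an illustration of the counting
scheme, not the SRCU implementation.]

```c
#include <stdatomic.h>
#include <pthread.h>
#include <unistd.h>

/* Toy model of the SRCU-fast counting scheme: each side has lock and
 * unlock counters; the grace period flips the active side, then waits
 * for the old side's counters to balance. */

static atomic_long toy_locks[2], toy_unlocks[2];
static atomic_int toy_idx;
static atomic_int toy_in_cs, toy_work_done;

static int toy_read_lock(void)
{
	int i = atomic_load(&toy_idx);

	atomic_fetch_add(&toy_locks[i], 1);	/* stands in for this_cpu_inc() */
	return i;
}

static void toy_read_unlock(int i)
{
	atomic_fetch_add(&toy_unlocks[i], 1);
}

static void toy_synchronize(void)
{
	int i = atomic_fetch_xor(&toy_idx, 1);	/* flip side, get old index */

	/* Wait for pre-existing readers on the old side to drain. */
	while (atomic_load(&toy_unlocks[i]) != atomic_load(&toy_locks[i]))
		usleep(1000);
}

static void *toy_reader(void *arg)
{
	int i = toy_read_lock();

	atomic_store(&toy_in_cs, 1);
	usleep(50 * 1000);			/* linger in the critical section */
	atomic_store(&toy_work_done, 1);
	toy_read_unlock(i);
	return NULL;
}

/* Returns toy_work_done as seen right after toy_synchronize() returns;
 * 1 means the grace period waited for the pre-existing reader. */
static int toy_demo(void)
{
	pthread_t t;
	int seen;

	pthread_create(&t, NULL, toy_reader, NULL);
	while (!atomic_load(&toy_in_cs))
		usleep(1000);			/* reader is now inside its section */
	toy_synchronize();
	seen = atomic_load(&toy_work_done);
	pthread_join(t, NULL);
	return seen;
}
```

The real kernel code additionally has to close the race between a reader
sampling the index and incrementing the counter, which is where the
single-instruction/interrupts-disabled argument from the thread comes in;
the toy model glosses over that.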
end of thread, other threads:[~2026-01-07 1:27 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20250623104941.WxOQtAmV@linutronix.de>
[not found] ` <03083dee-6668-44bb-9299-20eb68fd00b8@paulmck-laptop>
[not found] ` <fa80f087-d4ff-4499-aec9-157edafb85eb@paulmck-laptop>
[not found] ` <29b5c215-7006-4b27-ae12-c983657465e1@efficios.com>
[not found] ` <acb07426-db2f-4268-97e2-a9588c921366@paulmck-laptop>
[not found] ` <ba0743dc-8644-4355-862b-d38a7791da4c@efficios.com>
[not found] ` <512331d8-fdb4-4dc1-8d9b-34cc35ba48a5@paulmck-laptop>
[not found] ` <bbe08cca-72c4-4bd2-a894-97227edcd1ad@efficios.com>
[not found] ` <16dd7f3c-1c0f-4dfd-bfee-4c07ec844b72@paulmck-laptop>
[not found] ` <20250716110922.0dadc4ec@batman.local.home>
2025-07-16 20:35 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Paul E. McKenney
2025-07-16 22:54 ` Paul E. McKenney
2025-07-17 13:14 ` Mathieu Desnoyers
2025-07-17 14:46 ` Mathieu Desnoyers
2025-07-17 15:18 ` Paul E. McKenney
2025-07-17 19:36 ` Mathieu Desnoyers
2025-07-17 21:27 ` Paul E. McKenney
2026-01-07 1:26 ` Joel Fernandes
2026-01-06 15:08 ` Mathieu Desnoyers
2026-01-06 18:30 ` Paul E. McKenney
2026-01-06 18:43 ` Mathieu Desnoyers
2025-07-17 14:57 ` Alexei Starovoitov
2025-07-17 15:12 ` Steven Rostedt
2025-07-17 15:27 ` Alexei Starovoitov
2025-07-17 15:40 ` Steven Rostedt
2025-07-17 15:55 ` Steven Rostedt
2025-07-17 16:02 ` Alexei Starovoitov
2025-07-17 16:19 ` Steven Rostedt
2025-07-17 17:38 ` Mathieu Desnoyers
2025-07-17 16:04 ` Paul E. McKenney
2025-07-17 15:44 ` Paul E. McKenney
2025-07-17 15:30 ` Paul E. McKenney
2025-07-17 15:07 ` Paul E. McKenney
2025-07-17 19:04 ` [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers Paul E. McKenney
2025-07-17 19:19 ` Steven Rostedt
2025-07-17 19:51 ` Paul E. McKenney
2025-07-17 19:56 ` Steven Rostedt
2025-07-17 20:38 ` Paul E. McKenney
2025-07-19 0:28 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Paul E. McKenney
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox