From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: paulmck@kernel.org
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
Boqun Feng <boqun.feng@gmail.com>,
linux-rt-devel@lists.linux.dev, rcu@vger.kernel.org,
linux-trace-kernel@vger.kernel.org,
Frederic Weisbecker <frederic@kernel.org>,
Joel Fernandes <joelagnelf@nvidia.com>,
Josh Triplett <josh@joshtriplett.org>,
Lai Jiangshan <jiangshanlai@gmail.com>,
Masami Hiramatsu <mhiramat@kernel.org>,
Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
Steven Rostedt <rostedt@goodmis.org>,
Thomas Gleixner <tglx@linutronix.de>,
Uladzislau Rezki <urezki@gmail.com>,
Zqiang <qiang.zhang@linux.dev>
Subject: Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
Date: Fri, 11 Jul 2025 09:46:25 -0400 [thread overview]
Message-ID: <bbe08cca-72c4-4bd2-a894-97227edcd1ad@efficios.com> (raw)
In-Reply-To: <512331d8-fdb4-4dc1-8d9b-34cc35ba48a5@paulmck-laptop>
On 2025-07-09 14:33, Paul E. McKenney wrote:
> On Wed, Jul 09, 2025 at 10:31:14AM -0400, Mathieu Desnoyers wrote:
>> On 2025-07-08 16:49, Paul E. McKenney wrote:
>>> On Tue, Jul 08, 2025 at 03:40:05PM -0400, Mathieu Desnoyers wrote:
>>>> On 2025-07-07 17:56, Paul E. McKenney wrote:
>>>>> On Mon, Jun 23, 2025 at 11:13:03AM -0700, Paul E. McKenney wrote:
>>>>>> On Mon, Jun 23, 2025 at 12:49:41PM +0200, Sebastian Andrzej Siewior wrote:
>>>>>>> On 2025-06-20 04:23:49 [-0700], Paul E. McKenney wrote:
[...]
>
>> I have nothing against an immediate IRQ-work, but I'm worried about
>> _nested_ immediate IRQ-work, where we end up triggering a endless
>> recursion of IRQ-work through instrumentation.
>>
>> Ideally we would want to figure out a way to prevent endless recursion,
>> while keeping the immediate IRQ-work within rcu_read_unlock_notrace()
>> to keep RT latency within bounded ranges, but without adding unwanted
>> overhead by adding too many conditional branches to the fast-paths.
>>
>> Is there a way to issue a given IRQ-work only when not nested in that
>> IRQ-work handler ? Any way we can detect and prevent that recursion
>> should work fine.
>
> You are looking for this commit from Joel Fernandes:
>
> 3284e4adca9b ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
>
> This is in the shiny-ish new-ish shared RCU tree at:
>
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/rcu/linux
>
> Or Message-Id <20250709104118.15532-6-neeraj.upadhyay@kernel.org>.
Whatever means of tracking whether we are already in an irq work
context and preventing emission of a recursive irq work should do.
Do I understand correctly that this commit takes care of preventing
recursive irq_work_queue_on() calls, but does not solve the issue of
recursive raise_softirq_irqoff() caused by tracepoint instrumentation?
[...]
>>
>> As a side-note, I think the "_notrace" suffix I've described comes from
>> the "notrace" function attribute background, which is indeed also used
>> to prevent function tracing of the annotated functions, for similar
>> purposes.
>>
>> Kprobes also has annotation mechanisms to prevent inserting breakpoints
>> in specific functions, and in other cases we rely on compiler flags to
>> prevent instrumentation of entire objects.
>>
>> But mostly the goal of all of those mechanisms is the same: allow some
>> kernel code to be used from instrumentation and tracer callbacks without
>> triggering endless recursion.
>
> OK, so is there some tracing anti-recursion flag in effect?
> Or is there something else that I must do before invoking either
> raise_softirq_irqoff() or irq_work_queue_on()?
Note that we have two scenarios:

- A nested scenario with a causality relationship, IOW a circular
  dependency, which leads to endless recursion and a crash, and
- A nested scenario which is just the result of softirq, irq, and nmi
  nesting.

In both scenarios, what we really care about here is making sure the
outermost execution context emits the irq work to prevent long latency;
we don't care if the nested contexts skip it.
>
> Plus RCU does need *some* hint that it is not supposed to invoke
> rcu_preempt_deferred_qs_irqrestore() in this case. Which might be
> why you are suggesting an rcu_read_unlock_notrace().
The rcu_read_unlock_notrace() approach removes RCU instrumentation, even
for the outermost nesting level. We'd be losing some RCU instrumentation
coverage there, but on the upside we would prevent emitting an RCU read
unlock tracepoint for every other tracepoint hit, which I think is good.
If instead we take an approach where we track in which instrumentation
nesting level we are within the task_struct, then we can skip calls to
rcu_preempt_deferred_qs_irqrestore() in all nested contexts, but keep
calling it in the outermost context.
But I would tend to favor the notrace approach so we don't emit
semi-useless RCU read lock/unlock events for every other tracepoint
hit. It would clutter the trace.
>
>>>>> One possible path forward is to ensure that rcu_read_unlock_special()
>>>>> calls only functions that are compatible with the notrace/trace
>>>>> requirements. The ones that look like they might need some help are
>>>>> raise_softirq_irqoff() and irq_work_queue_on(). Note that although
>>>>> rcu_preempt_deferred_qs_irqrestore() would also need help, it is easy to
>>>>> avoid its being invoked, for example, by disabling interrupts across the
>>>>> call to rcu_read_unlock_notrace(). Or by making rcu_read_unlock_notrace()
>>>>> do the disabling.
>>>>>
>>>>> However, I could easily be missing something, especially given my being
>>>>> confused by the juxtaposition of "notrace" and "usable to trace the
>>>>> RCU implementation". These appear to me to be contradicting each other.
>>>>>
>>>>> Help?
>>>>
>>>> You indeed need to ensure that everything that is called from
>>>> rcu_read_{lock,unlock}_notrace doesn't end up executing instrumentation
>>>> to prevent a circular dependency. You hinted at a few ways to achieve
>>>> this. Other possible approaches:
>>>>
>>>> - Add a "trace" bool parameter to rcu_read_unlock_special(),
>>>> - Duplicate rcu_read_unlock_special() and introduce a notrace symbol.
>>>
>>> OK, both of these are reasonable alternatives for the API, but it will
>>> still be necessary to figure out how to make the notrace-incompatible
>>> work happen.
>>
>> One downside of those two approaches is that they require to somewhat
>> duplicate the code (trace vs notrace). This makes it tricky in the
>> case of irq work, because the irq work is just some interrupt, so we're
>> limited in how we can pass around parameters that would use the
>> notrace code.
>
> Joel's patch, which is currently slated for the upcoming merge window,
> should take care of the endless-IRQ-work recursion problem. So the
> main remaining issue is how rcu_read_unlock_special() should go about
> safely invoking raise_softirq_irqoff() and irq_work_queue_on() when in
> notrace mode.
Are those two functions only needed from the outermost rcu_read_unlock
within a thread? If yes, then keeping track of the nesting level and
preventing those calls in nested contexts (per thread) should work.
>
>>>> - Keep some nesting count in the task struct to prevent calling the
>>>> instrumentation when nested in notrace,
>>>
>>> OK, for this one, is the idea to invoke some TBD RCU API the tracing
>>> exits the notrace region? I could see that working. But there would
>>> need to be a guarantee that if the notrace API was invoked, a call to
>>> this TBD RCU API would follow in short order. And I suspect that
>>> preemption (and probably also interrupts) would need to be disabled
>>> across this region.
>>
>> Not quite.
>>
>> What I have in mind is to try to find the most elegant way to prevent
>> endless recursion of the irq work issued immediately on
>> rcu_read_unlock_notrace without slowing down most fast paths, and
>> ideally without too much code duplication.
>>
>> I'm not entirely sure what would be the best approach though.
>
> Joel's patch adjusts use of the rcu_data structure's ->defer_qs_iw_pending
> flag, so that it is cleared not in the IRQ-work handler, but
> instead in rcu_preempt_deferred_qs_irqrestore(). That prevents
> rcu_read_unlock_special() from requeueing the IRQ-work handler until
> after the previous request for a quiescent state has been satisfied.
>
> So my main concern is again safely invoking raise_softirq_irqoff()
> and irq_work_queue_on() when in notrace mode.
Would the nesting counter (per thread) approach suffice for your use
case?
>
>>>> There are probably other possible approaches I am missing, each with
>>>> their respective trade offs.
>>>
>>> I am pretty sure that we also have some ways to go before we have the
>>> requirements fully laid out, for that matter. ;-)
>>>
>>> Could you please tell me where in the current tracing code these
>>> rcu_read_lock_notrace()/rcu_read_unlock_notrace() calls would be placed?
>>
>> AFAIU here:
>>
>> include/linux/tracepoint.h:
>>
>> #define __DECLARE_TRACE(name, proto, args, cond, data_proto) [...]
>>
>> static inline void __do_trace_##name(proto) \
>> { \
>> if (cond) { \
>> guard(preempt_notrace)(); \
>> __DO_TRACE_CALL(name, TP_ARGS(args)); \
>> } \
>> } \
>> static inline void trace_##name(proto) \
>> { \
>> if (static_branch_unlikely(&__tracepoint_##name.key)) \
>> __do_trace_##name(args); \
>> if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
>> WARN_ONCE(!rcu_is_watching(), \
>> "RCU not watching for tracepoint"); \
>> } \
>> }
>>
>> and
>>
>> #define __DECLARE_TRACE_SYSCALL(name, proto, args, data_proto) [...]
>>
>> static inline void __do_trace_##name(proto) \
>> { \
>> guard(rcu_tasks_trace)(); \
>> __DO_TRACE_CALL(name, TP_ARGS(args)); \
>> } \
>> static inline void trace_##name(proto) \
>> { \
>> might_fault(); \
>> if (static_branch_unlikely(&__tracepoint_##name.key)) \
>> __do_trace_##name(args); \
>> if (IS_ENABLED(CONFIG_LOCKDEP)) { \
>> WARN_ONCE(!rcu_is_watching(), \
>> "RCU not watching for tracepoint"); \
>> } \
>> }
>
> I am not seeing a guard(rcu)() in here, only guard(preempt_notrace)()
> and guard(rcu_tasks_trace)(). Or is the idea to move the first to
> guard(rcu_notrace)() in order to improve PREEMPT_RT latency?
AFAIU the goal here is to turn the guard(preempt_notrace)() into a
guard(rcu_notrace)() because the preempt-off critical sections don't
agree with BPF.
Thanks,
Mathieu
>
> Thanx, Paul
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com