linux-trace-kernel.vger.kernel.org archive mirror
From: "Paul E. McKenney" <paulmck@kernel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Boqun Feng <boqun.feng@gmail.com>,
	linux-rt-devel@lists.linux.dev, rcu@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org,
	Frederic Weisbecker <frederic@kernel.org>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	Josh Triplett <josh@joshtriplett.org>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Uladzislau Rezki <urezki@gmail.com>,
	Zqiang <qiang.zhang@linux.dev>
Subject: Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()
Date: Tue, 8 Jul 2025 13:49:05 -0700	[thread overview]
Message-ID: <acb07426-db2f-4268-97e2-a9588c921366@paulmck-laptop>
In-Reply-To: <29b5c215-7006-4b27-ae12-c983657465e1@efficios.com>

On Tue, Jul 08, 2025 at 03:40:05PM -0400, Mathieu Desnoyers wrote:
> On 2025-07-07 17:56, Paul E. McKenney wrote:
> > On Mon, Jun 23, 2025 at 11:13:03AM -0700, Paul E. McKenney wrote:
> > > On Mon, Jun 23, 2025 at 12:49:41PM +0200, Sebastian Andrzej Siewior wrote:
> > > > On 2025-06-20 04:23:49 [-0700], Paul E. McKenney wrote:
> > > > > > I hope not because it is not any different from
> > > > > > 
> > > > > >   	CPU 2			CPU 3
> > > > > >   	=====                   =====
> > > > > > 	NMI
> > > > > >   	rcu_read_lock();
> > > > > >   				synchronize_rcu();
> > > > > >   				// need all CPUs report a QS.
> > > > > >   	rcu_read_unlock();
> > > > > >   	// no rcu_read_unlock_special() due to in_nmi().
> > > > > > 
> > > > > > If the NMI happens while the CPU is in userland (say a perf event) then
> > > > > > the NMI returns directly to userland.
> > > > > > After the tracing event completes (in this case) the CPU should run into
> > > > > > another RCU section on its way out via context switch or the tick
> > > > > > interrupt.
> > > > > > I assume the tick interrupt is what makes the NMI case work.
> > > > > 
> > > > > Are you promising that interrupts will always be disabled across
> > > > > the whole rcu_read_lock_notrace() read-side critical section?  If so,
> > > > > could we please have a lockdep_assert_irqs_disabled() call to check that?
> > > > 
> > > > No, that should stay preemptible because bpf can attach itself to
> > > > tracepoints and this is the root cause of the exercise. Now if you say
> > > > it has to be run with disabled interrupts to match the NMI case then it
> > > > makes sense (since NMIs have interrupts off) but I do not understand why
> > > > it matters here (since the CPU returns to userland without passing the
> > > > kernel).
> > > 
> > > Given your patch, if you don't disable interrupts in a preemptible kernel
> > > across your rcu_read_lock_notrace()/rcu_read_unlock_notrace() pair, then a
> > > concurrent expedited grace period might send its IPI in the middle of that
> > > critical section.  That IPI handler would set up state so that the next
> > > rcu_preempt_deferred_qs_irqrestore() would report the quiescent state.
> > > Except that without the call to rcu_read_unlock_special(), there might
> > > not be any subsequent call to rcu_preempt_deferred_qs_irqrestore().
> > > 
> > > This is even more painful if this is a CONFIG_PREEMPT_RT kernel.
> > > Then if that critical section was preempted and then priority-boosted,
> > > the unboosting also won't happen until the next call to that same
> > > rcu_preempt_deferred_qs_irqrestore() function, which again might not
> > > happen.  Or might be significantly delayed.
> > > 
> > > Or am I missing some trick that fixes all of this?
> > > 
> > > > I'm not sure how much can be done here due to the notrace part. Assuming
> > > > rcu_read_unlock_special() is not doable, would forcing a context switch
> > > > (via setting need-resched and irq_work, as in the IRQ-off case) do the
> > > > trick?
> > > > Looking through rcu_preempt_deferred_qs_irqrestore() it does not look to
> > > > be "usable from the scheduler (with rq lock held)" due to RCU-boosting
> > > > or the wake of expedited_wq (which is one of the requirements).
> > > 
> > > But if rq_lock is held, then interrupts are disabled, which will
> > > cause the unboosting to be deferred.
> > > 
> > > Or are the various deferral mechanisms also unusable in this context?
> > 
> > OK, looking back through this thread, it appears that you need both
> > an rcu_read_lock_notrace() and an rcu_read_unlock_notrace() that are
> > covered by Mathieu's list of requirements [1]:
> 
> I'm just jumping into this email thread, so I'll try to clarify
> what I may.

Thank you for jumping in!

> > | - NMI-safe
> > 	This is covered by the existing rcu_read_lock() and
> > 	rcu_read_unlock().
> 
> OK
> 
> > | - notrace
> > 	I am guessing that by "notrace", you mean the "notrace" CPP
> > 	macro attribute defined in include/linux/compiler_types.h.
> > 	This has no fewer than four different definitions, so I will
> > 	need some help understanding what the restrictions are.
> 
> When I listed notrace in my desiderata, I had in mind the
> preempt_{disable,enable}_notrace macros, which ensure that
> those macros don't call instrumentation, and therefore can be
> used from the implementation of tracepoint instrumentation or
> from a tracer callback.

OK, that makes sense.  I have some questions:

Are interrupts disabled at the calls to rcu_read_unlock_notrace()?

If interrupts are instead enabled, is it OK to use an IRQ-work handler
that does an immediate self-interrupt and then a bunch of non-notrace work?

If interrupts are to be enabled, but an immediate IRQ-work handler is
ruled out, what mechanism should I be using to defer the work, given
that it cannot be deferred for very long without messing up real-time
latencies?  As in a delay of nanoseconds or maybe a microsecond, but
not multiple microseconds let alone milliseconds?

(I could imagine short-timeout hrtimer, but I am not seeing notrace
variants of these guys, either.)

> > | - usable from the scheduler (with rq lock held)
> > 	This is covered by the existing rcu_read_lock() and
> > 	rcu_read_unlock().
> 
> OK
> 
> > | - usable to trace the RCU implementation
> > 	This one I don't understand.  Can I put tracepoints on
> > 	rcu_read_lock_notrace() and rcu_read_unlock_notrace() or can't I?
> > 	I was assuming that tracepoints would be forbidden.  Until I
> > 	reached this requirement, that is.
> 
> At a high level, an rcu_read_{lock,unlock}_notrace should be something
> that is safe to call from the tracepoint implementation or from a
> tracer callback.
> 
> My expectation is that the "normal" (not notrace) RCU APIs would
> be instrumented with tracepoints, and tracepoints and tracer callbacks
> are allowed to use the _notrace RCU APIs.
> 
> This provides instrumentation coverage of RCU with the exception of the
> _notrace users, which is pretty much the best we can hope for without
> having a circular dependency.

OK, got it, and that does make sense.  I am not sure how I would have
intuited that from the description, but what is life without a challenge?

> > One possible path forward is to ensure that rcu_read_unlock_special()
> > calls only functions that are compatible with the notrace/trace
> > requirements.  The ones that look like they might need some help are
> > raise_softirq_irqoff() and irq_work_queue_on().  Note that although
> > rcu_preempt_deferred_qs_irqrestore() would also need help, it is easy to
> > avoid its being invoked, for example, by disabling interrupts across the
> > call to rcu_read_unlock_notrace().  Or by making rcu_read_unlock_notrace()
> > do the disabling.
> > 
> > However, I could easily be missing something, especially given my being
> > confused by the juxtaposition of "notrace" and "usable to trace the
> > RCU implementation".  These appear to me to be contradicting each other.
> > 
> > Help?
> 
> You indeed need to ensure that everything that is called from
> rcu_read_{lock,unlock}_notrace doesn't end up executing instrumentation,
> to prevent a circular dependency. You hinted at a few ways to achieve
> this. Other possible approaches:
> 
> - Add a "trace" bool parameter to rcu_read_unlock_special(),
> - Duplicate rcu_read_unlock_special() and introduce a notrace symbol.

OK, both of these are reasonable alternatives for the API, but it will
still be necessary to figure out how to make the notrace-incompatible
work happen.

> - Keep some nesting count in the task struct to prevent calling the
>   instrumentation when nested in notrace,

OK, for this one, is the idea to invoke some TBD RCU API when the tracing
exits the notrace region?  I could see that working.  But there would
need to be a guarantee that if the notrace API was invoked, a call to
this TBD RCU API would follow in short order.  And I suspect that
preemption (and probably also interrupts) would need to be disabled
across this region.

> There are probably other possible approaches I am missing, each with
> their respective trade-offs.

I am pretty sure that we also have some ways to go before we have the
requirements fully laid out, for that matter.  ;-)

Could you please tell me where in the current tracing code these
rcu_read_lock_notrace()/rcu_read_unlock_notrace() calls would be placed?

							Thanx, Paul

Thread overview: 49+ messages
2025-06-13 15:22 [RFC PATCH 0/2] Switch tracing from sched-RCU to preempt-RCU Sebastian Andrzej Siewior
2025-06-13 15:22 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Sebastian Andrzej Siewior
2025-06-18 17:21   ` Boqun Feng
2025-06-20  8:43     ` Sebastian Andrzej Siewior
2025-06-20 11:23       ` Paul E. McKenney
2025-06-23 10:49         ` Sebastian Andrzej Siewior
2025-06-23 18:13           ` Paul E. McKenney
2025-07-07 21:56             ` Paul E. McKenney
2025-07-08 19:40               ` Mathieu Desnoyers
2025-07-08 20:49                 ` Paul E. McKenney [this message]
2025-07-09 14:31                   ` Mathieu Desnoyers
2025-07-09 18:33                     ` Paul E. McKenney
2025-07-11 13:46                       ` Mathieu Desnoyers
2025-07-11 17:05                         ` Paul E. McKenney
2025-07-14 16:34                           ` Paul E. McKenney
2025-07-15 19:56                             ` Mathieu Desnoyers
2025-07-15 23:23                               ` Paul E. McKenney
2025-07-15 19:54                           ` Mathieu Desnoyers
2025-07-15 23:18                             ` Paul E. McKenney
2025-07-16  0:42                               ` Paul E. McKenney
2025-07-16  4:41                                 ` Paul E. McKenney
2025-07-16 15:09                           ` Steven Rostedt
2025-07-16 20:35                             ` Paul E. McKenney
2025-07-16 22:54                               ` Paul E. McKenney
2025-07-17 13:14                                 ` Mathieu Desnoyers
2025-07-17 14:46                                   ` Mathieu Desnoyers
2025-07-17 15:18                                     ` Paul E. McKenney
2025-07-17 19:36                                       ` Mathieu Desnoyers
2025-07-17 21:27                                         ` Paul E. McKenney
2025-07-17 14:57                                   ` Alexei Starovoitov
2025-07-17 15:12                                     ` Steven Rostedt
2025-07-17 15:27                                       ` Alexei Starovoitov
2025-07-17 15:40                                         ` Steven Rostedt
2025-07-17 15:55                                           ` Steven Rostedt
2025-07-17 16:02                                             ` Alexei Starovoitov
2025-07-17 16:19                                               ` Steven Rostedt
2025-07-17 17:38                                               ` Mathieu Desnoyers
2025-07-17 16:04                                             ` Paul E. McKenney
2025-07-17 15:44                                         ` Paul E. McKenney
2025-07-17 15:30                                     ` Paul E. McKenney
2025-07-17 15:07                                   ` Paul E. McKenney
2025-07-17 19:04                                 ` [PATCH RFC 6/4] srcu: Add guards for SRCU-fast readers Paul E. McKenney
2025-07-17 19:19                                   ` Steven Rostedt
2025-07-17 19:51                                     ` Paul E. McKenney
2025-07-17 19:56                                       ` Steven Rostedt
2025-07-17 20:38                                         ` Paul E. McKenney
2025-07-19  0:28                                 ` [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace() Paul E. McKenney
2025-06-13 15:22 ` [RFC PATCH 2/2] trace: Use rcu_read_lock() instead preempt_disable() Sebastian Andrzej Siewior
2025-06-13 15:38 ` [RFC PATCH 0/2] Switch tracing from sched-RCU to preempt-RCU Mathieu Desnoyers
