From: Namhyung Kim <namhyung@kernel.org>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
Masami Hiramatsu <mhiramat@kernel.org>,
Mark Rutland <mark.rutland@arm.com>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Andrew Morton <akpm@linux-foundation.org>,
Josh Poimboeuf <jpoimboe@kernel.org>,
x86@kernel.org, Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@kernel.org>,
Arnaldo Carvalho de Melo <acme@kernel.org>,
Indu Bhagat <indu.bhagat@oracle.com>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>,
Adrian Hunter <adrian.hunter@intel.com>,
linux-perf-users@vger.kernel.org, Mark Brown <broonie@kernel.org>,
linux-toolchains@vger.kernel.org, Jordan Rome <jordalgo@meta.com>,
Sam James <sam@gentoo.org>,
Andrii Nakryiko <andrii.nakryiko@gmail.com>,
Jens Remus <jremus@linux.ibm.com>,
Florian Weimer <fweimer@redhat.com>,
Andy Lutomirski <luto@kernel.org>, Weinan Liu <wnliu@google.com>,
Blake Jones <blakejones@google.com>,
Beau Belgrave <beaub@linux.microsoft.com>,
"Jose E. Marchesi" <jemarch@gnu.org>,
Alexander Aring <aahringo@redhat.com>
Subject: Re: [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events
Date: Fri, 2 May 2025 13:16:23 -0700 [thread overview]
Message-ID: <aBUoF8DyzqmiW5vk@google.com> (raw)
In-Reply-To: <20250501165730.69642099@gandalf.local.home>
On Thu, May 01, 2025 at 04:57:30PM -0400, Steven Rostedt wrote:
> On Thu, 1 May 2025 13:14:11 -0700
> Namhyung Kim <namhyung@kernel.org> wrote:
>
> > Hi Steve,
> >
> > On Wed, Apr 30, 2025 at 09:32:07PM -0400, Steven Rostedt wrote:
>
> > > To solve this, when a per CPU event is created that has defer_callchain
> > > attribute set, it will do a lookup from a global list
> > > (unwind_deferred_list), for a perf_unwind_deferred descriptor that has the
> > > id that matches the PID of the current task's group_leader.
> >
> > Nice, it'd work well with the perf tools at least.
>
> Cool!
>
>
>
> > > +static void perf_event_deferred_cpu(struct unwind_work *work,
> > > + struct unwind_stacktrace *trace, u64 cookie)
> > > +{
> > > + struct perf_unwind_deferred *defer =
> > > + container_of(work, struct perf_unwind_deferred, unwind_work);
> > > + struct perf_unwind_cpu *cpu_unwind;
> > > + struct perf_event *event;
> > > + int cpu;
> > > +
> > > + guard(rcu)();
> > > + guard(preempt)();
> > > +
> > > + cpu = smp_processor_id();
> > > + cpu_unwind = &defer->cpu_events[cpu];
> > > +
> > > + WRITE_ONCE(cpu_unwind->processing, 1);
> > > + /*
> > > + * Make sure the above is seen for the rcuwait in
> > > + * perf_remove_unwind_deferred() before iterating the loop.
> > > + */
> > > + smp_mb();
> > > +
> > > + list_for_each_entry_rcu(event, &cpu_unwind->list, unwind_list) {
> > > + perf_event_callchain_deferred(event, trace);
> > > + /* Only the first CPU event gets the trace */
> > > + break;
> >
> > I guess this is to emit a callchain record when more than one events
> > requested the deferred callchains for the same task like:
> >
> > $ perf record -a -e cycles,instructions
> >
> > right?
>
> Yeah. If perf assigns more than one per CPU event, we only need one of
> those events to record the deferred trace, not both of them.
>
> But I keep a link list so that if the program closes the first one and
> keeps the second active, this will still work, as the first one would be
> removed from the list, and the second one would pick up the tracing after
> that.
Makes sense.
>
> >
> >
> > > + }
> > > +
> > > + WRITE_ONCE(cpu_unwind->processing, 0);
> > > + rcuwait_wake_up(&cpu_unwind->pending_unwind_wait);
> > > +}
> > > +
> > > static void perf_free_addr_filters(struct perf_event *event);
> > >
> > > /* vs perf_event_alloc() error */
> > > @@ -8198,6 +8355,15 @@ static int deferred_request_nmi(struct perf_event *event)
> > > return 0;
> > > }
> > >
> > > +static int deferred_unwind_request(struct perf_unwind_deferred *defer)
> > > +{
> > > + u64 cookie;
> > > + int ret;
> > > +
> > > + ret = unwind_deferred_request(&defer->unwind_work, &cookie);
> > > + return ret < 0 ? ret : 0;
> > > +}
> > > +
> > > /*
> > > * Returns:
> > > * > 0 : if already queued.
> > > @@ -8210,11 +8376,14 @@ static int deferred_request(struct perf_event *event)
> > > int pending;
> > > int ret;
> > >
> > > - /* Only defer for task events */
> > > - if (!event->ctx->task)
> > > + if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
> > > return -EINVAL;
> > >
> > > - if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
> > > + if (event->unwind_deferred)
> > > + return deferred_unwind_request(event->unwind_deferred);
> > > +
> > > + /* Per CPU events should have had unwind_deferred set! */
> > > + if (WARN_ON_ONCE(!event->ctx->task))
> > > return -EINVAL;
> > >
> > > if (in_nmi())
> > > @@ -13100,13 +13269,20 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> > > }
> > > }
> > >
> > > + /* Setup unwind deferring for per CPU events */
> > > + if (event->attr.defer_callchain && !task) {
> >
> > As I said it should handle per-task and per-CPU events. How about this?
>
> Hmm, I just added some printk()s in this code, and it seems that perf
> record always did per CPU.
Right, that's the default behavior.
>
> But if an event is per CPU and per task, will it still only trace that
> task? It will never trace another task right?
Yes, the event can be inherited to a child but then child will create a
new event so each task will have its own events.
>
> Because the way this is currently implemented is that the event that
> requested the callback is the one that records it, even if it runs on
> another CPU:
>
> In defer_request_nmi():
>
> struct callback_head *work = &event->pending_unwind_work;
> int ret;
>
> if (event->pending_unwind_callback)
> return 1;
>
> ret = task_work_add(current, work, TWA_NMI_CURRENT);
> if (ret)
> return ret;
>
> event->pending_unwind_callback = 1;
>
> The task_work_add() adds the work from the event's pending_unwind_work.
>
> Now the callback will be:
>
> static void perf_event_deferred_task(struct callback_head *work)
> {
> struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
>
> // the above is the event that requested this. This may run on another CPU.
>
> struct unwind_stacktrace trace;
>
> if (!event->pending_unwind_callback)
> return;
>
> if (unwind_deferred_trace(&trace) >= 0) {
>
> /*
> * All accesses to the event must belong to the same implicit RCU
> * read-side critical section as the ->pending_unwind_callback reset.
> * See comment in perf_pending_unwind_sync().
> */
> guard(rcu)();
> perf_event_callchain_deferred(event, &trace);
>
> // The above records the stack trace to that event.
> // Again, this may happen on another CPU.
>
> }
>
> event->pending_unwind_callback = 0;
> local_dec(&event->ctx->nr_no_switch_fast);
> rcuwait_wake_up(&event->pending_unwind_wait);
> }
>
> Is the recording to an event from one CPU to another CPU an issue, if that
> event also is only tracing a task?
IIUC it should be fine as long as you use the unwind descriptor logic
like in the per-CPU case. The data should be written to the current
CPU's ring buffer for per-task and per-CPU events.
>
> >
> > if (event->attr.defer_callchain) {
> > if (event->cpu >= 0) {
> > err = perf_add_unwind_deferred(event);
> > if (err)
> > return ERR_PTR(err);
> > } else {
> > init_task_work(&event->pending_unwind_work,
> > perf_event_callchain_deferred,
> > perf_event_deferred_task);
> > }
> > }
> >
> > > + err = perf_add_unwind_deferred(event);
> > > + if (err)
> > > + return ERR_PTR(err);
> > > + }
> > > +
> > > err = security_perf_event_alloc(event);
> > > if (err)
> > > return ERR_PTR(err);
> > >
> > > if (event->attr.defer_callchain)
> > > init_task_work(&event->pending_unwind_work,
> > > - perf_event_callchain_deferred);
> > > + perf_event_deferred_task);
> >
> > And you can remove here.
>
> There's nothing wrong with always initializing it. It will just never be
> called.
Ok.
>
> What situation do we have where cpu is negative? What's the perf command?
> Is there one?
Yep, there's --per-thread option for just per-task events.
Thanks,
Namhyung
next prev parent reply other threads:[~2025-05-02 20:16 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-01 1:32 [PATCH v6 0/5] perf: Deferred unwinding of user space stack traces for per CPU events Steven Rostedt
2025-05-01 1:32 ` [PATCH v6 1/5] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
2025-05-01 1:32 ` [PATCH v6 2/5] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
2025-05-01 1:32 ` [PATCH v6 3/5] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
2025-05-01 1:32 ` [PATCH v6 4/5] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
2025-05-01 1:32 ` [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events Steven Rostedt
2025-05-01 20:14 ` Namhyung Kim
2025-05-01 20:57 ` Steven Rostedt
2025-05-02 20:16 ` Namhyung Kim [this message]
2025-05-01 20:20 ` Namhyung Kim
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aBUoF8DyzqmiW5vk@google.com \
--to=namhyung@kernel.org \
--cc=aahringo@redhat.com \
--cc=acme@kernel.org \
--cc=adrian.hunter@intel.com \
--cc=akpm@linux-foundation.org \
--cc=alexander.shishkin@linux.intel.com \
--cc=andrii.nakryiko@gmail.com \
--cc=beaub@linux.microsoft.com \
--cc=blakejones@google.com \
--cc=broonie@kernel.org \
--cc=fweimer@redhat.com \
--cc=indu.bhagat@oracle.com \
--cc=irogers@google.com \
--cc=jemarch@gnu.org \
--cc=jolsa@kernel.org \
--cc=jordalgo@meta.com \
--cc=jpoimboe@kernel.org \
--cc=jremus@linux.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-perf-users@vger.kernel.org \
--cc=linux-toolchains@vger.kernel.org \
--cc=linux-trace-kernel@vger.kernel.org \
--cc=luto@kernel.org \
--cc=mark.rutland@arm.com \
--cc=mathieu.desnoyers@efficios.com \
--cc=mhiramat@kernel.org \
--cc=mingo@kernel.org \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=sam@gentoo.org \
--cc=wnliu@google.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.