From: Josh Poimboeuf <jpoimboe@kernel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: x86@kernel.org, Peter Zijlstra <peterz@infradead.org>,
Steven Rostedt <rostedt@goodmis.org>,
Ingo Molnar <mingo@kernel.org>,
Arnaldo Carvalho de Melo <acme@kernel.org>,
linux-kernel@vger.kernel.org,
Indu Bhagat <indu.bhagat@oracle.com>,
Mark Rutland <mark.rutland@arm.com>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Jiri Olsa <jolsa@kernel.org>, Namhyung Kim <namhyung@kernel.org>,
Ian Rogers <irogers@google.com>,
Adrian Hunter <adrian.hunter@intel.com>,
linux-perf-users@vger.kernel.org, Mark Brown <broonie@kernel.org>,
linux-toolchains@vger.kernel.org, Jordan Rome <jordalgo@meta.com>,
Sam James <sam@gentoo.org>,
linux-trace-kernel@vger.kernel.org,
Andrii Nakryiko <andrii.nakryiko@gmail.com>,
Jens Remus <jremus@linux.ibm.com>,
Florian Weimer <fweimer@redhat.com>,
Andy Lutomirski <luto@kernel.org>,
Masami Hiramatsu <mhiramat@kernel.org>,
Weinan Liu <wnliu@google.com>
Subject: Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
Date: Wed, 22 Jan 2025 20:05:33 -0800 [thread overview]
Message-ID: <20250123040533.e7guez5drz7mk6es@jpoimboe> (raw)
In-Reply-To: <e92fc832-ef46-4863-9316-24f30d8d82fd@efficios.com>
On Wed, Jan 22, 2025 at 03:13:10PM -0500, Mathieu Desnoyers wrote:
> > +struct unwind_work {
> > + struct callback_head work;
> > + unwind_callback_t func;
> > + int pending;
> > +};
>
> This is a lot of information to keep around per instance.
>
> I'm not sure it would be OK to have a single unwind_work per perf-event
> for perf. I suspect it may need to be per perf-event X per-task if a
> perf-event can be associated to more than a single task (not sure ?).
For "perf record -g <command>", it seems to be one event per task.
Incidentally this is the mode where I did my perf testing :-/
But looking at it now, a global "perf record -g" appears to use one
event per CPU. So if a task requests an unwind and then schedules out
before returning to user space, any subsequent tasks trying to unwind on
that CPU would be blocked until the original task returned to user. So
yeah, that's definitely a problem.
Actually a per-CPU unwind_work descriptor could conceivably work if we
were able to unwind at schedule() time.
But Steve pointed out that wouldn't work so well if the task isn't in
RUNNING state.
However... would it be a horrible idea for 'next' to unwind 'prev' after
the context switch???
> For LTTng, we'd have to consider something similar because of multi-session
> support. Either we'd have one unwind_work per-session X per-task, or we'd
> need to multiplex this internally within LTTng-modules. None of this is
> ideal in terms of memory footprint.
>
> We should look at what part of this information can be made static/global
> and what part is task-local, so we minimize the amount of redundant data
> per-task (memory footprint).
>
> AFAIU, most of that unwind_work information is global:
>
> - work,
> - func,
>
> And could be registered dynamically by the tracer when it enables
> tracing with an interest on stack walking.
>
> At registration, we can allocate a descriptor ID (with a limited bounded
> max number, configurable). This would associate a work+func to a given
> ID, and keep track of this in a global table (indexed by ID).
>
> I suspect that the only thing we really want to keep track of per-task
> is the pending bit, and what is the ID of the unwind_work associated.
> This could be kept, per-task, in either:
>
> - a bitmap of pending bits, indexed by ID, or
> - an array of pending IDs.
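For concreteness, a user-space sketch of that split (all names here, like `unwind_descs`, `UNWIND_MAX_WORKS` and `task_unwind`, are hypothetical, not from the patches): registration hands out a bounded ID into a global table, and the only per-task state is a pending bitmap indexed by those IDs.

```c
/* User-space sketch of the proposed global/per-task split.  All names
 * are hypothetical, not from the patches. */
#include <assert.h>
#include <stddef.h>

#define UNWIND_MAX_WORKS 64		/* bounded, configurable max */

typedef void (*unwind_callback_t)(int id, void *task);

/* Global: one slot per registered tracer, indexed by ID. */
static unwind_callback_t unwind_descs[UNWIND_MAX_WORKS];

/* Per-task: only a pending bitmap, indexed by the same IDs. */
struct task_unwind {
	unsigned long long pending;	/* bit N set => ID N wants a callback */
};

/* Registration: hand out the first free ID, or -1 if the table is full. */
static int unwind_register(unwind_callback_t func)
{
	for (int id = 0; id < UNWIND_MAX_WORKS; id++) {
		if (!unwind_descs[id]) {
			unwind_descs[id] = func;
			return id;
		}
	}
	return -1;
}

/* Request: just set the task-local pending bit for this ID. */
static void unwind_request(struct task_unwind *t, int id)
{
	t->pending |= 1ULL << id;
}

/* "Task work": walk the pending bits and invoke the global callbacks. */
static void unwind_task_work(struct task_unwind *t, void *task)
{
	while (t->pending) {
		int id = __builtin_ctzll(t->pending);

		t->pending &= ~(1ULL << id);
		unwind_descs[id](id, task);
	}
}
```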
That's basically what I was doing before. The per-task state also had:
- 'struct callback_head work' for doing the task work. A single work
function was used to multiplex the callbacks, as opposed to the
current patches where each descriptor gets its own separate
task_work.
- 'void *privs[UNWIND_MAX_CALLBACKS]' opaque data pointers. Maybe
some callbacks don't need that, but perf needed it for the 'event'
pointer. For 32 max callbacks that's 256 bytes per task.
  - 'u64 last_cookies[UNWIND_MAX_CALLBACKS]' to prevent a callback from
    getting called twice.  But actually that may have been overkill; it
    should be fine to call the callback again with the cached stack
    trace.  The tracer could instead have its own policy for how to
    handle dupes.
  - 'unsigned int work_pending' to designate whether the task_work is
    pending.  Also probably not necessary; the pending bits could serve
    the same purpose.
So it had more concurrency to deal with, to handle the extra per-task
state.
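Sketched in plain C (field names approximate, reconstructed from the list above rather than taken from the actual code, and with `struct callback_head` stubbed out for user space), that old per-task state looked roughly like:

```c
#include <assert.h>
#include <stdint.h>

#define UNWIND_MAX_CALLBACKS 32

typedef uint64_t u64;

/* User-space stub for the kernel's struct callback_head. */
struct callback_head { void (*func)(struct callback_head *); };

/* Approximate shape of the old per-task unwind state described above. */
struct unwind_task_state {
	struct callback_head work;		/* single multiplexing task_work */
	void *privs[UNWIND_MAX_CALLBACKS];	/* per-callback opaque data */
	u64 last_cookies[UNWIND_MAX_CALLBACKS];	/* dedup: last cookie delivered */
	unsigned int work_pending;		/* is the task_work queued? */
	unsigned long pending;			/* per-callback pending bits */
};
```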
It also had a global array of callbacks, which used a mutex and SRCU to
coordinate between the register/unregister and the task work.
Another major issue was that it wasn't NMI-safe due to all the shared
state. So a tracer in NMI would have to schedule an IRQ to call
unwind_deferred_request(). Not only is that a pain for the tracers,
it's problematic in other ways:
- If the NMI occurred in schedule() with IRQs disabled, the IRQ would
actually interrupt the 'next' task. So the caller would have to
stash a 'task' pointer for the IRQ handler to read and pass to
unwind_deferred_request(). (similar to the task_work bug I found)
- Thus the deferred unwind interface would need to handle requests
from non-current, introducing a new set of concurrency issues.
- Also, while a tracer in NMI can unwind the kernel stack and send
that to a ring buffer immediately, it can't store the cookie along
with it, so there lie more tracer headaches.
Once I changed the interface to get rid of the global nastiness, all
those problems went away.
Of course that now introduces the new problem that each tracer (or
tracing event) needs some kind of per-task state. But otherwise this
new interface really simplifies things a *lot*.
Anyway, I don't have a good answer at the moment. Will marinate on it.
Maybe we could do something like allocate the unwind_work (or some
equivalent) on demand at the time of unwind request using GFP_NOWAIT or
GFP_ATOMIC or some such, then free it during the task work?
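One possible shape for that, sketched in user space with malloc() standing in for a non-sleeping kmalloc(GFP_NOWAIT); the descriptor and function names are made up for illustration, not from the patches:

```c
#include <assert.h>
#include <stdlib.h>

typedef void (*unwind_callback_t)(void *priv);

/* Hypothetical per-request descriptor, allocated at request time (in
 * the kernel this would be kmalloc(..., GFP_NOWAIT)) and freed once
 * the task work has delivered the callback. */
struct unwind_work {
	unwind_callback_t func;
	void *priv;
	struct unwind_work *next;
};

/* Queue a deferred unwind request; returns -1 if the non-sleeping
 * allocation fails, which callers would have to tolerate. */
static int unwind_deferred_request(struct unwind_work **head,
				   unwind_callback_t func, void *priv)
{
	struct unwind_work *w = malloc(sizeof(*w));	/* GFP_NOWAIT stand-in */

	if (!w)
		return -1;
	w->func = func;
	w->priv = priv;
	w->next = *head;
	*head = w;
	return 0;
}

/* Task work on return to user space: deliver callbacks, free descriptors. */
static void unwind_task_work(struct unwind_work **head)
{
	while (*head) {
		struct unwind_work *w = *head;

		*head = w->next;
		w->func(w->priv);
		free(w);
	}
}
```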
> Unregistration of unwind_work could iterate on all tasks and clear the
> pending bit or ID associated with the unregistered work, to make sure
> we don't trigger unrelated work after a re-use.
What the old unregister code did was to remove it from the global
callbacks array (with the careful use of mutex+SRCU to coordinate with
the task work). Then synchronize_srcu() before returning.
> > +/*
> > + * The context cookie is a unique identifier which allows post-processing to
> > + * correlate kernel trace(s) with user unwinds. The high 12 bits are the CPU
> > + * id; the lower 48 bits are a per-CPU entry counter.
> > + */
> > +static u64 ctx_to_cookie(u64 cpu, u64 ctx)
> > +{
> > + BUILD_BUG_ON(NR_CPUS > 65535);
>
> 2^12 = 4k, not 64k. Perhaps you mean to reserve 16 bits
> for cpu numbers ?
Yeah, here the code is right but the comment is wrong. It actually does
use 16 bits.
> > + return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
>
> Perhaps use ilog2(NR_CPUS) instead for the number of bits to use
> rather than hard code 12 ?
I'm thinking I'd rather keep it simple by hard-coding the # of bits, so
as to avoid any surprises caused by edge cases.
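For reference, here's the encode/decode with the comment matching what the code actually does (16 bits of CPU id, 48 bits of per-CPU counter), as a self-contained user-space sketch; the decode helpers are my own naming, not from the patches:

```c
#include <assert.h>
#include <stdint.h>

/* The context cookie: high 16 bits are the CPU id, low 48 bits a
 * per-CPU entry counter.  (The quoted comment said 12 bits; the code
 * uses 16, hence the BUILD_BUG_ON(NR_CPUS > 65535).) */
static uint64_t ctx_to_cookie(uint64_t cpu, uint64_t ctx)
{
	assert(cpu < 65536);	/* stand-in for BUILD_BUG_ON(NR_CPUS > 65535) */
	return (ctx & ((1ULL << 48) - 1)) | (cpu << 48);
}

static uint64_t cookie_to_cpu(uint64_t cookie)
{
	return cookie >> 48;
}

static uint64_t cookie_to_ctx(uint64_t cookie)
{
	return cookie & ((1ULL << 48) - 1);
}
```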
--
Josh