* [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat @ 2024-01-19 16:39 Ben Gainey 2024-01-19 16:39 ` [PATCH 1/1] perf: " Ben Gainey ` (2 more replies) 0 siblings, 3 replies; 7+ messages in thread From: Ben Gainey @ 2024-01-19 16:39 UTC (permalink / raw) To: linux-perf-users, linux-kernel Cc: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, irogers, adrian.hunter, james.clark, Ben Gainey This change allows events to use PERF_SAMPLE READ with inherit so long as both inherit_stat and PERF_SAMPLE_TID are set. Currently it is not possible to use PERF_SAMPLE_READ with inherit. This restriction assumes the user is interested in collecting aggregate statistics as per `perf stat`. It prevents a user from collecting per-thread samples using counter groups from a multi-threaded or multi-process application, as with `perf record -e '{....}:S'`. Instead users must use system-wide mode, or forgo the ability to sample counter groups. System-wide mode is often problematic as it requires specific permissions (no CAP_PERFMON / root access), or may lead to capture of significant amounts of extra data from other processes running on the system. Perf already supports the ability to collect per-thread counts with `inherit` via the `inherit_stat` flag. This patch changes `perf_event_alloc` relaxing the restriction to combine `inherit` with `PERF_SAMPLE_READ` so that the combination will be allowed so long as `inherit_stat` and `PERF_SAMPLE_TID` are enabled. In this configuration stream ids (such as may appear in the read_format field of a PERF_RECORD_SAMPLE) are no longer globally unique, rather the pair of (stream id, tid) uniquely identify each event. Tools that rely on this, for example to calculate a delta between samples, would need updating to take this into account. Previously valid event configurations (system-wide, no-inherit and so on) where each stream id is the identifier are unaffected. This patch has been tested on aarch64 both my manual inspection of the output of `perf script -D` and through a modified version of Arm's commercial profiling tools and the numbers appear to line up as one would expect, but some further validation across other architectures and/or edge cases would be welcome. This patch was developed and tested on top of v6.7. Ben Gainey (1): perf: Support PERF_SAMPLE_READ with inherit_stat kernel/events/core.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) -- 2.43.0 ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 1/1] perf: Support PERF_SAMPLE_READ with inherit_stat 2024-01-19 16:39 [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat Ben Gainey @ 2024-01-19 16:39 ` Ben Gainey 2024-01-19 17:45 ` [PATCH 0/1] " Andi Kleen 2024-01-20 0:49 ` Namhyung Kim 2 siblings, 0 replies; 7+ messages in thread From: Ben Gainey @ 2024-01-19 16:39 UTC (permalink / raw) To: linux-perf-users, linux-kernel Cc: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, irogers, adrian.hunter, james.clark, Ben Gainey This change allows events to use PERF_SAMPLE READ with inherit so long as both inherit_stat and PERF_SAMPLE_TID are set. In this configuration, and event will be inherited into any child processes / threads, allowing convenient profiling of a multi-process or multi-threaded application, whilst allowing profiling tools to collect per-thread samples, in particular of groups of counters. Signed-off-by: Ben Gainey <ben.gainey@arm.com> --- kernel/events/core.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 9efd0d7775e7c..4b603463d888f 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -11988,10 +11988,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, local64_set(&hwc->period_left, hwc->sample_period); /* - * We currently do not support PERF_SAMPLE_READ on inherited events. + * We do not support PERF_SAMPLE_READ on inherited events unless + * inherit_stat and PERF_SAMPLE_TID are also selected, which allows + * inherited events to collect per-thread samples. * See perf_output_read(). */ - if (attr->inherit && (attr->sample_type & PERF_SAMPLE_READ)) + if (attr->inherit && (attr->sample_type & PERF_SAMPLE_READ) + && !(attr->inherit_stat && (attr->sample_type & PERF_SAMPLE_TID))) goto err_ns; if (!has_branch_stack(event)) -- 2.43.0 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat 2024-01-19 16:39 [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat Ben Gainey 2024-01-19 16:39 ` [PATCH 1/1] perf: " Ben Gainey @ 2024-01-19 17:45 ` Andi Kleen 2024-01-19 18:08 ` Ben Gainey 2024-01-20 0:49 ` Namhyung Kim 2 siblings, 1 reply; 7+ messages in thread From: Andi Kleen @ 2024-01-19 17:45 UTC (permalink / raw) To: Ben Gainey Cc: linux-perf-users, linux-kernel, peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, irogers, adrian.hunter, james.clark Ben Gainey <ben.gainey@arm.com> writes: > In this configuration stream ids (such as may appear in the read_format > field of a PERF_RECORD_SAMPLE) are no longer globally unique, rather > the pair of (stream id, tid) uniquely identify each event. Tools that > rely on this, for example to calculate a delta between samples, would > need updating to take this into account. Previously valid event > configurations (system-wide, no-inherit and so on) where each stream id > is the identifier are unaffected. So is this an ABI break? It might need an optin, if it breaks anything, which wouldn't surprise me. We do have a lot of different perf stream parsers around these days and we cannot break them. -Andi ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat 2024-01-19 17:45 ` [PATCH 0/1] " Andi Kleen @ 2024-01-19 18:08 ` Ben Gainey 2024-01-19 22:55 ` Andi Kleen 0 siblings, 1 reply; 7+ messages in thread From: Ben Gainey @ 2024-01-19 18:08 UTC (permalink / raw) To: ak@linux.intel.com Cc: irogers@google.com, alexander.shishkin@linux.intel.com, James Clark, peterz@infradead.org, Mark Rutland, linux-perf-users@vger.kernel.org, mingo@redhat.com, acme@kernel.org, linux-kernel@vger.kernel.org, jolsa@kernel.org, namhyung@kernel.org, adrian.hunter@intel.com On Fri, 2024-01-19 at 09:45 -0800, Andi Kleen wrote: > Ben Gainey <ben.gainey@arm.com> writes: > > > In this configuration stream ids (such as may appear in the > > read_format > > field of a PERF_RECORD_SAMPLE) are no longer globally unique, > > rather > > the pair of (stream id, tid) uniquely identify each event. Tools > > that > > rely on this, for example to calculate a delta between samples, > > would > > need updating to take this into account. Previously valid event > > configurations (system-wide, no-inherit and so on) where each > > stream id > > is the identifier are unaffected. > > So is this an ABI break? It might need an optin, if it breaks > anything, > which wouldn't surprise me. We do have a lot of different perf stream > parsers around these days and we cannot break them. > > -Andi I had considered that, but given currently this perf_event_attr configuration is not allowed, I assumed that it would require existing tools to add support which would in effect be an opt-in. Of course, adding a new flag to be explicit would be trivial enough if required. That said, the binary format for the mmap records / read() etc does not change so using an unmodified tool to parse the data file will give bad results. Therefore any workflow where "modified recording tool" can be combined with "older / unmodified parsing tool" will break. Not sure of the best way to handle this... presumably whenever a change is made to the perf record format, any workflow that allows old parsers to read new format data without version checks could fail? Admittedly this is a "looks the same but isn't" change so harder for tools devs to spot. Any suggestions? For the perf tools, is there a means to record in perf.data a minimum supported tool version / feature incompatibility flags? Regards Ben IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat 2024-01-19 18:08 ` Ben Gainey @ 2024-01-19 22:55 ` Andi Kleen 0 siblings, 0 replies; 7+ messages in thread From: Andi Kleen @ 2024-01-19 22:55 UTC (permalink / raw) To: Ben Gainey Cc: irogers@google.com, alexander.shishkin@linux.intel.com, James Clark, peterz@infradead.org, Mark Rutland, linux-perf-users@vger.kernel.org, mingo@redhat.com, acme@kernel.org, linux-kernel@vger.kernel.org, jolsa@kernel.org, namhyung@kernel.org, adrian.hunter@intel.com > I had considered that, but given currently this perf_event_attr > configuration is not allowed, I assumed that it would require existing > tools to add support which would in effect be an opt-in. Of course, > adding a new flag to be explicit would be trivial enough if required. That's fair. Makes sense. > That said, the binary format for the mmap records / read() etc does not > change so using an unmodified tool to parse the data file will give bad > results. Therefore any workflow where "modified recording tool" can be > combined with "older / unmodified parsing tool" will break. Not sure of > the best way to handle this... presumably whenever a change is made to > the perf record format, any workflow that allows old parsers to read > new format data without version checks could fail? Admittedly this is a > "looks the same but isn't" change so harder for tools devs to spot. Any > suggestions? For perf itself we can find something. It does a couple of checks, like reserved bits in the perf_event_attr. For the general case of other parsers it's unclear. I suppose could increment the magic identifier to PERFILE3 -Andi ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat 2024-01-19 16:39 [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat Ben Gainey 2024-01-19 16:39 ` [PATCH 1/1] perf: " Ben Gainey 2024-01-19 17:45 ` [PATCH 0/1] " Andi Kleen @ 2024-01-20 0:49 ` Namhyung Kim 2024-01-20 16:14 ` Ben Gainey 2 siblings, 1 reply; 7+ messages in thread From: Namhyung Kim @ 2024-01-20 0:49 UTC (permalink / raw) To: Ben Gainey Cc: linux-perf-users, linux-kernel, peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark Hello, On Fri, Jan 19, 2024 at 8:39 AM Ben Gainey <ben.gainey@arm.com> wrote: > > This change allows events to use PERF_SAMPLE READ with inherit so long > as both inherit_stat and PERF_SAMPLE_TID are set. > > Currently it is not possible to use PERF_SAMPLE_READ with inherit. This > restriction assumes the user is interested in collecting aggregate > statistics as per `perf stat`. It prevents a user from collecting > per-thread samples using counter groups from a multi-threaded or > multi-process application, as with `perf record -e '{....}:S'`. Instead > users must use system-wide mode, or forgo the ability to sample counter > groups. System-wide mode is often problematic as it requires specific > permissions (no CAP_PERFMON / root access), or may lead to capture of > significant amounts of extra data from other processes running on the > system. > > Perf already supports the ability to collect per-thread counts with > `inherit` via the `inherit_stat` flag. This patch changes > `perf_event_alloc` relaxing the restriction to combine `inherit` with > `PERF_SAMPLE_READ` so that the combination will be allowed so long as > `inherit_stat` and `PERF_SAMPLE_TID` are enabled. I'm not sure if it's correct. Maybe I misunderstand inherit_stat but AFAIK it's just to use prev_task's events when next_task has the compatible event context. So the event values it sees in samples would depend on the timing or scheduler behavior. Also event counts and time values PERF_SAMPLE_READ sees include child event's so the values of the parent event can be updated even if it's inactive. And the values will vary for the next_task whether prev_task is the parent or not. I think it would return consistent values only if it iterates all child events and sums up the values like it does for read(2). But it cannot do that in the NMI handler. Frankly I don't understand how inherit_stat supports per-thread counts properly. Also it doesn't seem to be used by default in the perf tools. IIUC per-thread count is supported when you don't set the inherit bit and open separate events for each thread but I guess that's not what you want. Anyway, I'm ok with the idea of using PERF_SAMPLE_READ to improve per-thread profiling especially with event groups. But I think it should not use inherit_stat and it needs a way to not include child stats in the samples. What do you think? Thanks, Namhyung > > In this configuration stream ids (such as may appear in the read_format > field of a PERF_RECORD_SAMPLE) are no longer globally unique, rather > the pair of (stream id, tid) uniquely identify each event. Tools that > rely on this, for example to calculate a delta between samples, would > need updating to take this into account. Previously valid event > configurations (system-wide, no-inherit and so on) where each stream id > is the identifier are unaffected. > > This patch has been tested on aarch64 both my manual inspection of the > output of `perf script -D` and through a modified version of Arm's > commercial profiling tools and the numbers appear to line up as one > would expect, but some further validation across other architectures > and/or edge cases would be welcome. > > This patch was developed and tested on top of v6.7. > > > Ben Gainey (1): > perf: Support PERF_SAMPLE_READ with inherit_stat > > kernel/events/core.c | 7 +++++-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > -- > 2.43.0 > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat 2024-01-20 0:49 ` Namhyung Kim @ 2024-01-20 16:14 ` Ben Gainey 0 siblings, 0 replies; 7+ messages in thread From: Ben Gainey @ 2024-01-20 16:14 UTC (permalink / raw) To: namhyung@kernel.org Cc: alexander.shishkin@linux.intel.com, James Clark, peterz@infradead.org, Mark Rutland, linux-perf-users@vger.kernel.org, mingo@redhat.com, acme@kernel.org, linux-kernel@vger.kernel.org, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com On Fri, 2024-01-19 at 16:49 -0800, Namhyung Kim wrote: > Hello, > > On Fri, Jan 19, 2024 at 8:39 AM Ben Gainey <ben.gainey@arm.com> > wrote: > > > > This change allows events to use PERF_SAMPLE READ with inherit so > > long > > as both inherit_stat and PERF_SAMPLE_TID are set. > > > > Currently it is not possible to use PERF_SAMPLE_READ with inherit. > > This > > restriction assumes the user is interested in collecting aggregate > > statistics as per `perf stat`. It prevents a user from collecting > > per-thread samples using counter groups from a multi-threaded or > > multi-process application, as with `perf record -e '{....}:S'`. > > Instead > > users must use system-wide mode, or forgo the ability to sample > > counter > > groups. System-wide mode is often problematic as it requires > > specific > > permissions (no CAP_PERFMON / root access), or may lead to capture > > of > > significant amounts of extra data from other processes running on > > the > > system. > > > > Perf already supports the ability to collect per-thread counts with > > `inherit` via the `inherit_stat` flag. This patch changes > > `perf_event_alloc` relaxing the restriction to combine `inherit` > > with > > `PERF_SAMPLE_READ` so that the combination will be allowed so long > > as > > `inherit_stat` and `PERF_SAMPLE_TID` are enabled. > > I'm not sure if it's correct. Maybe I misunderstand inherit_stat but > AFAIK it's just to use prev_task's events when next_task has the > compatible event context. So the event values it sees in samples > would depend on the timing or scheduler behavior. I think you are referring to __perf_event_sync_stat as called from perf_event_context_sched_out, but isn't the point that perf_event_context_sched_out will just swap the tasks and RCU pointers as an optmisation when the incoming context is the same as the outgoing context (rather than stopping and restarting), and so when inherit_stat is used, along with swaping the outgong and incoming task data in the event, it also reads and then swaps the outgoing and incoming counter values in the event so that the counter values that belong to the outgoing event are correctly associated to that task. Looking again at perf_event_context_sched_out though, one thing i note is that the call to perf_event_sync_stat happens after the call to perf_ctx_enable, which I think means two things: * The pmu->read call in the sync will be operating on an actively counting PMU, so the counts recorded in each event in the context may be skewed relative to each other. This is not the case elsewhere so that the events on the PMU are read in the disabled state. * perc_ctx_enable (at least for arm_pmu) will not reload the the incoming "remaining" period for any sampling events so its possible that an overflow would happen sooner than expected in the incoming context (though i don't think this will leave a corrupted count, just a smaller than expected count for that sample). > > Also event counts and time values PERF_SAMPLE_READ sees > include child event's so the values of the parent event can be > updated even if it's inactive. And the values will vary for the > next_task whether prev_task is the parent or not. I think it > would return consistent values only if it iterates all child events > and sums up the values like it does for read(2). But it cannot > do that in the NMI handler. > > Frankly I don't understand how inherit_stat supports per-thread > counts properly. Also it doesn't seem to be used by default in > the perf tools. Hmmm, ok rereading through core.c, i think the thing I have missed is the interaction between perf_event_count and sync_child_event, which I guess I missed when testing this because IIRC sync_child_event is only called on migration, hotplug and task exit events. I don't think it would be sensible to skip the update to child_count in sync_child_event as this would break the behaviour of the `read()` on an event's fd. I suppose the better thing to do would be to have perf_output_read_group/_one avoid reading child_count for these kinds of events. That would ensure the mmap sample is correct. > IIUC per-thread count is supported when you > don't set the inherit bit and open separate events for each > thread but I guess that's not what you want. You can take that approach... but I don't think it is pleasent :-). If you have an application that spawns threads at runtime the tool must somehow track them (perhaps by polling proc or using ptrace). Polling is not ideal as it can miss things or introduce high overhead in the tool. More importantly, creating a new event per thread can consume a lot of file descriptors, particularly if you open per-thread-per-cpu events to have them write to the same mmap. FWIW I recently prototyped a version of this in Arm's profiler tools, where we are also prototyping support for per-function metrics... On something like a Gravaton 3, having 64 cores, where the full set of published metrics for Neoverse-V1 contains something like 60 PMU events, tracing a database application that creates a new thread per connection, it spawned ~30 threads... the tool tried to create ~100k perf events... hit ulimit and terminated. I realize this annecdote is on the extreme end of things, but it is possible this would be a common issue for large core-count server systems. > > Anyway, I'm ok with the idea of using PERF_SAMPLE_READ to > improve per-thread profiling especially with event groups. > But I think it should not use inherit_stat and it needs a way to > not include child stats in the samples. > > What do you think? I'm happy to find a different way to opt in to this instead of `inherit_stat` if people prefer. Though I think if my understanding of __perf_event_sync_stat described above is correct, then so long as i fix the behaviour of perf_output_read_group/_one then I think the sampled counts would be correct. I can look at adding a new flag bit to opt in... this would probably also eleviate some of Andi's concerns in the other part of this thread. Open to suggestions otherwise... Thanks Ben > > Thanks, > Namhyung > > > > > In this configuration stream ids (such as may appear in the > > read_format > > field of a PERF_RECORD_SAMPLE) are no longer globally unique, > > rather > > the pair of (stream id, tid) uniquely identify each event. Tools > > that > > rely on this, for example to calculate a delta between samples, > > would > > need updating to take this into account. Previously valid event > > configurations (system-wide, no-inherit and so on) where each > > stream id > > is the identifier are unaffected. > > > > This patch has been tested on aarch64 both my manual inspection of > > the > > output of `perf script -D` and through a modified version of Arm's > > commercial profiling tools and the numbers appear to line up as one > > would expect, but some further validation across other > > architectures > > and/or edge cases would be welcome. > > > > This patch was developed and tested on top of v6.7. > > > > > > Ben Gainey (1): > > perf: Support PERF_SAMPLE_READ with inherit_stat > > > > kernel/events/core.c | 7 +++++-- > > 1 file changed, 5 insertions(+), 2 deletions(-) > > > > -- > > 2.43.0 > > IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2024-01-20 16:15 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-01-19 16:39 [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat Ben Gainey 2024-01-19 16:39 ` [PATCH 1/1] perf: " Ben Gainey 2024-01-19 17:45 ` [PATCH 0/1] " Andi Kleen 2024-01-19 18:08 ` Ben Gainey 2024-01-19 22:55 ` Andi Kleen 2024-01-20 0:49 ` Namhyung Kim 2024-01-20 16:14 ` Ben Gainey
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).